Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

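Summary

Deep multimodal learning has achieved remarkable success by using contrastive learning to capture explicit one-to-one relations between modalities.
However, real-world data often exhibits shared relations that go beyond simple pairwise associations.
We propose M3CoL, a Multimodal Mixup Contrastive Learning approach that captures the nuanced shared relations inherent in multimodal data.
Our main contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with the corresponding samples from the other modalities, thereby capturing the shared relations between them.
For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by the proposed Mixup-based contrastive loss.
Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains.
It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, and achieves comparable performance on Food-101.
Our work highlights the importance of learning shared relations for robust multimodal learning and opens up promising avenues for future research.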

Summary (original)

Deep multimodal learning has shown remarkable success by leveraging contrastive learning to capture explicit one-to-one relations across modalities. However, real-world data often exhibits shared relations beyond simple pairwise associations. We propose M3CoL, a Multimodal Mixup Contrastive Learning approach to capture nuanced shared relations inherent in multimodal data. Our key contribution is a Mixup-based contrastive loss that learns robust representations by aligning mixed samples from one modality with their corresponding samples from other modalities thereby capturing shared relations between them. For multimodal classification tasks, we introduce a framework that integrates a fusion module with unimodal prediction modules for auxiliary supervision during training, complemented by our proposed Mixup-based contrastive loss. Through extensive experiments on diverse datasets (N24News, ROSMAP, BRCA, and Food-101), we demonstrate that M3CoL effectively captures shared multimodal relations and generalizes across domains. It outperforms state-of-the-art methods on N24News, ROSMAP, and BRCA, while achieving comparable performance on Food-101. Our work highlights the significance of learning shared relations for robust multimodal learning, opening up promising avenues for future research.
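
The abstract describes the Mixup-based contrastive loss only in words, so the following is a minimal sketch of one way such a loss could look, not the authors' formulation. It assumes two modalities with paired embeddings, a single mixing coefficient `lam`, a pairing list `perm`, a temperature of 0.1, and a soft-target cross-entropy over cosine similarities; the function and parameter names are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mixup_contrastive_loss(emb_a, emb_b, lam, perm, temperature=0.1):
    """Illustrative Mixup-style contrastive loss (an assumption, not the paper's exact loss).

    Sample i of modality A is mixed with sample perm[i] of modality A.
    The mixed embedding is compared against every sample of modality B,
    and the softmax over those similarities is pushed towards a soft target:
    weight lam on B's sample i and weight (1 - lam) on B's sample perm[i].
    """
    emb_b = [normalize(v) for v in emb_b]
    total = 0.0
    for i, j in enumerate(perm):
        mixed = normalize(
            [lam * x + (1 - lam) * y for x, y in zip(emb_a[i], emb_a[j])]
        )
        logits = [dot(mixed, b) / temperature for b in emb_b]
        # Numerically stable log-softmax over the modality-B candidates.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        log_p = [l - log_z for l in logits]
        total -= lam * log_p[i] + (1 - lam) * log_p[j]
    return total / len(perm)


# Toy paired embeddings for two modalities (made-up values).
emb_a = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
emb_b = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
loss = mixup_contrastive_loss(emb_a, emb_b, lam=0.7, perm=[1, 2, 0])
```

Because the target is split between two modality-B samples, the mixed sample is pulled towards both of its sources rather than a single positive, which is one reading of "shared relations beyond simple pairwise associations". In a framework like the one the abstract outlines, a term of this kind would be added to the classification losses of the fusion and unimodal prediction modules; how the terms are weighted is not stated in the abstract.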

arXiv information

Authors	Raja Kumar, Raghav Singhal, Pranamya Kulkarni, Deval Mehta, Kshitij Jadhav
Published	2024-10-18 16:31:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
