A Closer Look at Multimodal Representation Collapse

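Summary

We aim to build a fundamental understanding of modality collapse, a recently observed empirical phenomenon in which models trained for multimodal fusion come to rely on only a subset of the modalities and ignore the rest.
We show that modality collapse occurs when noisy features from one modality become entangled, through a shared set of neurons in the fusion head, with predictive features from another modality. This masks the positive contribution of the first modality's own predictive features and causes that modality to collapse.
We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, which denoises the fusion-head outputs without harming the predictive features of either modality.
Building on these findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications to handling missing modalities.
Extensive experiments on multiple multimodal benchmarks validate our theoretical claims.
Project page: https://abhrac.github.io/mmcollapse/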
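
To make the notion of "relying on only a subset of the modalities" concrete, the following is a minimal diagnostic sketch. It is not the authors' algorithm and does not implement basis reallocation or knowledge distillation. It assumes a hypothetical linear fusion head over two synthetic modalities (`modality_a`, `modality_b`) and measures how much accuracy is lost when each modality is zeroed out. The function name `modality_reliance`, the data, and the weights are all illustrative assumptions.

```python
import numpy as np


def modality_reliance(features, weights, labels):
    """Return, per modality, the accuracy lost when that modality is zeroed out."""

    def accuracy(active):
        # Linear fusion head (an assumption): sum of per-modality logits.
        logits = np.zeros(len(labels))
        for name in active:
            logits += features[name] @ weights[name]
        return float(np.mean((logits > 0).astype(int) == labels))

    full = accuracy(list(features))
    return {
        name: full - accuracy([m for m in features if m != name])
        for name in features
    }


# Synthetic data: both modalities carry the same class signal plus noise.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
signal = (2 * labels - 1)[:, None]  # +1 / -1 per class

features = {
    "modality_a": signal + 0.5 * rng.standard_normal((200, 4)),
    "modality_b": signal + 0.5 * rng.standard_normal((200, 4)),
}

# A hypothetical "collapsed" head: it reads modality_a and ignores modality_b,
# even though modality_b is just as predictive.
weights = {
    "modality_a": np.ones(4),
    "modality_b": np.zeros(4),
}

reliance = modality_reliance(features, weights, labels)
print(reliance)
```

In this toy setup, removing `modality_b` leaves accuracy unchanged while removing `modality_a` reduces predictions to a constant, which is the ablation signature one would expect from a collapsed fusion model. Real fusion heads are nonlinear, so an ablation like this is only a rough proxy for the entanglement analysis the paper describes.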

Summary (original)

We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.

arXiv information

Authors	Abhra Chaudhuri, Anjan Dutta, Tu Bui, Serban Georgescu
Published	2025-05-28 15:31:53+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
