Shapley Value-based Contrastive Alignment for Multimodal Information Extraction

Abstract

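The rise of social media and the rapid growth of multimodal communication call for advanced techniques for Multimodal Information Extraction (MIE).
However, existing methods rely mainly on direct image-text interaction, a paradigm that often struggles because of the semantic and modality gaps between images and text.
This paper introduces a new paradigm of image-context-text interaction, in which large multimodal models (LMMs) generate descriptive textual context to bridge these gaps.
In line with this paradigm, the authors propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method, which aligns both context-text and context-image pairs.
Shap-CA first applies the Shapley value concept from cooperative game theory to assess the individual contribution of each element in the set of contexts, texts, and images to the total semantic and modality overlap.
Following this quantitative evaluation, a contrastive learning strategy strengthens the interactive contribution within each context-text or context-image pair while minimizing the influence across pairs.
In addition, an adaptive fusion module is designed for selective cross-modal fusion.
Extensive experiments on four MIE datasets show that the method significantly outperforms existing state-of-the-art methods.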
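
The abstract does not give the value function that Shap-CA uses, so the following is only a minimal sketch of the general Shapley value computation it builds on. The players (`context`, `text`, `image`), the payoff table `TOY_OVERLAP`, and the helper names are all hypothetical and chosen for illustration; the numbers are not from the paper.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a small cooperative game.

    players: list of hashable player labels.
    value:   function mapping a frozenset of players to a payoff.
    """
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):  # k = size of the coalition that p joins
            # Probability that exactly this coalition precedes p
            # in a uniformly random ordering of the players.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = frozenset(coalition)
                # Marginal contribution of p to coalition s.
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


# Hypothetical "overlap" payoff for every coalition (illustrative only).
TOY_OVERLAP = {
    frozenset(): 0.0,
    frozenset({"context"}): 0.2,
    frozenset({"text"}): 0.1,
    frozenset({"image"}): 0.1,
    frozenset({"context", "text"}): 0.6,
    frozenset({"context", "image"}): 0.5,
    frozenset({"text", "image"}): 0.2,
    frozenset({"context", "text", "image"}): 1.0,
}


def toy_value(coalition):
    return TOY_OVERLAP[frozenset(coalition)]


phi = shapley_values(["context", "text", "image"], toy_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.4f}")
```

In this toy game the contributions sum to the payoff of the full coalition, which is the efficiency property of the Shapley value. Exact enumeration grows exponentially with the number of players, so a practical system with many elements would likely need an approximation; how Shap-CA handles this is not stated in the abstract.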

Abstract (original)

The rise of social media and the exponential growth of multimodal communication necessitates advanced techniques for Multimodal Information Extraction (MIE). However, existing methodologies primarily rely on direct Image-Text interactions, a paradigm that often faces significant challenges due to semantic and modality gaps between images and text. In this paper, we introduce a new paradigm of Image-Context-Text interaction, where large multimodal models (LMMs) are utilized to generate descriptive textual context to bridge these gaps. In line with this paradigm, we propose a novel Shapley Value-based Contrastive Alignment (Shap-CA) method, which aligns both context-text and context-image pairs. Shap-CA initially applies the Shapley value concept from cooperative game theory to assess the individual contribution of each element in the set of contexts, texts and images towards total semantic and modality overlaps. Following this quantitative evaluation, a contrastive learning strategy is employed to enhance the interactive contribution within context-text/image pairs, while minimizing the influence across these pairs. Furthermore, we design an adaptive fusion module for selective cross-modal fusion. Extensive experiments across four MIE datasets demonstrate that our method significantly outperforms existing state-of-the-art methods.

arXiv information

Authors	Wen Luo, Yu Xia, Shen Tianshu, Sujian Li
Published	2024-07-25 08:15:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
