Unraveling Cross-Modality Knowledge Conflicts in Large Vision-Language Models

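Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs.
However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies between the knowledge represented in their vision and language components.
This paper formally defines the problem of $\textbf{cross-modality parametric knowledge conflict}$ and presents a systematic approach to detecting, interpreting, and mitigating such conflicts.
The authors introduce a pipeline that identifies conflicts between visual and textual answers; it shows a persistently high conflict rate across modalities in recent LVLMs, regardless of model size.
They further investigate how these conflicts interfere with the inference process and propose a contrastive metric to distinguish conflicting samples from the rest.
Building on these insights, they develop a novel dynamic contrastive decoding method that, based on answer confidence, removes undesirable logits inferred from the less confident modality component.
For models that do not expose logits, they also introduce two prompt-based strategies to mitigate the conflicts.
The methods achieve promising accuracy improvements on both the ViQuAE and InfoSeek datasets.
Specifically, with LLaVA-34B, the proposed dynamic contrastive decoding improves average accuracy by 2.24%.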
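
The two ideas summarized in this abstract, measuring how often visual and textual answers disagree and suppressing logits from the less confident modality, can be illustrated with a small Python sketch. This is not the authors' implementation. The exact-match comparison in `conflict_rate`, the use of the maximum softmax probability as "confidence", the confidence-gap weighting, and the parameter `alpha` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conflict_rate(visual_answers, textual_answers):
    """Fraction of samples whose visual and textual answers disagree.

    Assumption: answers are compared by case-insensitive exact match.
    The paper's pipeline may use a different comparison.
    """
    if len(visual_answers) != len(textual_answers):
        raise ValueError("answer lists must have the same length")
    if not visual_answers:
        return 0.0
    conflicts = sum(
        v.strip().lower() != t.strip().lower()
        for v, t in zip(visual_answers, textual_answers)
    )
    return conflicts / len(visual_answers)


def dynamic_contrastive_logits(visual_logits, textual_logits, alpha=1.0):
    """Contrast the more confident modality against the less confident one.

    Assumptions (hypothetical, for illustration only):
    - confidence is the maximum softmax probability of each modality;
    - the penalty weight is alpha times the confidence gap, so the
      adjustment vanishes when both modalities are equally confident.
    """
    conf_v = max(softmax(visual_logits))
    conf_t = max(softmax(textual_logits))
    if conf_v >= conf_t:
        strong, weak, gap = visual_logits, textual_logits, conf_v - conf_t
    else:
        strong, weak, gap = textual_logits, visual_logits, conf_t - conf_v
    weight = alpha * gap
    # Subtract the scaled logits of the less confident modality.
    return [s - weight * w for s, w in zip(strong, weak)]


if __name__ == "__main__":
    print(conflict_rate(["Paris", "Rome"], ["paris ", "Milan"]))
    adjusted = dynamic_contrastive_logits([3.0, 1.0, 0.0], [0.0, 0.5, 0.4])
    print(adjusted.index(max(adjusted)))
```

In this sketch the two logit vectors are assumed to cover the same candidate tokens, one obtained when the model answers from the image and one when it answers from a textual description of the same entity. Scaling the penalty by the confidence gap is one simple way to make the contrast "dynamic"; other gating schemes would fit the abstract's description equally well.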

Abstract (original)

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies of represented knowledge between their vision and language components. In this paper, we formally define the problem of $\textbf{cross-modality parametric knowledge conflict}$ and present a systematic approach to detect, interpret, and mitigate them. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of the model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to discern the conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves an average accuracy of 2.24%.

arXiv information

Authors	Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, Muhao Chen
Published	2024-10-11 15:07:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
