Unraveling Cross-Modality Knowledge Conflict in Large Vision-Language Models

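Summary

Large vision-language models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies between the knowledge represented in their vision and language components. In this paper, we formally define the problem of $\textbf{cross-modality parametric knowledge conflict}$ and present a systematic approach to detecting, interpreting, and mitigating it. We introduce a pipeline that identifies conflicts between visual answers and textual answers, and show that the cross-modality conflict rate remains persistently high in recent LVLMs regardless of model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric that distinguishes conflicting samples from the rest. Building on these insights, we develop a novel dynamic contrastive decoding method that, based on answer confidence, removes undesirable logits inferred from the less confident modality component. For models that do not expose logits, we also introduce two prompt-based strategies for mitigating the conflicts. Our methods achieve promising accuracy improvements on both the ViQuAE and InfoSeek datasets. Specifically, with LLaVA-34B, the proposed dynamic contrastive decoding improves accuracy by 2.24% on average.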

Abstract (original)

Large Vision-Language Models (LVLMs) have demonstrated impressive capabilities for capturing and reasoning over multimodal inputs. However, these models are prone to parametric knowledge conflicts, which arise from inconsistencies of represented knowledge between their vision and language components. In this paper, we formally define the problem of $\textbf{cross-modality parametric knowledge conflict}$ and present a systematic approach to detect, interpret, and mitigate them. We introduce a pipeline that identifies conflicts between visual and textual answers, showing a persistently high conflict rate across modalities in recent LVLMs regardless of the model size. We further investigate how these conflicts interfere with the inference process and propose a contrastive metric to discern the conflicting samples from the others. Building on these insights, we develop a novel dynamic contrastive decoding method that removes undesirable logits inferred from the less confident modality components based on answer confidence. For models that do not provide logits, we also introduce two prompt-based strategies to mitigate the conflicts. Our methods achieve promising improvements in accuracy on both the ViQuAE and InfoSeek datasets. Specifically, using LLaVA-34B, our proposed dynamic contrastive decoding improves an average accuracy of 2.24%.
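The two ideas named in the abstract, measuring how often visual and textual answers disagree and down-weighting the less confident modality at decoding time, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the abstract gives no formulas, so the function names (`conflict_rate`, `dynamic_contrastive_decode`), the string-match comparison, the use of the maximum softmax probability as "answer confidence", and the `alpha`-scaled weighting are hypothetical choices, not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def conflict_rate(visual_answers, textual_answers):
    """Fraction of samples whose visual and textual answers disagree.

    Assumption: answers are compared by case-insensitive exact match;
    the paper's pipeline may use a different comparison.
    """
    if len(visual_answers) != len(textual_answers):
        raise ValueError("answer lists must have the same length")
    if not visual_answers:
        raise ValueError("answer lists must not be empty")
    disagreements = sum(
        1
        for v, t in zip(visual_answers, textual_answers)
        if v.strip().lower() != t.strip().lower()
    )
    return disagreements / len(visual_answers)


def dynamic_contrastive_decode(visual_logits, textual_logits, alpha=1.0):
    """Pick an answer index by contrasting the two modalities' logits.

    Assumptions (hypothetical, not taken from the paper):
    - confidence of a modality = its maximum softmax probability;
    - the contrast strength scales with the confidence gap, so two
      equally confident modalities leave the stronger logits unchanged.
    """
    if len(visual_logits) != len(textual_logits):
        raise ValueError("logit lists must have the same length")
    conf_vis = max(softmax(visual_logits))
    conf_txt = max(softmax(textual_logits))
    if conf_vis >= conf_txt:
        strong, weak = visual_logits, textual_logits
    else:
        strong, weak = textual_logits, visual_logits
    weight = alpha * abs(conf_vis - conf_txt)
    # Amplify the confident modality and subtract the less confident one.
    adjusted = [(1 + weight) * s - weight * w for s, w in zip(strong, weak)]
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return best, adjusted
```

In this sketch, calling `dynamic_contrastive_decode` with a peaked textual distribution and a flat visual one returns the textual modality's preferred index, because the visual logits are treated as the less confident component and subtracted. Setting `alpha=0` disables the contrast and simply returns the argmax of the more confident modality.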

arXiv information

Authors	Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, Muhao Chen
Published	2024-10-04 17:59:28+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
