Looking Beyond Text: Reducing Language bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Summary

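Large vision-language models (LVLMs) have achieved impressive results across a range of vision-language tasks.
Despite this promising performance, LVLMs suffer from hallucinations caused by language bias: the model attends less to the image, and its visual comprehension degrades.
We identify two main causes of this bias. 1. The training data used in LLM pretraining and in the multimodal alignment stage differ in scale.
2. An inference bias learned from the short-term dependency of text data.
We therefore propose LACING, a systematic framework that addresses language bias in LVLMs with a multimodal dual-attention mechanism (MDA) and soft-image guidance (IFG).
Specifically, MDA introduces a parallel dual-attention mechanism that strengthens the integration of visual inputs throughout the model.
IFG introduces a learnable soft visual prompt that replaces the visual inputs during training and inference, designed to compel the LVLM to prioritize text inputs.
IFG further proposes a new decoding strategy that uses the soft visual prompt to reduce the model's over-reliance on adjacent text inputs.
Comprehensive experiments show that the method effectively mitigates language bias in LVLMs, improving visual comprehension and reducing hallucinations without requiring additional training resources or data.
The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).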
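
The summary does not give the formula behind the IFG decoding strategy, so the following is only a minimal sketch of one plausible reading: a contrastive combination, in the style of classifier-free guidance, of the next-token distribution obtained with the real image and the one obtained with the soft visual prompt in its place. The function names, the weight `alpha`, and the toy logits are all assumptions made for illustration and are not taken from the paper or its code release.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def guided_log_probs(logits_with_image, logits_with_soft_prompt, alpha=1.0):
    """Contrast image-conditioned logits against soft-prompt-conditioned logits.

    Assumed rule (hypothetical, not the authors' stated formula):
        (1 + alpha) * log p(token | image, text) - alpha * log p(token | soft prompt, text)
    With alpha = 0 this reduces to ordinary image-conditioned decoding.
    """
    if len(logits_with_image) != len(logits_with_soft_prompt):
        raise ValueError("both logit vectors must cover the same vocabulary")
    lp_img = log_softmax(logits_with_image)
    lp_soft = log_softmax(logits_with_soft_prompt)
    combined = [(1.0 + alpha) * a - alpha * b for a, b in zip(lp_img, lp_soft)]
    return log_softmax(combined)


def greedy_token(log_probs):
    """Index of the most probable token."""
    return max(range(len(log_probs)), key=log_probs.__getitem__)


# Toy three-token vocabulary. Token 0 is what the text context alone suggests;
# token 1 is the one the image supports.
logits_img = [2.0, 1.9, 0.0]    # forward pass with the real image
logits_soft = [2.5, 0.5, 0.0]   # forward pass with the soft visual prompt instead

plain_choice = greedy_token(guided_log_probs(logits_img, logits_soft, alpha=0.0))
guided_choice = greedy_token(guided_log_probs(logits_img, logits_soft, alpha=1.0))
print(plain_choice, guided_choice)
```

In this toy case the token favored by the text context narrowly wins under plain decoding, while the guided combination penalizes it because it is just as likely without the image. In practice the two logit vectors would come from two forward passes of the same LVLM.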

Abstract (original)

Large vision-language models (LVLMs) have achieved impressive results in various vision-language tasks. However, despite showing promising performance, LVLMs suffer from hallucinations caused by language bias, leading to diminished focus on images and ineffective visual comprehension. We identify two primary reasons for this bias: 1. Different scales of training data between the pretraining stage of LLM and multimodal alignment stage. 2. The learned inference bias due to short-term dependency of text data. Therefore, we propose LACING, a systemic framework designed to address the language bias of LVLMs with muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (IFG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. IFG introduces a learnable soft visual prompt during training and inference to replace visual inputs, designed to compel LVLMs to prioritize text inputs. Then, IFG further proposes a novel decoding strategy using the soft visual prompt to mitigate the model’s over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data. The code and model are available at [lacing-lvlm.github.io](https://lacing-lvlm.github.io).

arXiv information

Authors	Haozhe Zhao, Shuzheng Si, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang
Published	2024-11-21 16:33:30+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
