Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding

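Summary

Large Vision-Language Models (LVLMs) have advanced considerably, combining visual recognition with language understanding to generate content that is both coherent and suited to its context.
Despite this success, LVLMs still suffer from object hallucination: the model produces plausible but incorrect output that mentions objects not present in the image.
To mitigate this problem, we introduce Visual Contrastive Decoding (VCD), a simple, training-free method that contrasts the output distributions obtained from the original visual input and from a distorted version of it.
VCD effectively reduces over-reliance on statistical bias and unimodal priors, two key causes of object hallucination.
This adjustment keeps the generated content closely grounded in the visual input, yielding contextually accurate output.
Our experiments show that VCD significantly reduces object hallucination across different LVLM families without additional training or external tools.
Beyond reducing object hallucination, VCD also performs well on general LVLM benchmarks, which highlights its broad applicability.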
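
The idea of contrasting two output distributions can be illustrated with a small sketch. The abstract does not state the exact formula, so the linear combination below, the weight `alpha`, the function names, and the toy logits are all assumptions made for illustration, following the general contrastive-decoding pattern rather than the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(logits_original, logits_distorted, alpha=1.0):
    """Hypothetical contrast step: amplify what the original image supports
    and subtract what the model still predicts from the distorted image.
    The form (1 + alpha) * original - alpha * distorted is an assumption."""
    return [
        (1.0 + alpha) * lo - alpha * ld
        for lo, ld in zip(logits_original, logits_distorted)
    ]


# Toy vocabulary and made-up next-token logits (not real model output).
vocab = ["dog", "frisbee", "bench"]
logits_original = [2.0, 1.0, 1.8]   # conditioned on the real image
logits_distorted = [1.0, 0.2, 2.0]  # conditioned on a distorted image

adjusted = contrastive_logits(logits_original, logits_distorted, alpha=1.0)
probs_before = softmax(logits_original)
probs_after = softmax(adjusted)

for token, before, after in zip(vocab, probs_before, probs_after):
    print(f"{token}: {before:.3f} -> {after:.3f}")
```

In this toy case, "bench" keeps a high score even when the image is distorted, which suggests it is driven by language priors rather than by the picture. After the contrast step its probability drops from roughly 0.37 to roughly 0.16, while "dog", which depends on the real image, gains probability.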

Summary (original)

Large Vision-Language Models (LVLMs) have advanced considerably, intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success, LVLMs still suffer from the issue of object hallucinations, where models generate plausible yet incorrect outputs that include objects that do not exist in the images. To mitigate this issue, we introduce Visual Contrastive Decoding (VCD), a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors, two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs, resulting in contextually accurate outputs. Our experiments show that VCD, without either additional training or the usage of external tools, significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations, VCD also excels in general LVLM benchmarks, highlighting its wide-ranging applicability.

arXiv information

Authors	Sicong Leng, Hang Zhang, Guanzheng Chen, Xin Li, Shijian Lu, Chunyan Miao, Lidong Bing
Published	2023-11-28 16:26:35+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
