Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?

Summary

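Large vision-language models (LVLMs) have recently pushed the state of the art dramatically in image captioning and in many image understanding tasks (e.g., visual question answering).
However, LVLMs often hallucinate, generating captions that mention concepts that cannot be found in the image.
These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their widespread adoption.
Recent work suggests that adding grounding objectives (objectives that explicitly align image regions or objects to text spans) reduces the amount of LVLM hallucination.
Although this claim is intuitive, we argue that it is not empirically justified, because the reduction effects were established with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) widely used in LVLM training and (ii) measure hallucination via question answering rather than open-ended caption generation.
In contrast, this work offers the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination, under an evaluation protocol that more realistically captures LVLM hallucination in open generation.
Extensive experiments over three backbone LLMs reveal that grounding objectives have little to no effect on object hallucination in open caption generation.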

Summary (original)

Large vision-language models (LVLMs) have recently dramatically pushed the state of the art in image captioning and many image understanding tasks (e.g., visual question answering). LVLMs, however, often \textit{hallucinate} and produce captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that addition of grounding objectives — those that explicitly align image regions or objects to text spans — reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified as the reduction effects have been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM training and (ii) measure hallucination via question answering rather than open-ended caption generation. In this work, in contrast, we offer the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more realistically captures LVLM hallucination in open generation. Our extensive experiments over three backbone LLMs reveal that grounding objectives have little to no effect on object hallucination in open caption generation.

arXiv information

Authors	Gregor Geigle, Radu Timofte, Goran Glavaš
Published	2024-06-20 16:56:11+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
