Mitigating Object Hallucination in Large Vision-Language Models via Classifier-Free Guidance

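Summary

As Large Vision-Language Models (LVLMs) have advanced, their tendency to hallucinate objects that are not present in the image has become an increasingly visible problem.
Earlier work addressed it by correcting LVLM outputs with specially curated datasets or with a powerful LLM such as GPT-3.5.
Those approaches need either expensive training or fine-tuning, or API access to an advanced LLM that corrects the output after generation.
This paper introduces a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), which requires neither training nor API access and reduces object hallucination effectively and efficiently during the generation process itself.
Specifically, MARINE enriches the LVLM's visual context by integrating existing open-source vision models, and uses classifier-free guidance to incorporate the additional object grounding features, which improves the precision of the LVLM's generations.
Comprehensive evaluation across six popular LVLMs with diverse metrics demonstrates that MARINE is effective and even outperforms existing fine-tuning-based methods.
Notably, as assessed by GPT-4V, it not only reduces hallucinations but also improves the level of detail in the LVLMs' generations.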

Abstract (original)

The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs (e.g., GPT-3.5) to rectify the outputs of LVLMs. However, these approaches require either expensive training/fine-tuning or API access to advanced LLMs to correct the model’s output post-generation. In this paper, we tackle this challenge by introducing a framework called Mitigating hallucinAtion via classifieR-Free guIdaNcE (MARINE), which is both training-free and API-free, and can effectively and efficiently reduce object hallucinations during the generation process. Specifically, MARINE enriches the visual context of LVLMs by integrating existing open-source vision models, and employs classifier-free guidance to incorporate the additional object grounding features to improve the precision of LVLMs’ generations. Through comprehensive evaluations across $6$ popular LVLMs with diverse evaluation metrics, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it not only reduces hallucinations but also improves the detailedness of LVLMs’ generations, as assessed by GPT-4V.
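The abstract does not spell out how the guidance is applied, so the following is only a minimal sketch of classifier-free guidance at the token level, assuming the commonly used log-linear form in which the guided log-probability is `gamma * log p(conditioned) + (1 - gamma) * log p(unconditioned)`. The function names, the toy vocabulary, the logit values, and `gamma` are all hypothetical and are not taken from the paper. Here "conditioned" stands for the LVLM run with extra object-grounding context and "unconditioned" for the LVLM run on its original input.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_next_token_logprobs(cond_logits, uncond_logits, gamma):
    """Mix two next-token distributions with classifier-free guidance.

    cond_logits:   logits when object-grounding context is supplied (assumed)
    uncond_logits: logits from the original input only (assumed)
    gamma:         guidance strength; 0 keeps the original distribution,
                   1 uses only the conditioned distribution
    """
    lp_cond = log_softmax(cond_logits)
    lp_uncond = log_softmax(uncond_logits)
    mixed = [gamma * c + (1.0 - gamma) * u for c, u in zip(lp_cond, lp_uncond)]
    # Renormalise so the result is again a valid log-distribution.
    return log_softmax(mixed)


# Toy example: the grounding context strongly disfavours "car",
# an object that is assumed to be absent from the image.
vocab = ["dog", "frisbee", "car"]
uncond = [2.0, 1.0, 1.8]
cond = [2.5, 1.5, -1.0]

guided = cfg_next_token_logprobs(cond, uncond, gamma=0.7)
best = vocab[max(range(len(vocab)), key=guided.__getitem__)]
```

In this toy setting the guided distribution assigns less probability to "car" than the unguided one does, which is the intended effect: tokens unsupported by the grounding signal are down-weighted during decoding rather than corrected afterwards.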

arXiv information

Authors	Linxi Zhao, Yihe Deng, Weitong Zhang, Quanquan Gu
Published	2024-02-13 18:59:05+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
