Exploiting Semantic Reconstruction to Mitigate Hallucinations in Vision-Language Models

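Abstract

Hallucinations in vision-language models pose a significant challenge to their reliability, particularly when generating long captions.
Current methods fall short of accurately identifying and mitigating these hallucinations.
To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress hallucinations by accurately localizing and penalizing hallucinated tokens.
First, ESREAL creates a reconstructed image from the generated caption and aligns its regions with the corresponding regions of the original image.
This semantic reconstruction helps identify both the presence and the type of token-level hallucinations in the generated caption.
Next, ESREAL computes token-level hallucination scores by assessing the semantic similarity of the aligned regions according to the type of hallucination.
Finally, ESREAL employs a proximal policy optimization algorithm that selectively penalizes hallucinated tokens according to their token-level hallucination scores.
Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46%, respectively, on the CHAIR metric.
This improvement is achieved solely through signals derived from the image itself, without requiring any image-text pairs.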

Abstract (original)

Hallucinations in vision-language models pose a significant challenge to their reliability, particularly in the generation of long captions. Current methods fall short of accurately identifying and mitigating these hallucinations. To address this issue, we introduce ESREAL, a novel unsupervised learning framework designed to suppress the generation of hallucinations through accurate localization and penalization of hallucinated tokens. Initially, ESREAL creates a reconstructed image based on the generated caption and aligns its corresponding regions with those of the original image. This semantic reconstruction aids in identifying both the presence and type of token-level hallucinations within the generated caption. Subsequently, ESREAL computes token-level hallucination scores by assessing the semantic similarity of aligned regions based on the type of hallucination. Finally, ESREAL employs a proximal policy optimization algorithm, where it selectively penalizes hallucinated tokens according to their token-level hallucination scores. Our framework notably reduces hallucinations in LLaVA, InstructBLIP, and mPLUG-Owl2 by 32.81%, 27.08%, and 7.46% on the CHAIR metric. This improvement is achieved solely through signals derived from the image itself, without the need for any image-text pairs.
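
The scoring and penalization steps described above can be illustrated with a small sketch. This is not ESREAL's implementation: the abstract does not specify the embedding model, the region alignment procedure, or how the hallucination type changes the score. The sketch assumes that each aligned region is already represented by an embedding vector, that the hallucination score is one minus cosine similarity, and that a fixed threshold decides which tokens are penalized. The names `cosine_similarity`, `token_penalties`, `token_to_region`, and `threshold` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def token_penalties(tokens, token_to_region, original_regions,
                    reconstructed_regions, threshold=0.5):
    """Return one penalty per caption token (0.0 means no penalty).

    Assumption: hallucination score = 1 - cosine similarity between the
    original and reconstructed embeddings of the token's aligned region.
    Tokens that map to no region (e.g. function words) are never penalized.
    """
    penalties = []
    for index in range(len(tokens)):
        region = token_to_region.get(index)
        if region is None:
            penalties.append(0.0)
            continue
        similarity = cosine_similarity(original_regions[region],
                                       reconstructed_regions[region])
        score = 1.0 - similarity
        # Penalize only tokens whose score exceeds the (assumed) threshold.
        penalties.append(-score if score > threshold else 0.0)
    return penalties


# Toy data: the "dog" region matches after reconstruction, the "sofa" region does not.
tokens = ["a", "dog", "on", "a", "red", "sofa"]
token_to_region = {1: "dog", 4: "sofa", 5: "sofa"}
original_regions = {"dog": [1.0, 0.0], "sofa": [0.0, 1.0]}
reconstructed_regions = {"dog": [1.0, 0.0], "sofa": [1.0, 0.0]}

penalties = token_penalties(tokens, token_to_region,
                            original_regions, reconstructed_regions)
print(penalties)
```

In a proximal policy optimization setup, per-token values like these could serve as token-level rewards, so that only the tokens tied to mismatched regions are discouraged while the rest of the caption is left untouched.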

arXiv information

Authors	Minchan Kim, Minyeong Kim, Junik Bae, Suhwan Choi, Sungkyung Kim, Buru Chang
Published	2024-03-26 15:14:25+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
