Delve into Visual Contrastive Decoding for Hallucination Mitigation of Large Vision-Language Models

Summary

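Large vision-language models (LVLMs) are good at generating plausible responses that correlate with the visual input, but they still suffer from hallucinations, where the generated text does not accurately reflect the visual content.
To address this, recent approaches apply contrastive decoding: they calibrate the model's response by contrasting the output distributions obtained from the original sample and from a visually distorted sample, and they show promising hallucination mitigation without any training.
However, the potential of changing the information in the visual input has not been well explored, so a deeper investigation into how visual contrastive decoding behaves is of great interest.
In this paper, the authors first explore several ways of changing visual content for contrastive decoding, including image downsampling and image editing.
Downsampling an image reduces its detailed textual information, while editing generates new content in the image, so each provides a different aspect as a visual contrastive sample.
To further study the benefits of using different contrastive samples, they analyze probability-level metrics, including entropy and distribution distance.
Interestingly, the effect of these samples on hallucination mitigation varies considerably across LVLMs and benchmarks.
Based on this analysis, they propose a simple yet effective method for combining contrastive samples, offering a practical solution for applying contrastive decoding across various scenarios.
Extensive experiments validate the proposed fusion method on different benchmarks.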
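
As a rough illustration of the contrastive decoding step described above, the sketch below contrasts next-token logits from the original image with logits from a distorted image. The summary does not give the exact formula, so this follows a formulation commonly used in the visual contrastive decoding literature: amplify the original logits, subtract the distorted ones, and keep only tokens that are already plausible under the original distribution. The function names and the `alpha` and `beta` parameters are assumptions made for this example, not the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_decode(logits_orig, logits_distorted, alpha=1.0, beta=0.1):
    """Return a next-token distribution contrasted against a distorted input.

    alpha (assumed name): strength of the contrast.
    beta (assumed name): plausibility cutoff relative to the top token
    under the original image; expected to lie in [0, 1].
    """
    probs_orig = softmax(logits_orig)
    cutoff = beta * max(probs_orig)
    contrasted = [
        # Keep plausible tokens and contrast their logits; mask the rest.
        (1 + alpha) * lo - alpha * ld if p >= cutoff else float("-inf")
        for lo, ld, p in zip(logits_orig, logits_distorted, probs_orig)
    ]
    return softmax(contrasted)


# Toy vocabulary of three tokens. Token 1 scores equally with and without
# image detail (a language-prior guess); token 0 depends on the image.
original = [2.0, 1.9, -3.0]
distorted = [0.5, 1.9, -3.0]
print(softmax(original))
print(contrastive_decode(original, distorted))
```

In this toy case the token whose score drops when the image is distorted gains probability after contrasting, while the implausible third token is masked out entirely.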

Summary (original)

While large vision-language models (LVLMs) have shown impressive capabilities in generating plausible responses correlated with input visual contents, they still suffer from hallucinations, where the generated text inaccurately reflects visual contents. To address this, recent approaches apply contrastive decoding to calibrate the model’s response via contrasting output distributions with original and visually distorted samples, demonstrating promising hallucination mitigation in a training-free manner. However, the potential of changing information in visual inputs is not well-explored, so a deeper investigation into the behaviors of visual contrastive decoding is of great interest. In this paper, we first explore various methods for contrastive decoding to change visual contents, including image downsampling and editing. Downsampling images reduces the detailed textual information while editing yields new contents in images, providing new aspects as visual contrastive samples. To further study benefits by using different contrastive samples, we analyze probability-level metrics, including entropy and distribution distance. Interestingly, the effect of these samples in mitigating hallucinations varies a lot across LVLMs and benchmarks. Based on our analysis, we propose a simple yet effective method to combine contrastive samples, offering a practical solution for applying contrastive decoding across various scenarios. Extensive experiments are conducted to validate the proposed fusion method among different benchmarks.
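
The abstract mentions probability-level metrics, "entropy and distribution distance", without naming a specific distance. The sketch below shows one way such metrics could be computed for two next-token distributions, using Shannon entropy and, as an assumption for this example, the Jensen-Shannon divergence as the distance. The function names are hypothetical and the paper may use a different measure.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)


def kl_divergence(p, q):
    """KL(p || q); terms with p_i == 0 contribute nothing."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by log(2)."""
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Example: distributions from the original and a distorted image.
p_original = [0.7, 0.2, 0.1]
p_distorted = [0.4, 0.4, 0.2]
print(entropy(p_original), entropy(p_distorted))
print(js_divergence(p_original, p_distorted))
```

A small divergence would suggest that the distorted sample barely changes the model's prediction, in which case contrasting against it carries little signal.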

arXiv information

Authors	Yi-Lun Lee, Yi-Hsuan Tsai, Wei-Chen Chiu
Published	2024-12-09 18:57:57+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
