H-POPE: Hierarchical Polling-based Probing Evaluation of Hallucinations in Large Vision-Language Models

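Summary

Large vision-language models (LVLMs) have made significant progress on a variety of multimodal tasks by leveraging both text and images.
Nevertheless, these models often suffer from hallucinations, such as inconsistencies between the visual input and the textual output.
To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically evaluates hallucinations in object existence and attributes.
Our evaluation shows that the models are prone to hallucinating about object existence, and even more so about fine-grained attributes.
We further investigate whether these models rely on the visual input to produce their output text.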

Summary (original)

By leveraging both texts and images, large vision language models (LVLMs) have shown significant progress in various multi-modal tasks. Nevertheless, these models often suffer from hallucinations, e.g., they exhibit inconsistencies between the visual input and the textual output. To address this, we propose H-POPE, a coarse-to-fine-grained benchmark that systematically assesses hallucination in object existence and attributes. Our evaluation shows that models are prone to hallucinations on object existence, and even more so on fine-grained attributes. We further investigate whether these models rely on visual input to formulate the output texts.
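
To make the idea of polling-based probing concrete, here is a minimal sketch of how yes/no answers could be scored separately at the coarse (object existence) and fine (attribute) levels. It is an illustration only, not the authors' implementation: the function name `score_polling`, the level labels, the record layout, and the toy records are all assumptions made for this example, and the two metrics shown (accuracy and the share of "yes" answers) are simply common choices for yes/no polling.

```python
from collections import defaultdict


def score_polling(records):
    """Score yes/no polling answers per question level.

    `records` is an iterable of (level, ground_truth, model_answer) tuples,
    where both answers are the strings "yes" or "no". This layout is a
    hypothetical one chosen for the sketch, not taken from the paper.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for level, truth, answer in records:
        c = counts[level]
        if answer == "yes":
            c["tp" if truth == "yes" else "fp"] += 1
        else:
            c["fn" if truth == "yes" else "tn"] += 1

    report = {}
    for level, c in counts.items():
        total = sum(c.values())
        report[level] = {
            # Fraction of questions answered correctly.
            "accuracy": (c["tp"] + c["tn"]) / total,
            # Fraction of questions answered "yes"; a high value on
            # questions whose truth is "no" points to hallucination.
            "yes_ratio": (c["tp"] + c["fp"]) / total,
        }
    return report


# Toy records, invented for illustration only.
records = [
    ("existence", "yes", "yes"),
    ("existence", "no", "yes"),
    ("attribute", "yes", "yes"),
    ("attribute", "no", "yes"),
    ("attribute", "no", "yes"),
    ("attribute", "yes", "no"),
]
report = score_polling(records)
for level, metrics in report.items():
    print(level, metrics)
```

Keeping the two levels in separate buckets is what allows a comparison such as the one in the abstract, where hallucination on fine-grained attributes is reported to be more pronounced than on object existence.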

arXiv information

Authors	Nhi Pham, Michael Schott
Published	2024-11-06 17:55:37+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
