Vision-Language Models Do Not Understand Negation

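Summary

Many practical vision-language applications need models that understand negation, for example when natural-language queries are used to retrieve images that contain certain objects but not others.
Despite progress in vision-language models (VLMs) through large-scale training, their ability to understand negation remains underexplored.
This study asks how well current VLMs understand negation.
We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets.
The benchmark consists of two core tasks for evaluating negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions.
Our evaluation shows that modern VLMs struggle significantly with negation, often performing at chance level.
To address these shortcomings, we explore a data-centric approach in which CLIP models are finetuned on large-scale synthetic datasets containing millions of negated captions.
We show that this approach can yield a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.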
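
To make the two reported metrics concrete, here is a minimal sketch of how recall on retrieval queries and multiple-choice accuracy relative to chance might be computed. It is an illustration only, not the NegBench evaluation code: the function names `recall_at_k`, `mcq_accuracy`, and `chance_level` are hypothetical, the use of recall@k as the recall variant is an assumption, and the number of answer choices per question is not stated in the abstract.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[int]],
                relevant_ids: Sequence[int],
                k: int) -> float:
    """Fraction of queries whose relevant item appears in the top-k results.

    ranked_ids[i] is the retrieval ranking for query i (best first);
    relevant_ids[i] is the single item assumed to be correct for query i.
    """
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids)
               if rel in ranked[:k])
    return hits / len(relevant_ids)


def mcq_accuracy(predicted: Sequence[int], correct: Sequence[int]) -> float:
    """Fraction of multiple-choice questions answered correctly."""
    hits = sum(1 for p, c in zip(predicted, correct) if p == c)
    return hits / len(correct)


def chance_level(num_choices: int) -> float:
    """Expected accuracy of uniform random guessing among num_choices options."""
    return 1.0 / num_choices
```

A model "performing at chance level" would have `mcq_accuracy(...)` close to `chance_level(n)`, where `n` is the number of options per question (for example, 0.25 if one assumes four options).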

Summary (original)

Many practical vision-language applications require models that understand negation, e.g., when using natural language to retrieve images which contain certain objects but not others. Despite advancements in vision-language models (VLMs) through large-scale training, their ability to comprehend negation remains underexplored. This study addresses the question: how well do current VLMs understand negation? We introduce NegBench, a new benchmark designed to evaluate negation understanding across 18 task variations and 79k examples spanning image, video, and medical datasets. The benchmark consists of two core tasks designed to evaluate negation understanding in diverse multimodal settings: Retrieval with Negation and Multiple Choice Questions with Negated Captions. Our evaluation reveals that modern VLMs struggle significantly with negation, often performing at chance level. To address these shortcomings, we explore a data-centric approach wherein we finetune CLIP models on large-scale synthetic datasets containing millions of negated captions. We show that this approach can result in a 10% increase in recall on negated queries and a 40% boost in accuracy on multiple-choice questions with negated captions.

arXiv information

Authors	Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip Torr, Yoon Kim, Marzyeh Ghassemi
Published	2025-01-16 09:55:42+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
