NL-Eye: Abductive NLI for Images

Abstract

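Would a bot built on a vision-language model (VLM) warn us about slipping if it detected a wet floor? Recent VLMs have shown impressive capabilities, but their ability to infer outcomes and causes remains underexplored. We therefore introduce NL-Eye, a benchmark designed to evaluate the visual abductive reasoning of VLMs. NL-Eye adapts the abductive natural language inference (NLI) task to the visual domain: a model must judge the plausibility of hypothesis images given a premise image and explain its decision. The benchmark consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. Data curation involved two steps, writing textual descriptions and generating images with text-to-image models, both of which required substantial human involvement to ensure high-quality, challenging scenes. In our experiments, VLMs struggle considerably on NL-Eye, often performing at the level of a random baseline, whereas humans excel at both plausibility prediction and explanation quality. This points to a deficiency in the abductive reasoning of current VLMs. NL-Eye is an important step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and verification of generated video.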

Abstract (original)

Will a Visual Language Model (VLM)-based bot warn us about slipping if it detects a wet floor? Recent VLMs have demonstrated impressive capabilities, yet their ability to infer outcomes and causes remains underexplored. To address this, we introduce NL-Eye, a benchmark designed to assess VLMs’ visual abductive reasoning skills. NL-Eye adapts the abductive Natural Language Inference (NLI) task to the visual domain, requiring models to evaluate the plausibility of hypothesis images based on a premise image and explain their decisions. NL-Eye consists of 350 carefully curated triplet examples (1,050 images) spanning diverse reasoning categories: physical, functional, logical, emotional, cultural, and social. The data curation process involved two steps – writing textual descriptions and generating images using text-to-image models, both requiring substantial human involvement to ensure high-quality and challenging scenes. Our experiments show that VLMs struggle significantly on NL-Eye, often performing at random baseline levels, while humans excel in both plausibility prediction and explanation quality. This demonstrates a deficiency in the abductive reasoning capabilities of modern VLMs. NL-Eye represents a crucial step toward developing VLMs capable of robust multimodal reasoning for real-world applications, including accident-prevention bots and generated video verification.
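The plausibility-prediction setup described in the abstract can be pictured with a small sketch. It assumes, which the abstract does not state explicitly, that each triplet pairs one premise image with two hypothesis images (consistent with 350 × 3 = 1,050 images) and that the model picks the more plausible of the two, so a random guesser would be right about half the time. The `Triplet` fields, the `"a"`/`"b"` label format, and the helper names are hypothetical and are not taken from the NL-Eye release.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Triplet:
    """One benchmark example (hypothetical schema, assumed for illustration)."""
    premise: str         # path or ID of the premise image
    hypothesis_a: str    # first candidate hypothesis image
    hypothesis_b: str    # second candidate hypothesis image
    more_plausible: str  # gold label, "a" or "b" (assumed format)
    category: str        # e.g. "physical", "social"


def accuracy(triplets: Sequence[Triplet],
             predict: Callable[[Triplet], str]) -> float:
    """Fraction of triplets where the predictor picks the gold hypothesis."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = sum(predict(t) == t.more_plausible for t in triplets)
    return correct / len(triplets)


def random_predictor(seed: int = 0) -> Callable[[Triplet], str]:
    """Baseline that ignores the images and guesses "a" or "b" uniformly."""
    rng = random.Random(seed)
    return lambda t: rng.choice(("a", "b"))
```

Under these assumptions, a VLM wrapper would be passed as `predict`, and its score compared with `accuracy(triplets, random_predictor())`. Scoring explanation quality, which the abstract also mentions, is outside this sketch.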

arXiv information

Authors	Mor Ventura, Michael Toker, Nitay Calderon, Zorik Gekhman, Yonatan Bitton, Roi Reichart
Published	2024-10-03 15:51:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
