DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests

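Abstract

Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning.
However, because of the modality gap between textual and visual data, they often face significant challenges, including over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning.
Existing benchmarks for evaluating visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations.
To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests for evaluating visual chain-of-thought reasoning in complex real-world scenarios.
It contains 3,931 expert-crafted multiple-choice problems, together with interleaved explanations grounded in the entities relevant to the reasoning process.
We use this dataset to carry out an extensive study of how well LVLMs reason about complex visual scenarios.
Our experiments show that both open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning in zero-shot settings.
We investigate training strategies that leverage the relevant entities to improve visual reasoning.
Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.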

Abstract (original)

Large vision-language models (LVLMs) augment language models with visual understanding, enabling multimodal reasoning. However, due to the modality gap between textual and visual data, they often face significant challenges, such as over-reliance on text priors, hallucinations, and limited capacity for complex visual reasoning. Existing benchmarks to evaluate visual reasoning in LVLMs often rely on schematic or synthetic images and on imprecise machine-generated explanations. To bridge the modality gap, we present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios. It offers 3,931 expert-crafted multiple-choice problems and interleaved explanations grounded with entities relevant to the reasoning process. We leverage this dataset to perform an extensive study of LVLMs’ ability to reason about complex visual scenarios. Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings. We investigate training strategies that leverage relevant entities to improve visual reasoning. Notably, we observe a performance boost of up to 7% when reasoning over image tokens of cropped regions tied to these entities.

arXiv information

Authors	Charles Corbière,Simon Roburin,Syrielle Montariol,Antoine Bosselut,Alexandre Alahi
Published	2025-01-08 18:31:16+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
