DH-Bench: Probing Depth and Height Perception of Large Visual-Language Models

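Abstract

Geometric understanding is essential for navigating and interacting with our environment.
Large Vision Language Models (VLMs) show impressive capabilities, but deploying them in real-world scenarios requires a comparable geometric understanding in visual perception.
This work focuses on the geometric understanding of these models.
In particular, it targets the depth and height of objects within a scene.
Our observations reveal that although VLMs excel at perceiving basic geometric properties such as shape and size, they face significant challenges when reasoning about the depth and height of objects.
To address this, we introduce a suite of benchmark datasets covering Synthetic 2D, Synthetic 3D, and Real-World scenarios to rigorously evaluate these aspects.
Benchmarking 17 state-of-the-art VLMs on these datasets, we find that they consistently struggle with both depth and height perception.
Our key insights include detailed analyses of the shortcomings in the depth and height reasoning capabilities of VLMs and of the inherent biases present in these models.
This study aims to pave the way for developing VLMs with enhanced geometric understanding, which is crucial for real-world applications.
The code and datasets for the benchmarks will be available at \url{https://tinyurl.com/DH-Bench1}.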

Abstract (original)

Geometric understanding is crucial for navigating and interacting with our environment. While large Vision Language Models (VLMs) demonstrate impressive capabilities, deploying them in real-world scenarios necessitates a comparable geometric understanding in visual perception. In this work, we focus on the geometric comprehension of these models; specifically targeting the depths and heights of objects within a scene. Our observations reveal that, although VLMs excel in basic geometric properties perception such as shape and size, they encounter significant challenges in reasoning about the depth and height of objects. To address this, we introduce a suite of benchmark datasets encompassing Synthetic 2D, Synthetic 3D, and Real-World scenarios to rigorously evaluate these aspects. We benchmark 17 state-of-the-art VLMs using these datasets and find that they consistently struggle with both depth and height perception. Our key insights include detailed analyses of the shortcomings in depth and height reasoning capabilities of VLMs and the inherent bias present in these models. This study aims to pave the way for the development of VLMs with enhanced geometric understanding, crucial for real-world applications. The code and datasets for our benchmarks will be available at \url{https://tinyurl.com/DH-Bench1}.

arXiv information

Authors	Shehreen Azad, Yash Jain, Rishit Garg, Yogesh S Rawat, Vibhav Vineet
Published	2024-08-21 16:16:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
