I Know About ‘Up’! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction

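Abstract

Visual Language Models (VLMs) are essential for a range of tasks, particularly visual reasoning tasks, because of their robust multimodal information integration, visual reasoning capabilities, and contextual awareness.
However, the visual spatial reasoning capabilities of existing VLMs are often inadequate, and they struggle even with basic tasks such as distinguishing left from right.
To address this, we propose ZeroVLM, a model designed to enhance the visual spatial reasoning abilities of VLMs.
ZeroVLM employs Zero-1-to-3, a 3D reconstruction model, to obtain different views of the input image, and incorporates a prompting mechanism to further improve visual spatial reasoning.
Experimental results on four visual spatial reasoning datasets show that ZeroVLM achieves up to a 19.48% accuracy improvement, which indicates the effectiveness of its 3D reconstruction and prompting mechanisms.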

Abstract (original)

Visual Language Models (VLMs) are essential for various tasks, particularly visual reasoning tasks, due to their robust multi-modal information integration, visual reasoning capabilities, and contextual awareness. However, existing VLMs' visual spatial reasoning capabilities are often inadequate, struggling even with basic tasks such as distinguishing left from right. To address this, we propose the ZeroVLM model, designed to enhance the visual spatial reasoning abilities of VLMs. ZeroVLM employs Zero-1-to-3, a 3D reconstruction model for obtaining different views of the input images, and incorporates a prompting mechanism to further improve visual spatial reasoning. Experimental results on four visual spatial reasoning datasets show that ZeroVLM achieves up to 19.48% accuracy improvement, which indicates the effectiveness of the 3D reconstruction and prompting mechanisms of ZeroVLM.
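
The abstract describes two ingredients: extra views of the input image obtained from a 3D reconstruction model, and a prompting mechanism. The sketch below is only an illustration of how those two pieces might fit together, not the authors' implementation. Every name in it (`SpatialQuery`, `collect_views`, `build_prompt`, the `generate_view` callable, the azimuth angles, and the prompt wording) is a hypothetical placeholder; the view generator is injected as a plain function so that no real Zero-1-to-3 API is assumed.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical type: a function that returns a new view of an image for a
# given azimuth angle in degrees. In the paper this role is played by
# Zero-1-to-3; here it is an injected stand-in, not a real API.
ViewGenerator = Callable[[str, float], str]


@dataclass
class SpatialQuery:
    image: str      # identifier or path of the original image (assumption)
    question: str   # spatial question, e.g. "Is the cup left of the plate?"


def collect_views(image: str,
                  azimuths: Sequence[float],
                  generate_view: ViewGenerator) -> List[str]:
    """Return the original image followed by one synthesized view per azimuth."""
    return [image] + [generate_view(image, angle) for angle in azimuths]


def build_prompt(query: SpatialQuery, views: Sequence[str]) -> str:
    """Compose a text prompt telling the VLM how the views relate.

    The wording is an assumption for illustration; the abstract does not
    specify the actual prompt.
    """
    lines = [
        f"You are given {len(views)} views of the same scene.",
        "View 1 is the original image; the others are synthesized from different angles.",
        f"Question: {query.question}",
        "Answer with respect to the spatial layout in View 1.",
    ]
    return "\n".join(lines)


def stub_view_generator(image: str, azimuth: float) -> str:
    """Stand-in for a real novel-view model: just tags the image name."""
    return f"{image}@{azimuth}"


query = SpatialQuery(image="scene.png", question="Is the cup left of the plate?")
views = collect_views(query.image, [-30.0, 30.0], stub_view_generator)
prompt = build_prompt(query, views)
```

In a real pipeline, `views` and `prompt` would be passed together to a VLM; that call is left out here because the abstract does not say which model or interface is used.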

arXiv information

Authors	Zaiqiao Meng, Hao Zhou, Yifang Chen
Published	2024-09-12 11:17:46+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
