Coarse Correspondence Elicit 3D Spacetime Understanding in Multimodal Language Model

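Summary

Multimodal language models (MLLMs) are increasingly deployed in real-world environments, which requires them to interpret 3D space and understand temporal dynamics. Despite their potential, the current top models in our community still fall short of adequately understanding spatial and temporal dimensions. We introduce Coarse Correspondence, a simple, training-free, effective, and general-purpose visual prompting method for eliciting 3D and temporal understanding in multimodal LLMs. The method uses a lightweight tracking model to find object correspondences between frames of a video or between image viewpoints. It selects the most frequently occurring object instances and visualizes them in the images with markers carrying unique IDs. With this simple approach, we achieve state-of-the-art results on 3D understanding benchmarks, including ScanQA (+20.5%) and a subset of OpenEQA (+9.7%), and on long-form video benchmarks such as EgoSchema (+6.0%). We also curate a small diagnostic dataset to evaluate whether MLLMs can reason about space from a described viewpoint other than the camera viewpoint. Here too, Coarse Correspondence improves spatial perspective-taking, but the results highlight that MLLMs still struggle with this task. Together, these results show that our simple prompting method can significantly aid downstream tasks that require 3D or temporal reasoning.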

Summary (original)

Multimodal language models (MLLMs) are increasingly being implemented in real-world environments, necessitating their ability to interpret 3D spaces and comprehend temporal dynamics. Despite their potential, current top models within our community still fall short in adequately understanding spatial and temporal dimensions. We introduce Coarse Correspondence, a simple, training-free, effective, and general-purpose visual prompting method to elicit 3D and temporal understanding in multimodal LLMs. Our method uses a lightweight tracking model to find object correspondences between frames in a video or between sets of image viewpoints. It selects the most frequent object instances and visualizes them with markers with unique IDs in the image. With this simple approach, we achieve state-of-the-art results on 3D understanding benchmarks including ScanQA (+20.5%) and a subset of OpenEQA (+9.7%), and on long-form video benchmarks such as EgoSchema (+6.0%). We also curate a small diagnostic dataset to evaluate whether MLLMs can reason about space from a described viewpoint other than the camera viewpoint. Again, Coarse Correspondence improves spatial perspective-taking abilities but we highlight that MLLMs struggle with this task. Together, we demonstrate that our simple prompting method can significantly aid downstream tasks that require 3D or temporal reasoning.
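
The selection step described in the abstract (keep the object instances seen most often, then give each a unique marker ID) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tracker output format (one dict per frame mapping a track ID to a marker position), the function names `select_frequent_instances` and `marker_plan`, the `top_k` parameter, and the tie-breaking rule are all assumptions made for illustration, and the actual drawing of markers onto images is left out.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical tracker output: one dict per frame, mapping a track ID
# to the (x, y) position where that object's marker would be drawn.
FrameTracks = Dict[int, Tuple[int, int]]


def select_frequent_instances(frames: List[FrameTracks], top_k: int) -> List[int]:
    """Return the track IDs that appear in the most frames.

    Ties are broken by the lower track ID (an arbitrary choice for this sketch).
    """
    counts = Counter(track_id for frame in frames for track_id in frame)
    ranked = sorted(counts, key=lambda tid: (-counts[tid], tid))
    return ranked[:top_k]


def marker_plan(frames: List[FrameTracks], top_k: int) -> List[List[Tuple[int, Tuple[int, int]]]]:
    """For each frame, list (marker_label, position) for the kept instances.

    The same object keeps the same label in every frame, which is what
    lets a model match it across views.
    """
    kept = select_frequent_instances(frames, top_k)
    labels = {tid: i + 1 for i, tid in enumerate(kept)}
    return [
        sorted((labels[tid], pos) for tid, pos in frame.items() if tid in labels)
        for frame in frames
    ]


# Toy example: three frames, three tracked objects (IDs 3, 7, 9).
frames = [
    {7: (10, 20), 3: (50, 60)},
    {7: (12, 22), 9: (80, 15)},
    {7: (14, 25), 3: (55, 58), 9: (82, 18)},
]
plan = marker_plan(frames, top_k=2)
```

In this toy input, track 7 appears in all three frames and tracks 3 and 9 in two each, so with `top_k=2` the sketch keeps tracks 7 and 3 and assigns them labels 1 and 2. Each per-frame list in `plan` could then be passed to any image-drawing routine to overlay the labeled markers before the frames are given to the model.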

arXiv information

Authors	Benlin Liu, Yuhao Dong, Yiqin Wang, Yongming Rao, Yansong Tang, Wei-Chiu Ma, Ranjay Krishna
Published	2024-08-01 17:57:12+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
