Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

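Summary

Complex 3D scene understanding is attracting growing attention, and scene encoding strategies play an important role in this success.
However, the optimal scene encoding strategy for different scenarios remains unclear, particularly compared with image-based counterparts.
To address this, we present a comprehensive study that probes a range of visual encoding models for 3D scene understanding and identifies each model's strengths and limitations across scenarios.
Our evaluation covers seven vision foundation encoders, including image-based, video-based, and 3D foundation models.
We evaluate these models on four tasks: vision-language scene reasoning, visual grounding, segmentation, and registration, each focusing on a different aspect of scene understanding.
The evaluation yields key findings: DINOv2 shows superior performance, video models excel at object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations on language-related tasks.
These insights challenge some conventional understanding, offer new perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.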

Abstract (original)

Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.

arXiv information

Authors	Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
Published	2024-09-05 17:59:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
