Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding

Summary

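Complex 3D scene understanding has attracted growing attention, and scene encoding strategies play a crucial role in this success.
However, the optimal scene encoding strategies for different scenarios remain unclear, particularly compared with their image-based counterparts.
To address this issue, we present a comprehensive study that probes a range of visual encoding models for 3D scene understanding and identifies the strengths and limitations of each model across different scenarios.
Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models.
We evaluate these models on four tasks: vision-language scene reasoning, visual grounding, segmentation, and registration, each focusing on a different aspect of scene understanding.
The evaluation yields several key findings: DINOv2 demonstrates superior performance, video models excel at object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations on language-related tasks.
These insights challenge some conventional understandings, offer new perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks.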
Code: https://github.com/YunzeMan/Lexicon3D

Summary (original)

Complex 3D scene understanding has gained increasing attention, with scene encoding strategies playing a crucial role in this success. However, the optimal scene encoding strategies for various scenarios remain unclear, particularly compared to their image-based counterparts. To address this issue, we present a comprehensive study that probes various visual encoding models for 3D scene understanding, identifying the strengths and limitations of each model across different scenarios. Our evaluation spans seven vision foundation encoders, including image-based, video-based, and 3D foundation models. We evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual Grounding, Segmentation, and Registration, each focusing on different aspects of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates superior performance, video models excel in object-level tasks, diffusion models benefit geometric tasks, and language-pretrained models show unexpected limitations in language-related tasks. These insights challenge some conventional understandings, provide novel perspectives on leveraging visual foundation models, and highlight the need for more flexible encoder selection in future vision-language and scene-understanding tasks. Code: https://github.com/YunzeMan/Lexicon3D

arXiv information

Authors	Yunze Man, Shuhong Zheng, Zhipeng Bao, Martial Hebert, Liang-Yan Gui, Yu-Xiong Wang
Published	2025-05-08 05:10:27+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
