Towards Foundation Models for 3D Vision: How Close Are We?

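Summary

Building a foundation model for 3D vision is a complex challenge that remains unsolved.
Toward that goal, it is important to understand the 3D reasoning capabilities of current models and to identify the gaps between these models and humans.
We therefore construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format.
We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it.
The results show that VLMs generally perform poorly, while specialized models are accurate but not robust: they fail under geometric perturbations.
In contrast, human vision remains the most reliable 3D visual system.
We further show that neural networks align more closely with human 3D vision mechanisms than classical computer vision methods do, and that Transformer-based networks such as ViT align more closely with them than CNNs do.
We hope this study will benefit the future development of foundation models for 3D vision.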

Abstract (original)

Building a foundation model for 3D vision is a complex challenge that remains unsolved. Towards that goal, it is important to understand the 3D reasoning capabilities of current models as well as identify the gaps between these models and humans. Therefore, we construct a new 3D visual understanding benchmark that covers fundamental 3D vision tasks in the Visual Question Answering (VQA) format. We evaluate state-of-the-art Vision-Language Models (VLMs), specialized models, and human subjects on it. Our results show that VLMs generally perform poorly, while the specialized models are accurate but not robust, failing under geometric perturbations. In contrast, human vision continues to be the most reliable 3D visual system. We further demonstrate that neural networks align more closely with human 3D vision mechanisms compared to classical computer vision methods, and Transformer-based networks such as ViT align more closely with human 3D vision mechanisms than CNNs. We hope our study will benefit the future development of foundation models for 3D vision.

arXiv information

Authors	Yiming Zuo, Karhan Kayan, Maggie Wang, Kevin Jeon, Jia Deng, Thomas L. Griffiths
Published	2024-10-14 17:57:23+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
