3D Concept Learning and Reasoning from Multi-View Images

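Summary

Humans can reason accurately in 3D by gathering multi-view observations of the world around them.
Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA).
The dataset is collected by an embodied agent that actively moves through an environment and captures RGB images using the Habitat simulator.
In total, it comprises approximately 5,000 scenes and 600,000 images, paired with 50,000 questions.
We evaluate a range of state-of-the-art visual reasoning models on the benchmark and find that all of them perform poorly.
We suggest that a principled approach to 3D reasoning from multi-view images is to infer a compact 3D representation of the world from those images, ground it in open-vocabulary semantic concepts, and then execute reasoning on the resulting 3D representation.
As a first step toward this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators.
Experimental results suggest that the framework outperforms baseline models by a large margin, but the challenge remains largely unsolved.
We also provide an in-depth analysis of the challenges and highlight potential future directions.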
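
To make the "ground the 3D representation in concepts, then reason over it" idea concrete, here is a minimal toy sketch. It is not the 3D-CLR implementation: the voxel features and concept embeddings are hand-made 2D vectors standing in for features a neural field and a vision-language model might produce, and the names `filter_concept`, `count_instances`, `CONCEPTS`, and `SCENE`, the 0.8 similarity threshold, and the connected-component counting rule are all assumptions chosen for illustration.

```python
import math
from collections import deque

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def filter_concept(voxel_features, concept_embedding, threshold=0.8):
    """Hypothetical 'filter' operator: keep voxels whose feature
    is similar enough to the concept embedding."""
    return {pos for pos, feat in voxel_features.items()
            if cosine(feat, concept_embedding) >= threshold}

def count_instances(voxels):
    """Hypothetical 'count' operator: number of 6-connected
    components among the selected voxels."""
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    unvisited = set(voxels)
    count = 0
    while unvisited:
        queue = deque([unvisited.pop()])
        count += 1
        while queue:
            x, y, z = queue.popleft()
            for dx, dy, dz in steps:
                nxt = (x + dx, y + dy, z + dz)
                if nxt in unvisited:
                    unvisited.remove(nxt)
                    queue.append(nxt)
    return count

# Toy stand-ins (assumptions): 2D "embeddings" instead of real model features.
CONCEPTS = {"chair": (1.0, 0.0), "table": (0.0, 1.0)}
SCENE = {
    (0, 0, 0): (0.9, 0.1),    # chair-like voxel
    (1, 0, 0): (0.95, 0.05),  # adjacent chair-like voxel (same object)
    (5, 0, 0): (0.9, 0.2),    # separate chair-like voxel
    (3, 3, 0): (0.1, 0.9),    # table-like voxel
}

def count_concept(scene, concept):
    """Answer 'how many <concept> are there?' on the toy scene."""
    return count_instances(filter_concept(scene, CONCEPTS[concept]))

print(count_concept(SCENE, "chair"))  # 2
print(count_concept(SCENE, "table"))  # 1
```

The point of the sketch is only the division of labour: a 3D representation carries per-location features, concept grounding reduces to a similarity test against a text-side embedding, and a question such as "how many chairs?" becomes a small program of operators applied to the selected 3D region.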

Abstract (original)

Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes, 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As the first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions.

arXiv information

Authors	Yining Hong, Chunru Lin, Yilun Du, Zhenfang Chen, Joshua B. Tenenbaum, Chuang Gan
Published	2023-03-20 17:59:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
