g3D-LF: Generalizable 3D-Language Feature Fields for Embodied Tasks

Summary

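We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on a large-scale 3D-language dataset for embodied tasks.
g3D-LF processes posed RGB-D images from an agent to encode feature fields that support:
1) predicting novel-view representations from any position in the 3D scene;
2) generating BEV (bird's-eye-view) maps centered on the agent;
3) querying targets with multi-granularity language within these representations.
The representation generalizes to unseen environments and supports real-time construction and dynamic updates.
By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, g3D-LF produces representations at different scales and perspectives, which multi-level contrastive learning aligns with multi-granularity language.
We also prepare a large-scale 3D-language dataset for aligning the feature-field representations with language.
Extensive experiments on Vision-and-Language Navigation (in both panorama and monocular settings), zero-shot object navigation, and situated question answering highlight the advantages and effectiveness of g3D-LF for embodied tasks.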
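
As a rough illustration of what "volume rendering latent features along sampled rays" can mean, the following sketch applies standard NeRF-style alpha compositing to feature vectors instead of colors. It is not taken from the g3D-LF code: the function name `render_ray_feature`, the density values, and the sample spacing are all assumptions chosen for the example.

```python
import numpy as np

def render_ray_feature(densities, features, deltas):
    """Alpha-composite per-sample latent features along one ray.

    densities: (N,) non-negative volume densities at the N samples
    features:  (N, D) latent feature vector at each sample
    deltas:    (N,) distance covered by each sample interval
    Returns the (D,) rendered feature and the (N,) compositing weights.
    """
    # Opacity contributed by each sample interval.
    alpha = 1.0 - np.exp(-densities * deltas)
    # Transmittance: fraction of the ray that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return weights @ features, weights

# Hypothetical ray: empty space, then a dense surface around samples 3 and 4.
rng = np.random.default_rng(0)
n_samples, feat_dim = 8, 4
densities = np.array([0.0, 0.0, 0.1, 5.0, 5.0, 0.1, 0.0, 0.0])
features = rng.normal(size=(n_samples, feat_dim))
deltas = np.full(n_samples, 0.25)

rendered, weights = render_ray_feature(densities, features, deltas)
print(rendered.shape, weights.sum())
```

The weights concentrate on the first dense sample the ray hits, so the rendered feature is dominated by the nearest surface. Rendering many such rays from a chosen camera pose would give a novel-view feature map.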

Abstract (original)

We introduce Generalizable 3D-Language Feature Fields (g3D-LF), a 3D representation model pre-trained on large-scale 3D-language dataset for embodied tasks. Our g3D-LF processes posed RGB-D images from agents to encode feature fields for: 1) Novel view representation predictions from any position in the 3D scene; 2) Generations of BEV maps centered on the agent; 3) Querying targets using multi-granularity language within the above-mentioned representations. Our representation can be generalized to unseen environments, enabling real-time construction and dynamic updates. By volume rendering latent features along sampled rays and integrating semantic and spatial relationships through multiscale encoders, our g3D-LF produces representations at different scales and perspectives, aligned with multi-granularity language, via multi-level contrastive learning. Furthermore, we prepare a large-scale 3D-language dataset to align the representations of the feature fields with language. Extensive experiments on Vision-and-Language Navigation under both Panorama and Monocular settings, Zero-shot Object Navigation, and Situated Question Answering tasks highlight the significant advantages and effectiveness of our g3D-LF for embodied tasks.
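
The abstract does not spell out its multi-level contrastive objective. As a generic point of reference only, the sketch below shows a CLIP-style symmetric InfoNCE loss that pulls matched visual and text embeddings together within a batch. The function name `symmetric_info_nce`, the temperature of 0.07, and the synthetic embeddings are assumptions for illustration, not the loss used by g3D-LF.

```python
import numpy as np

def symmetric_info_nce(visual, text, temperature=0.07):
    """CLIP-style symmetric InfoNCE over a batch of matched pairs.

    visual, text: (B, D) arrays; row i of each forms a positive pair,
    and every other row in the batch acts as a negative.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (B, B) scaled cosine similarities

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the visual-to-text and text-to-visual directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

# Synthetic embeddings (assumption): text is a noisy copy of visual.
rng = np.random.default_rng(0)
visual = rng.normal(size=(16, 32))
text = visual + 0.05 * rng.normal(size=(16, 32))

loss_matched = symmetric_info_nce(visual, text)
loss_shuffled = symmetric_info_nce(visual, np.roll(text, 1, axis=0))
print(loss_matched, loss_shuffled)
```

Correctly paired embeddings give a much lower loss than the same embeddings with the pairing shifted by one row. A multi-level variant would presumably apply a loss of this kind at several granularities, though the abstract gives no detail on how.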

arXiv information

Authors	Zihan Wang, Gim Hee Lee
Published	2024-11-26 01:54:52+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
