Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

Summary

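Vision-and-Language Navigation (VLN) is a core task in which an embodied agent uses its spatial mobility to navigate a 3D environment toward a designated destination by following natural language instructions.
Recently, video-language large models (Video-VLMs), with their strong generalization ability and rich commonsense knowledge, have shown remarkable performance on VLN tasks.
However, these models run into the following challenges when applied to real-world 3D navigation: 1) insufficient understanding of 3D geometry and spatial semantics;
2) limited capacity for large-scale exploration and long-term environmental memory;
3) poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that uses language-aligned, generalizable, hierarchical 3D representations as visual input for training a 3D-VLM on navigation action prediction.
Given posed RGB-D images, Dynam3D projects 2D CLIP features into 3D space and builds multi-level 3D patch-instance-zone representations for 3D geometric and semantic understanding, using a dynamic, layer-wise update strategy.
Dynam3D can encode and localize 3D instances online and update them dynamically as the environment changes, which gives navigation large-scale exploration and long-term memory capabilities.
By leveraging large-scale 3D-language pretraining and task-specific adaptation, Dynam3D sets a new state of the art on VLN benchmarks, including R2R-CE, REVERIE-CE, and NavRAG-CE, under monocular settings.
Furthermore, pre-exploration, lifelong memory, and real-world robot experiments validate its effectiveness in practical deployment.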

Summary (original)

Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance when applied to VLN tasks. However, these models still encounter the following challenges when applied to real-world 3D navigation: 1) Insufficient understanding of 3D geometry and spatial semantics; 2) Limited capacity for large-scale exploration and long-term environmental memory; 3) Poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input to train a 3D-VLM in navigation action prediction. Given posed RGB-D images, our Dynam3D projects 2D CLIP features into 3D space and constructs multi-level 3D patch-instance-zone representations for 3D geometric and semantic understanding with a dynamic and layer-wise update strategy. Our Dynam3D is capable of online encoding and localization of 3D instances, and dynamically updates them in changing environments to provide large-scale exploration and long-term memory capabilities for navigation. By leveraging large-scale 3D-language pretraining and task-specific adaptation, our Dynam3D sets new state-of-the-art performance on VLN benchmarks including R2R-CE, REVERIE-CE and NavRAG-CE under monocular settings. Furthermore, experiments for pre-exploration, lifelong memory, and real-world robot deployment validate the effectiveness of the approach in practical use.
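
The abstract says that Dynam3D projects 2D CLIP features into 3D space from posed RGB-D images. As a rough illustration of that general idea only, the sketch below lifts per-pixel features to world-frame 3D points with standard pinhole back-projection. It is not the authors' implementation: the function names, the `(fx, fy, cx, cy)` intrinsics tuple, the camera-to-world pose convention, and the assumption that the feature map has already been resampled to the depth map's resolution are all hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def backproject_pixel(u: float, v: float, depth: float,
                      fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pinhole back-projection: pixel (u, v) with metric depth -> camera-frame point."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def camera_to_world(p_cam: Vec3,
                    rotation: Sequence[Sequence[float]],
                    translation: Vec3) -> Vec3:
    """Apply an assumed camera-to-world pose: p_world = R * p_cam + t."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def lift_features(depth_map: Sequence[Sequence[float]],
                  feature_map: Sequence[Sequence[List[float]]],
                  intrinsics: Tuple[float, float, float, float],
                  rotation: Sequence[Sequence[float]],
                  translation: Vec3) -> List[Tuple[Vec3, List[float]]]:
    """Pair each valid-depth pixel's 2D feature with its world-frame 3D position.

    Assumes feature_map[v][u] is aligned with depth_map[v][u] (hypothetical:
    real CLIP features are patch-level and would need resampling first).
    """
    fx, fy, cx, cy = intrinsics
    lifted = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth <= 0:  # skip missing or invalid depth readings
                continue
            p_cam = backproject_pixel(u, v, depth, fx, fy, cx, cy)
            p_world = camera_to_world(p_cam, rotation, translation)
            lifted.append((p_world, feature_map[v][u]))
    return lifted
```

Each returned pair is a 3D position with the feature observed there. How such points are then grouped into the patch, instance, and zone levels is specific to the paper and is not sketched here.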

arXiv information

Authors	Zihan Wang, Seungjun Lee, Gim Hee Lee
Published	2025-05-16 15:46:27+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
