NavRAG: Generating User Demand Instructions for Embodied Navigation through Retrieval-Augmented LLM

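Abstract

Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, enabling them to navigate 3D environments by following natural language instructions. High-performance navigation models require large amounts of training data, and the high cost of manual annotation is a serious obstacle for the field. Several previous methods therefore augment data by converting trajectory videos into step-by-step instructions, but such instructions do not match the way users actually communicate, which is to describe a destination briefly or state a specific need. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG uses an LLM to build a hierarchical scene description tree for 3D scene understanding, from global layout down to local details; it then simulates various user roles with specific demands, retrieves from the scene tree, and generates diverse instructions with the LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and the navigation performance of the trained models.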

Abstract (original)

Vision-and-Language Navigation (VLN) is an essential skill for embodied agents, allowing them to navigate in 3D environments following natural language instructions. High-performance navigation models require a large amount of training data, the high cost of manually annotating data has seriously hindered this field. Therefore, some previous methods translate trajectory videos into step-by-step instructions for expanding data, but such instructions do not match well with users’ communication styles that briefly describe destinations or state specific needs. Moreover, local navigation trajectories overlook global context and high-level task planning. To address these issues, we propose NavRAG, a retrieval-augmented generation (RAG) framework that generates user demand instructions for VLN. NavRAG leverages LLM to build a hierarchical scene description tree for 3D scene understanding from global layout to local details, then simulates various user roles with specific demands to retrieve from the scene tree, generating diverse instructions with LLM. We annotate over 2 million navigation instructions across 861 scenes and evaluate the data quality and navigation performance of trained models.
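The abstract describes retrieval from a hierarchical scene tree, driven by a simulated user's demand. As a rough illustration only, the toy sketch below walks such a tree from the global layout down to a local detail and assembles a prompt for an LLM. Everything in it is an assumption made for illustration and is not taken from the paper: the `SceneNode` structure, the word-overlap scoring (a stand-in for whatever retriever NavRAG actually uses), the `build_prompt` format, and the example scene. No LLM is called.

```python
from dataclasses import dataclass, field


@dataclass
class SceneNode:
    """One node of a toy scene description tree (hypothetical structure)."""
    name: str
    description: str
    children: list["SceneNode"] = field(default_factory=list)


def overlap(demand: str, text: str) -> int:
    """Count words shared by the demand and a description (stand-in retriever)."""
    return len(set(demand.lower().split()) & set(text.lower().split()))


def subtree_score(demand: str, node: SceneNode) -> int:
    """Best overlap score found anywhere in the subtree rooted at node."""
    scores = [overlap(demand, node.description)]
    scores += [subtree_score(demand, child) for child in node.children]
    return max(scores)


def retrieve(demand: str, root: SceneNode) -> list[SceneNode]:
    """Descend from the root to a leaf, following the best-scoring child."""
    path = [root]
    while path[-1].children:
        best = max(path[-1].children, key=lambda c: subtree_score(demand, c))
        path.append(best)
    return path


def build_prompt(role: str, demand: str, path: list[SceneNode]) -> str:
    """Assemble the text an LLM would receive (hypothetical prompt format)."""
    context = " > ".join(f"{n.name} ({n.description})" for n in path)
    return (
        f"Role: {role}\nDemand: {demand}\n"
        f"Scene context: {context}\nWrite one navigation instruction."
    )


# Hypothetical scene, invented for this example.
house = SceneNode("house", "two floor home", [
    SceneNode("kitchen", "room for cooking food", [
        SceneNode("fridge", "cold storage for food and drink"),
        SceneNode("stove", "gas stove for cooking"),
    ]),
    SceneNode("bedroom", "room for sleeping", [
        SceneNode("bed", "double bed with pillows"),
    ]),
])

path = retrieve("I want a cold drink", house)
prompt = build_prompt("thirsty guest", "I want a cold drink", path)
print(prompt)
```

In this toy scene the demand shares the words "cold" and "drink" with the fridge description, so the walk goes house, kitchen, fridge. A realistic system would presumably replace the word-overlap score with embedding similarity or an LLM-based selection step.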

arXiv information

Authors	Zihan Wang, Yaohui Zhu, Gim Hee Lee, Yachun Fan
Published	2025-03-03 12:56:35+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
