CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory

Summary

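Aerial vision-and-language navigation (VLN), in which a drone must interpret natural language instructions and navigate complex urban environments, is emerging as a key embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment.
Existing ground VLN agents have achieved notable results in indoor and outdoor settings, but they struggle in aerial VLN because there is no predefined navigation graph and the action space expands exponentially during long-horizon exploration.
In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity of urban aerial VLN.
Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes a long-horizon task into sub-goals at different semantic levels.
The agent reaches the target progressively by achieving these sub-goals with different capacities of the LLM.
In addition, we develop a global memory module that stores historical trajectories in a topological graph, which simplifies navigation to previously visited targets.
Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement.
Further experiments demonstrate the effectiveness of the individual modules of CityNavAgent for aerial VLN in continuous city environments.
The code is available at https://github.com/VinceOuti/CityNavAgent.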

Summary (original)

Aerial vision-and-language navigation (VLN), requiring drones to interpret natural language instructions and navigate complex urban environments, emerges as a critical embodied AI challenge that bridges human-robot interaction, 3D spatial reasoning, and real-world deployment. Although existing ground VLN agents achieved notable results in indoor and outdoor settings, they struggle in aerial VLN due to the absence of predefined navigation graphs and the exponentially expanding action space in long-horizon exploration. In this work, we propose CityNavAgent, a large language model (LLM)-empowered agent that significantly reduces the navigation complexity for urban aerial VLN. Specifically, we design a hierarchical semantic planning module (HSPM) that decomposes the long-horizon task into sub-goals with different semantic levels. The agent reaches the target progressively by achieving sub-goals with different capacities of the LLM. Additionally, a global memory module storing historical trajectories into a topological graph is developed to simplify navigation for visited targets. Extensive benchmark experiments show that our method achieves state-of-the-art performance with significant improvement. Further experiments demonstrate the effectiveness of different modules of CityNavAgent for aerial VLN in continuous city environments. The code is available at https://github.com/VinceOuti/CityNavAgent.
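
The abstract describes the global memory module only as storing historical trajectories in a topological graph so that visited targets are easier to reach again. The following Python sketch illustrates that general idea and is not the authors' implementation. The class name `TopologicalMemory`, the `merge_radius` parameter, the rule for merging nearby waypoints, and the use of Dijkstra's algorithm for retrieval are all assumptions made for illustration.

```python
import heapq
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Waypoint = Tuple[float, float, float]  # (x, y, z) position; hypothetical representation


class TopologicalMemory:
    """Illustrative graph memory built from past trajectories (assumed design)."""

    def __init__(self, merge_radius: float = 5.0) -> None:
        # Waypoints closer than merge_radius share one node (assumed merging rule).
        self.merge_radius = merge_radius
        self.nodes: List[Waypoint] = []
        self.edges: Dict[int, Dict[int, float]] = defaultdict(dict)

    def _node_for(self, p: Waypoint) -> int:
        """Return the index of an existing node near p, or create a new node."""
        for i, q in enumerate(self.nodes):
            if math.dist(p, q) <= self.merge_radius:
                return i
        self.nodes.append(p)
        return len(self.nodes) - 1

    def _nearest(self, p: Waypoint) -> int:
        """Return the index of the stored node closest to p."""
        return min(range(len(self.nodes)), key=lambda i: math.dist(p, self.nodes[i]))

    def add_trajectory(self, trajectory: List[Waypoint]) -> None:
        """Insert a flown trajectory, linking consecutive waypoints with edges."""
        prev = None
        for p in trajectory:
            cur = self._node_for(p)
            if prev is not None and prev != cur:
                w = math.dist(self.nodes[prev], self.nodes[cur])
                self.edges[prev][cur] = w
                self.edges[cur][prev] = w
            prev = cur

    def shortest_path(self, start: Waypoint, goal: Waypoint) -> Optional[List[Waypoint]]:
        """Dijkstra search between the stored nodes nearest to start and goal."""
        if not self.nodes:
            return None
        src, dst = self._nearest(start), self._nearest(goal)
        dist = {src: 0.0}
        parent: Dict[int, int] = {}
        heap = [(0.0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                break
            if d > dist.get(u, math.inf):
                continue  # stale heap entry
            for v, w in self.edges[u].items():
                nd = d + w
                if nd < dist.get(v, math.inf):
                    dist[v] = nd
                    parent[v] = u
                    heapq.heappush(heap, (nd, v))
        if dst not in dist:
            return None  # goal node is not connected to the start node
        path = [dst]
        while path[-1] != src:
            path.append(parent[path[-1]])
        return [self.nodes[i] for i in reversed(path)]


memory = TopologicalMemory(merge_radius=5.0)
memory.add_trajectory([(0.0, 0.0, 10.0), (20.0, 0.0, 10.0), (40.0, 0.0, 10.0)])
memory.add_trajectory([(20.0, 1.0, 10.0), (20.0, 20.0, 10.0)])
route = memory.shortest_path((40.0, 0.0, 10.0), (20.0, 20.0, 10.0))
print(route)  # [(40.0, 0.0, 10.0), (20.0, 0.0, 10.0), (20.0, 20.0, 10.0)]
```

In this sketch the second trajectory starts within the merge radius of a waypoint from the first, so the two trajectories share a node and a route between their endpoints can be recovered without new exploration.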

arXiv information

Authors	Weichen Zhang, Chen Gao, Shiquan Yu, Ruiying Peng, Baining Zhao, Qian Zhang, Jinqiang Cui, Xinlei Chen, Yong Li
Published	2025-05-08 20:01:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
