Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions

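Summary

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions.
However, current VLN frameworks often rely on static environments and optimal expert supervision, which limits their real-world applicability.
To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), which extends traditional VLN by incorporating dynamic human activities and relaxing key assumptions.
We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, which extends R2R with human activity descriptions.
To tackle the challenges of HA-VLN, we present the Expert-Supervised Cross-Modal (VLN-CM) agent and the Non-Expert-Supervised Decision Transformer (VLN-DT) agent, which use cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments.
A comprehensive evaluation, including metrics that account for human activities and a systematic analysis of the challenges unique to HA-VLN, underscores the need for further research to improve the real-world robustness and adaptability of HA-VLN agents.
Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.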

Abstract (original)

Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN’s unique challenges, underscores the need for further research to enhance HA-VLN agents’ real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.

arXiv information

Authors	Minghan Li, Heng Li, Zhi-Qi Cheng, Yifei Dong, Yuxuan Zhou, Jun-Yan He, Qi Dai, Teruko Mitamura, Alexander G. Hauptmann
Published	2024-06-27 15:01:42+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
