Audio Visual Language Maps for Robot Navigation

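Summary

Interacting with the world is a multisensory experience, yet many robots still rely mainly on vision to map and navigate their environments.
This work proposes Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation that stores cross-modal information from audio, visual, and language cues.
AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid.
For navigation, AVLMaps let a robot system index goals in the map from multimodal queries such as textual descriptions, images, or audio snippets of landmarks.
In particular, adding audio information lets the robot disambiguate goal locations more reliably.
Extensive simulation experiments show that AVLMaps enable zero-shot multimodal goal navigation from multimodal prompts and achieve 50% better recall in ambiguous scenarios.
These capabilities also carry over to real-world mobile robots, which navigate to landmarks referred to by visual, audio, and spatial concepts.
Videos and code are available at https://avlmaps.github.io.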

Summary (original)

While interacting in the world is a multi-sensory experience, many robots continue to predominantly rely on visual perception to map and navigate in their environments. In this work, we propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues. AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid. In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks. In particular, the addition of audio information enables robots to more reliably disambiguate goal locations. Extensive experiments in simulation show that AVLMaps enable zero-shot multimodal goal navigation from multimodal prompts and provide 50% better recall in ambiguous scenarios. These capabilities extend to mobile robots in the real world – navigating to landmarks referring to visual, audio, and spatial concepts. Videos and code are available at: https://avlmaps.github.io.
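
The abstract describes two steps: fusing foundation-model features into a 3D voxel grid, and indexing goals in that grid from multimodal queries. The following Python sketch illustrates the general idea on toy data and is not the authors' implementation. All function names are hypothetical, the 2-dimensional vectors stand in for real model embeddings, and combining per-modality heatmaps by element-wise multiplication is an assumption made here for illustration.

```python
import numpy as np


def fuse_features(points, feats, origin, voxel_size, grid_shape):
    """Average per-point feature vectors into a dense voxel grid.

    points: (N, 3) world coordinates; feats: (N, D) embeddings.
    Returns (grid, counts), where grid has shape grid_shape + (D,).
    """
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx, feats = idx[inside], feats[inside]
    grid = np.zeros(tuple(grid_shape) + (feats.shape[1],))
    counts = np.zeros(grid_shape)
    cells = (idx[:, 0], idx[:, 1], idx[:, 2])
    np.add.at(grid, cells, feats)   # accumulate features per voxel
    np.add.at(counts, cells, 1)     # count observations per voxel
    occupied = counts > 0
    grid[occupied] /= counts[occupied][:, None]
    return grid, counts


def similarity_heatmap(grid, counts, query):
    """Cosine similarity between a query embedding and each occupied voxel."""
    norms = np.linalg.norm(grid, axis=-1) * np.linalg.norm(query)
    sims = np.zeros(counts.shape)
    valid = (counts > 0) & (norms > 0)
    sims[valid] = (grid[valid] @ query) / norms[valid]
    return sims


def localize(heatmaps, origin, voxel_size):
    """Multiply non-negative heatmaps and return the centre of the peak voxel."""
    combined = np.ones_like(heatmaps[0])
    for h in heatmaps:
        combined = combined * np.clip(h, 0.0, None)
    peak = np.unravel_index(np.argmax(combined), combined.shape)
    return origin + (np.array(peak) + 0.5) * voxel_size, combined


# Toy scene: two landmarks that look identical, one of which emits a sound.
origin, voxel_size, shape = np.zeros(3), 1.0, (4, 1, 1)
visual_grid, visual_counts = fuse_features(
    np.array([[0.5, 0.5, 0.5], [3.5, 0.5, 0.5]]),
    np.array([[1.0, 0.0], [1.0, 0.0]]),
    origin, voxel_size, shape)
audio_grid, audio_counts = fuse_features(
    np.array([[3.5, 0.5, 0.5]]),
    np.array([[0.0, 1.0]]),
    origin, voxel_size, shape)

visual_heat = similarity_heatmap(visual_grid, visual_counts, np.array([1.0, 0.0]))
audio_heat = similarity_heatmap(audio_grid, audio_counts, np.array([0.0, 1.0]))
goal, _ = localize([visual_heat, audio_heat], origin, voxel_size)
print(goal)
```

In this toy scene the visual heatmap alone scores both landmarks equally, so the visual query is ambiguous; multiplying in the audio heatmap leaves a single peak at the landmark that was heard. A real system would keep a separate feature space per modality and obtain embeddings from pretrained models rather than hand-written vectors.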

arXiv information

Authors	Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard
Published	2023-03-27 15:10:51+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
