RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

Summary

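Vision-and-Language Navigation (VLN) suffers from limited diversity and scale of training data, primarily because existing simulators must be curated manually.
To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations.
Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions.
To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of the walking paths, augmented with information on room types, object locations, and the 3D shape of the surrounding scenes.
Our dataset includes $\sim$100K open-ended, description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments.
We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks, including CVDN, SOON, R2R, and REVERIE.
Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showing both the potential and the challenges of advancing toward open-world navigation.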

Summary (original)

Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.

arXiv information

Authors	Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev
Published	2024-12-11 18:10:21+00:00

Source and services used

arxiv.jp, Google
