RoomTour3D: Geometry-Aware Video-Instruction Tuning for Embodied Navigation

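Abstract

Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of its training data, which is constrained mainly by the manual curation of existing simulators.
To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations.
Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions.
To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths, augmented with additional information on room types, object locations, and the 3D shape of the surrounding scenes.
Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments.
We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks, including CVDN, SOON, R2R, and REVERIE.
Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.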

Abstract (original)

Vision-and-Language Navigation (VLN) suffers from the limited diversity and scale of training data, primarily constrained by the manual curation of existing simulators. To address this, we introduce RoomTour3D, a video-instruction dataset derived from web-based room tour videos that capture real-world indoor spaces and human walking demonstrations. Unlike existing VLN datasets, RoomTour3D leverages the scale and diversity of online videos to generate open-ended human walking trajectories and open-world navigable instructions. To compensate for the lack of navigation data in online videos, we perform 3D reconstruction and obtain 3D trajectories of walking paths augmented with additional information on the room types, object locations and 3D shape of surrounding scenes. Our dataset includes $\sim$100K open-ended description-enriched trajectories with $\sim$200K instructions, and 17K action-enriched trajectories from 1847 room tour environments. We demonstrate experimentally that RoomTour3D enables significant improvements across multiple VLN tasks including CVDN, SOON, R2R, and REVERIE. Moreover, RoomTour3D facilitates the development of trainable zero-shot VLN agents, showcasing the potential and challenges of advancing towards open-world navigation.

arXiv information

Authors	Mingfei Han, Liang Ma, Kamila Zhumakhanova, Ekaterina Radionova, Jingyi Zhang, Xiaojun Chang, Xiaodan Liang, Ivan Laptev
Published	2025-03-19 10:05:05+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
