Scaling Data Generation in Vision-and-Language Navigation

Summary

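Recent research on language-guided visual navigation has shown a strong need for more diverse traversable environments and a larger quantity of supervision to train generalizable agents.
To address the data scarcity common to existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale training data: it uses more than 1,200 photo-realistic environments from the HM3D and Gibson datasets and synthesizes 4.9 million instruction-trajectory pairs using fully accessible resources on the web.
Importantly, we investigate how each component of this paradigm affects agent performance, and study how to apply the augmented data appropriately when pre-training and fine-tuning an agent.
With this large-scale dataset, simple imitation learning raises the performance of an existing agent to a new best of 80% single-run success rate on the R2R test split (+11% absolute over the previous SoTA).
The long-standing generalization gap between navigation in seen and unseen environments also shrinks to less than 1% (versus 8% for the previous best method).
In addition, our paradigm helps different models achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.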

Abstract (original)

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent’s performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments.

arXiv information

Authors	Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao
Published	2023-08-09 23:51:22+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
