InstruGen: Automatic Instruction Generation for Vision-and-Language Navigation Via Large Multimodal Models

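Abstract

Recent research on Vision-and-Language Navigation (VLN) shows that agents generalize poorly to unseen environments because realistic training environments and high-quality path-instruction pairs are scarce.
Most existing methods for constructing realistic navigation scenes are costly, and instruction augmentation relies mainly on predefined templates or rules, which limits adaptability.
To alleviate this problem, we propose InstruGen, a paradigm for generating VLN path-instruction pairs.
Specifically, we use YouTube house tour videos as realistic navigation scenes and leverage the strong visual understanding and generation abilities of large multimodal models (LMMs) to automatically generate diverse, high-quality VLN path-instruction pairs.
Our method generates navigation instructions at different granularities and achieves fine-grained alignment between instructions and visual observations, which previous methods found difficult to achieve.
In addition, we design a multi-stage verification mechanism to reduce hallucinations and inconsistency in LMMs.
Experimental results show that agents trained on path-instruction pairs generated by InstruGen achieve state-of-the-art performance on the R2R and RxR benchmarks, particularly in unseen environments.
Code is available at https://github.com/yanyu0526/InstruGen.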

Abstract (original)

Recent research on Vision-and-Language Navigation (VLN) indicates that agents suffer from poor generalization in unseen environments due to the lack of realistic training environments and high-quality path-instruction pairs. Most existing methods for constructing realistic navigation scenes have high costs, and the extension of instructions mainly relies on predefined templates or rules, lacking adaptability. To alleviate the issue, we propose InstruGen, a VLN path-instruction pairs generation paradigm. Specifically, we use YouTube house tour videos as realistic navigation scenes and leverage the powerful visual understanding and generation abilities of large multimodal models (LMMs) to automatically generate diverse and high-quality VLN path-instruction pairs. Our method generates navigation instructions with different granularities and achieves fine-grained alignment between instructions and visual observations, which was difficult to achieve with previous methods. Additionally, we design a multi-stage verification mechanism to reduce hallucinations and inconsistency of LMMs. Experimental results demonstrate that agents trained with path-instruction pairs generated by InstruGen achieve state-of-the-art performance on the R2R and RxR benchmarks, particularly in unseen environments. Code is available at https://github.com/yanyu0526/InstruGen.

arXiv information

Authors	Yu Yan, Rongtao Xu, Jiazhao Zhang, Peiyang Li, Xiaodan Liang, Jianqin Yin
Published	2024-11-18 09:11:48+00:00

Source and services used

arxiv.jp, Google
