VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation

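Summary

Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate realistic 3D outdoor environments by following natural language instructions.
Existing VLN methods are held back by the low diversity of navigation environments and by limited training data.
To address these issues, we propose VLN-Video, which uses the diverse outdoor environments found in driving videos from multiple U.S. cities, augmented with automatically generated navigation instructions and actions, to improve outdoor VLN performance.
VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques: it generates grounded navigation instructions through template infilling and pairs them with a navigation action predictor based on image rotation similarity, yielding VLN-style data from driving videos for pre-training deep learning VLN models.
The model is pre-trained on the Touchdown dataset and on the video-augmented dataset built from driving videos, using three proxy tasks (Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction) so that it learns temporally aware and visually aligned instruction representations.
The learned instruction representation is adapted to the state-of-the-art navigator when fine-tuning on the Touchdown dataset.
Empirical results show that VLN-Video significantly outperforms the previous state-of-the-art models by 2.1% in task completion rate, setting a new state of the art on the Touchdown dataset.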

Summary (original)

Outdoor Vision-and-Language Navigation (VLN) requires an agent to navigate through realistic 3D outdoor environments based on natural language instructions. The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data. To address these issues, we propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos in multiple cities in the U.S. augmented with automatically generated navigation instructions and actions to improve outdoor VLN performance. VLN-Video combines the best of intuitive classical approaches and modern deep learning techniques, using template infilling to generate grounded navigation instructions, combined with an image rotation similarity-based navigation action predictor to obtain VLN style data from driving videos for pretraining deep learning VLN models. We pre-train the model on the Touchdown dataset and our video-augmented dataset created from driving videos with three proxy tasks: Masked Language Modeling, Instruction and Trajectory Matching, and Next Action Prediction, so as to learn temporally-aware and visually-aligned instruction representations. The learned instruction representation is adapted to the state-of-the-art navigator when fine-tuning on the Touchdown dataset. Empirical results demonstrate that VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate, achieving a new state-of-the-art on the Touchdown dataset.
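
The abstract names two data-generation steps: an action predictor based on image rotation similarity, and template infilling for instructions. The following is only a rough sketch of how such a pipeline could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: frames are reduced to a one-dimensional strip of column intensities, a camera turn is approximated by a cyclic horizontal shift, and the function names, templates, and object labels are hypothetical.

```python
# Illustrative sketch only. The names, templates, and the cyclic-shift
# simplification are assumptions, not taken from the VLN-Video paper or code.

def shift_columns(frame, shift):
    """Cyclically shift a 1-D strip of column intensities.

    A positive shift moves content to the right, a negative one to the left.
    """
    shift %= len(frame)
    return frame[-shift:] + frame[:-shift] if shift else list(frame)


def mean_abs_diff(a, b):
    """Mean absolute difference between two equally long strips."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def predict_action(current, nxt, max_shift=2):
    """Pick the shift of `current` that best matches `nxt`, then map it to an action.

    Assumption: when the camera turns left, scene content moves right in the image.
    Ties are broken in favour of the smallest shift magnitude.
    """
    best_shift = min(
        range(-max_shift, max_shift + 1),
        key=lambda s: (mean_abs_diff(shift_columns(current, s), nxt), abs(s)),
    )
    if best_shift > 0:
        return "turn left"
    if best_shift < 0:
        return "turn right"
    return "go forward"


# Hypothetical instruction templates with one slot for a detected object.
TEMPLATES = {
    "go forward": "Go straight and pass the {object}.",
    "turn left": "Turn left at the {object}.",
    "turn right": "Turn right at the {object}.",
}


def fill_instruction(actions, objects):
    """Fill one template per step with the object assumed visible at that step."""
    return " ".join(
        TEMPLATES[action].format(object=obj) for action, obj in zip(actions, objects)
    )


# Toy "video": four frames, each a strip of six column intensities.
frames = [
    [10, 20, 30, 40, 50, 60],
    [10, 20, 30, 40, 50, 60],  # unchanged view
    [60, 10, 20, 30, 40, 50],  # content moved right
    [10, 20, 30, 40, 50, 60],  # content moved left
]
actions = [predict_action(a, b) for a, b in zip(frames, frames[1:])]
objects = ["traffic light", "stop sign", "red building"]  # hypothetical detections
print(actions)
print(fill_instruction(actions, objects))
```

Real driving frames are not panoramic, so a cyclic shift is a deliberate simplification; a practical version would compare image features rather than raw intensities and would take its objects from an actual detector.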

arXiv information

Authors	Jialu Li, Aishwarya Padmakumar, Gaurav Sukhatme, Mohit Bansal
Published	2024-02-07 18:02:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
