ULN: Towards Underspecified Vision-and-Language Navigation

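Abstract

Vision-and-Language Navigation (VLN) is the task of guiding an embodied agent to a target position using language instructions.
Despite significant performance gains, the widespread reliance on fine-grained instructions fails to capture the more practical linguistic variation found in real use.
To close this gap, we introduce a new setting, Underspecified Vision-and-Language Navigation (ULN), along with associated evaluation datasets.
ULN evaluates agents with multi-level underspecified instructions rather than purely fine-grained or purely coarse-grained ones, which makes it a more realistic and general setting.
As a first step toward ULN, we propose a VLN framework consisting of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module.
Specifically, we propose learning Granularity Specific Sub-networks (GSS) so that the agent can ground multi-level instructions with minimal additional parameters.
The E2E module then estimates grounding uncertainty and performs multi-step lookahead exploration to further improve the success rate.
Experimental results show that existing VLN models remain brittle under multi-level language underspecification.
Our framework is more robust and outperforms the baselines on ULN by about 10% in relative success rate across all levels.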
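
The "relative success rate" figure above is a ratio against the baseline, not a difference in percentage points. The short sketch below shows the distinction. The function name `relative_improvement` and both success-rate values are hypothetical, chosen only for illustration; they are not results reported in the paper.

```python
def relative_improvement(baseline_sr: float, new_sr: float) -> float:
    """Return the gain of new_sr over baseline_sr as a fraction of the baseline."""
    if baseline_sr <= 0:
        raise ValueError("baseline success rate must be positive")
    return (new_sr - baseline_sr) / baseline_sr


# Hypothetical success rates in percent (illustration only, not from the paper).
baseline_sr = 50.0
framework_sr = 55.0

gain = relative_improvement(baseline_sr, framework_sr)
print(f"absolute gain: {framework_sr - baseline_sr:.1f} points")
print(f"relative gain: {gain:.1%}")
```

Under these assumed numbers, a 5-point absolute gain over a 50% baseline corresponds to a 10% relative gain; the same relative gain over a lower baseline would mean fewer absolute points.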

Abstract (original)

Vision-and-Language Navigation (VLN) is a task to guide an embodied agent moving to a target position using language instructions. Despite the significant performance improvement, the wide use of fine-grained instructions fails to characterize more practical linguistic variations in reality. To fill in this gap, we introduce a new setting, namely Underspecified vision-and-Language Navigation (ULN), and associated evaluation datasets. ULN evaluates agents using multi-level underspecified instructions instead of purely fine-grained or coarse-grained, which is a more realistic and general setting. As a primary step toward ULN, we propose a VLN framework that consists of a classification module, a navigation agent, and an Exploitation-to-Exploration (E2E) module. Specifically, we propose to learn Granularity Specific Sub-networks (GSS) for the agent to ground multi-level instructions with minimal additional parameters. Then, our E2E module estimates grounding uncertainty and conducts multi-step lookahead exploration to improve the success rate further. Experimental results show that existing VLN models are still brittle to multi-level language underspecification. Our framework is more robust and outperforms the baselines on ULN by ~10% relative success rate across all levels.

arXiv information

Authors	Weixi Feng, Tsu-Jui Fu, Yujie Lu, William Yang Wang
Published	2022-10-18 17:45:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
