Prompt-based Context- and Domain-aware Pretraining for Vision and Language Navigation

Abstract

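Pretrained vision-language models have strong representation capabilities and are widely used in vision and language navigation (VLN).
However, most of them are trained on web-crawled, general-purpose datasets, which leaves a considerable domain gap when they are applied to VLN tasks.
A second challenge in VLN is how the agent understands the contextual relations between actions along a trajectory and performs cross-modal alignment sequentially.
This paper proposes a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems.
PANDA performs prompting in two stages.
In the domain-aware stage, a low-cost prompt tuning paradigm learns soft visual prompts from an in-domain dataset, equipping the pretrained models with object-level and scene-level cross-modal alignment for VLN tasks.
In the context-aware stage, a set of hard context prompts is designed to capture sequence-level semantics and to instill both the out-of-context and the contextual knowledge in the instruction into the cross-modal representations.
These prompts enable further tuning of the pretrained models via contrastive learning.
Experimental results on both R2R and REVERIE show that PANDA outperforms previous state-of-the-art methods.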
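
The abstract does not state which contrastive objective is used in the context-aware stage. As a rough illustration only, the sketch below assumes an InfoNCE-style loss over matched instruction and trajectory embeddings. The function names, the temperature value, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(text_embs, visual_embs, temperature=0.1):
    """Mean InfoNCE loss, assuming text_embs[i] and visual_embs[i] form a matched pair.

    Every other visual embedding in the batch acts as a negative for text i.
    """
    losses = []
    for i, t in enumerate(text_embs):
        logits = [cosine(t, v) / temperature for v in visual_embs]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the matched pair.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two instructions and two trajectories.
text = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # pair i matches pair i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = info_nce(text, aligned)
loss_swapped = info_nce(text, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each instruction embedding is closest to its own trajectory embedding and large when the pairs are mismatched, which is the signal a contrastive tuning stage would minimize.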

Abstract (original)

With strong representation capabilities, pretrained vision-language models are widely used in vision and language navigation (VLN). However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when used for VLN tasks. Another challenge for VLN is how the agent understands the contextual relations between actions on a trajectory and performs cross-modal alignment sequentially. In this paper, we propose a novel Prompt-bAsed coNtext- and Domain-Aware (PANDA) pretraining framework to address these problems. It performs prompting in two stages. In the domain-aware stage, we apply a low-cost prompt tuning paradigm to learn soft visual prompts from an in-domain dataset for equipping the pretrained models with object-level and scene-level cross-modal alignment in VLN tasks. Furthermore, in the context-aware stage, we design a set of hard context prompts to capture the sequence-level semantics and instill both out-of-context and contextual knowledge in the instruction into cross-modal representations. They enable further tuning of the pretrained models via contrastive learning. Experimental results on both R2R and REVERIE show the superiority of PANDA compared to previous state-of-the-art methods.

arXiv information

Authors	Ting Liu, Wansen Wu, Yue Hu, Youkai Wang, Kai Xu, Quanjun Yin
Published	2023-09-07 11:58:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
