Kefa: A Knowledge Enhanced and Fine-grained Aligned Speaker for Navigation Instruction Generation

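Abstract

We introduce \textsc{Kefa}, a new speaker model for navigation instruction generation.
Existing speaker models for Vision-and-Language Navigation suffer from a large domain gap in visual features across environments and from insufficient temporal grounding capability.
To address these challenges, we propose a Knowledge Refinement Module, which enhances feature representations with external knowledge facts, and an Adaptive Temporal Alignment method, which enforces fine-grained alignment between the generated instructions and the observation sequences.
In addition, we propose SPICE-D, a new metric for navigation instruction evaluation that is aware of the correctness of direction phrases.
Experimental results on the R2R and UrbanWalk datasets show that the proposed KEFA speaker achieves state-of-the-art instruction generation performance in both indoor and outdoor scenes.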

Abstract (original)

We introduce a novel speaker model \textsc{Kefa} for navigation instruction generation. The existing speaker models in Vision-and-Language Navigation suffer from the large domain gap of vision features between different environments and insufficient temporal grounding capability. To address the challenges, we propose a Knowledge Refinement Module to enhance the feature representation with external knowledge facts, and an Adaptive Temporal Alignment method to enforce fine-grained alignment between the generated instructions and the observation sequences. Moreover, we propose a new metric SPICE-D for navigation instruction evaluation, which is aware of the correctness of direction phrases. The experimental results on R2R and UrbanWalk datasets show that the proposed KEFA speaker achieves state-of-the-art instruction generation performance for both indoor and outdoor scenes.

arXiv information

Authors	Haitian Zeng, Xiaohan Wang, Wenguan Wang, Yi Yang
Published	2023-07-25 09:39:59+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
