Target-Grounded Graph-Aware Transformer for Aerial Vision-and-Dialog Navigation

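Abstract

This report describes in detail the method behind the winning entry of the AVDN Challenge at ICCV 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, in which a drone agent must associate the dialog history with aerial observations in order to reach the destination.
To improve the drone agent's cross-modal grounding ability, we propose the Target-Grounded Graph-Aware Transformer (TG-GAT) framework.
Specifically, TG-GAT first uses a graph-aware transformer to capture spatiotemporal dependencies, which benefits navigation state tracking and robust action planning.
In addition, an auxiliary visual grounding task is devised to strengthen the agent's awareness of the referred landmarks.
Furthermore, a hybrid augmentation strategy based on large language models is used to mitigate the limitations of data scarcity.
Our TG-GAT framework won the AVDN Challenge 2023, improving on the baseline by 2.2% and 3.0% absolute on the SPL and SR metrics, respectively.
The code is available at https://github.com/yifeisu/avdn-challenge.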

Abstract (original)

This report details the method of the winning entry of the AVDN Challenge in ICCV 2023. The competition addresses the Aerial Navigation from Dialog History (ANDH) task, which requires a drone agent to associate dialog history with aerial observations to reach the destination. For better cross-modal grounding abilities of the drone agent, we propose a Target-Grounded Graph-Aware Transformer (TG-GAT) framework. Concretely, TG-GAT first leverages a graph-aware transformer to capture spatiotemporal dependency, which benefits navigation state tracking and robust action planning. In addition, an auxiliary visual grounding task is devised to boost the agent’s awareness of referred landmarks. Moreover, a hybrid augmentation strategy based on large language models is utilized to mitigate data scarcity limitations. Our TG-GAT framework won the AVDN Challenge 2023, with 2.2% and 3.0% absolute improvements over the baseline on SPL and SR metrics, respectively. The code is available at https://github.com/yifeisu/avdn-challenge.
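
The results above are reported as SR (success rate) and SPL (success weighted by path length). As a rough illustration of what those two numbers measure, the sketch below computes them with the standard definition used in the embodied-navigation literature. The function names, the per-episode inputs, and the toy values are hypothetical and are not taken from the challenge code; the benchmark's own success criterion is assumed to be supplied as one boolean per episode.

```python
from typing import Sequence


def success_rate(successes: Sequence[bool]) -> float:
    """Fraction of episodes that ended in success."""
    if not successes:
        raise ValueError("at least one episode is required")
    return sum(successes) / len(successes)


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by path length.

    Each successful episode contributes l / max(p, l), where l is the
    shortest-path length and p is the length of the path actually flown.
    Failed episodes contribute 0. Assumes every shortest length is positive.
    """
    if not successes:
        raise ValueError("at least one episode is required")
    if not (len(successes) == len(shortest_lengths) == len(path_lengths)):
        raise ValueError("all inputs must have the same length")
    total = 0.0
    for ok, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        if ok:
            total += shortest / max(taken, shortest)
    return total / len(successes)


# Toy episodes (made-up values, for illustration only).
toy_success = [True, False, True]
toy_shortest = [10.0, 10.0, 10.0]
toy_taken = [10.0, 20.0, 20.0]

toy_sr = success_rate(toy_success)
toy_spl = spl(toy_success, toy_shortest, toy_taken)
```

Because each successful episode is discounted by how much longer the flown path was than the shortest one, SPL can never exceed SR. An "absolute" improvement, as quoted in the abstract, is a difference in percentage points between two such scores and not a relative change.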

arXiv information

Authors	Yifei Su, Dong An, Yuan Xu, Kehan Chen, Yan Huang
Published	2023-08-23 05:53:43+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
