COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation

Summary

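Vision-and-Language Navigation (VLN) tasks have become prominent in artificial intelligence research because of their potential applications in areas such as home assistants.
Many recent VLN approaches are based on transformer architectures but increasingly incorporate additional components, such as external knowledge bases or map information, to improve performance.
These additions improve performance but also lead to larger models and higher computational costs.
To achieve both high performance and low computational cost, this paper proposes a novel architecture, the COmbination of Selective MemOrization (COSMO).
Specifically, COSMO integrates state-space modules with transformer modules and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3).
RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module to a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions.
Experiments on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, demonstrate competitive navigation performance as well as a significant reduction in computational cost.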

Summary (original)

Vision-and-Language Navigation (VLN) tasks have gained prominence within artificial intelligence research due to their potential application in fields like home assistants. Many contemporary VLN approaches, while based on transformer architectures, have increasingly incorporated additional components such as external knowledge bases or map information to enhance performance. These additions, while boosting performance, also lead to larger models and increased computational costs. In this paper, to achieve both high performance and low computational costs, we propose a novel architecture with the COmbination of Selective MemOrization (COSMO). Specifically, COSMO integrates state-space modules and transformer modules, and incorporates two VLN-customized selective state space modules: the Round Selective Scan (RSS) and the Cross-modal Selective State Space Module (CS3). RSS facilitates comprehensive inter-modal interactions within a single scan, while the CS3 module adapts the selective state space module into a dual-stream architecture, thereby enhancing the acquisition of cross-modal interactions. Experimental validations on three mainstream VLN benchmarks, REVERIE, R2R, and R2R-CE, not only demonstrate competitive navigation performance of our model but also show a significant reduction in computational costs.
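
The abstract does not spell out how RSS or CS3 work, so the following is only a minimal sketch of a generic selective state-space recurrence, the kind of building block that "selective state space module" usually refers to. It is not the authors' implementation. The function name `selective_scan`, the projection matrices `W_delta`, `W_B`, `W_C`, and all shapes are assumptions made for illustration. The point it shows is that the step size and the input/output vectors depend on the current token, and that the sequence is processed in a single pass whose cost grows linearly with sequence length.

```python
import numpy as np


def softplus(x):
    # Numerically stable log(1 + exp(x)); keeps the step size positive.
    return np.logaddexp(0.0, x)


def selective_scan(x, A, W_delta, W_B, W_C):
    """Simplified selective state-space recurrence (illustrative only).

    x:        (T, D) token sequence, e.g. instruction and visual tokens
              concatenated (an assumption, not the paper's layout)
    A:        (D, N) state-decay parameters, expected to be negative
    W_delta:  (D, D) hypothetical projection giving a per-token step size
    W_B, W_C: (D, N) hypothetical projections giving per-token
              input and output vectors
    Returns y with shape (T, D).
    """
    T, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))   # hidden state: N values per channel
    y = np.empty((T, D))
    for t in range(T):
        # Input-dependent ("selective") parameters for this token.
        delta = softplus(x[t] @ W_delta)   # (D,)
        B_t = x[t] @ W_B                   # (N,)
        C_t = x[t] @ W_C                   # (N,)
        # Discretize the decay, then update and read out the state.
        A_bar = np.exp(delta[:, None] * A)                       # (D, N)
        h = A_bar * h + (delta[:, None] * B_t[None, :]) * x[t][:, None]
        y[t] = h @ C_t                                           # (D,)
    return y


rng = np.random.default_rng(0)
T, D, N = 10, 4, 3
x = rng.normal(size=(T, D))
A = -np.exp(rng.normal(size=(D, N)))       # negative, so the state decays
W_delta = rng.normal(size=(D, D))
W_B = rng.normal(size=(D, N))
W_C = rng.normal(size=(D, N))
y = selective_scan(x, A, W_delta, W_B, W_C)
print(y.shape)
```

Each output depends only on the current and earlier tokens, and the loop runs once per token, which is where the cost advantage over quadratic self-attention comes from in state-space models generally. How COSMO orders tokens within a scan or couples two streams is described in the paper itself, not here.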

arXiv information

Authors	Siqi Zhang, Yanyuan Qiao, Qunbo Wang, Zike Yan, Qi Wu, Zhihua Wei, Jing Liu
Published	2025-03-31 13:24:10+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
