A-VL: Adaptive Attention for Large Vision-Language Models

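Summary

Large vision-language models (LVLMs) integrate computer vision and natural language processing techniques and offer substantial application potential.
However, these models demand extensive resources during inference.
Adaptive attention techniques can dynamically reduce computational redundancy and thereby improve efficiency.
Current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, but they are not tailored for LVLMs.
The authors observe that LVLMs generate responses from both remote image tokens and local text tokens, and that different modalities have different attention patterns.
This observation motivates managing the attention for each modality separately.
Specifically, for visual input, the cache of potentially useful information is stored, but only the most critical parts are computed.
For language input, the focus is on local information.
Based on this observation and analysis of vision-language attention patterns, the authors develop A-VL, a plug-and-play adaptive attention method tailored for LVLM inference.
Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of the design.
A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.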
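
The per-modality idea described in the summary can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `image_importance` scores (assumed here to be supplied by some earlier step), and the `top_k_image` and `text_window` parameters are all hypothetical, chosen only to show "keep every image entry cached but compute only the most important ones" alongside "attend to recent text only" for a single decoding step.

```python
import math


def select_tokens(is_image, image_importance, top_k_image, text_window):
    """Pick which cached token positions to attend to for one decoding step.

    is_image[i] is True for image tokens and False for text tokens.
    image_importance[i] is an assumed, precomputed score (ignored for text).
    """
    image_idx = [i for i, flag in enumerate(is_image) if flag]
    text_idx = [i for i, flag in enumerate(is_image) if not flag]
    # Vision: every image entry stays in the cache, but only the
    # top-k entries by importance take part in the computation.
    ranked = sorted(image_idx, key=lambda i: image_importance[i], reverse=True)
    top_images = ranked[:top_k_image]
    # Language: keep only the most recent (local) text tokens.
    recent_text = text_idx[-text_window:] if text_window > 0 else []
    return sorted(top_images + recent_text)


def sparse_attention(query, keys, values, selected):
    """Scaled dot-product attention restricted to the selected positions."""
    if not selected:
        raise ValueError("at least one token must be selected")
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, keys[i])) / scale for i in selected
    ]
    # Numerically stable softmax over the selected scores only.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    output = [
        sum(w * values[i][j] for w, i in zip(weights, selected))
        for j in range(dim)
    ]
    return output, weights
```

In this sketch the full key/value lists play the role of the cache, and `selected` limits how much of it is actually read per step. How importance is estimated and when the selection is refreshed are design choices the summary does not specify, so they are left as inputs here.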

Abstract (original)

The Large Vision-Language Model (LVLM) integrates computer vision and natural language processing techniques, offering substantial application potential. However, these models demand extensive resources during inference. Adaptive attention techniques can dynamically reduce computational redundancy and thus improve efficiency. Although current adaptive attention methods significantly reduce the memory requirements of Transformer-based language models, they are not tailored for LVLMs. We observe that LVLMs generate responses from both remote image tokens and local text tokens, and different modalities have different attention patterns. This observation inspires us to manage the attention for each modality separately. Specifically, for visual input, we store the cache of potentially useful information but only compute the most critical parts. For language input, we care more about local information. Based on our observation and analysis of vision-language attention patterns, we develop A-VL, a plug-and-play adaptive attention tailored for LVLM inference. Extensive evaluations on three vision-language tasks and five datasets show the effectiveness of our designs. Our approach A-VL outperforms existing adaptive attention methods in reducing memory usage and computational load without compromising performance.

arXiv information

Authors	Junyang Zhang, Mu Yuan, Ruiguang Zhong, Puhan Luo, Huiyou Zhan, Ningkang Zhang, Chengchen Hu, Xiangyang Li
Published	2025-02-07 13:09:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
