LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models

Abstract

Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique enabling support for longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 5.58$\times$ end-to-end speedup compared to existing approaches.
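
The exchange described in the abstract can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the authors' implementation: the function names (`local_partial`, `merge`, `query_passing_attention`) are hypothetical, the "workers" are plain list slices rather than GPUs, and the merge step uses the standard online-softmax rescaling, which the abstract does not spell out. It is meant to show only that leaving the key-value shards in place and moving the small query block, together with its running state, can still reproduce exact attention.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, k, v):
    """Single-device softmax attention; q is [Sq][d], k and v are [Skv][d]."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * dot(qi, kj) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / z
                    for c in range(len(v[0]))])
    return out


def local_partial(qi, k_shard, v_shard, scale):
    """Work done on one worker: unnormalized output, max score, denominator."""
    scores = [scale * dot(qi, kj) for kj in k_shard]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    acc = [sum(wj * vj[c] for wj, vj in zip(w, v_shard))
           for c in range(len(v_shard[0]))]
    return acc, m, sum(w)


def merge(state, part):
    """Rescale two partial results to a shared max so the combined sum stays exact."""
    acc_a, m_a, z_a = state
    acc_b, m_b, z_b = part
    m = max(m_a, m_b)
    ca, cb = math.exp(m_a - m), math.exp(m_b - m)
    acc = [ca * x + cb * y for x, y in zip(acc_a, acc_b)]
    return acc, m, ca * z_a + cb * z_b


def query_passing_attention(q, k, v, num_workers):
    """K/V shards stay put; the query block and its running state visit each worker."""
    scale = 1.0 / math.sqrt(len(q[0]))
    size = math.ceil(len(k) / num_workers)
    shards = [(k[i:i + size], v[i:i + size]) for i in range(0, len(k), size)]
    states = [None] * len(q)
    for k_shard, v_shard in shards:  # one hop of the query block per worker
        for r, qi in enumerate(q):
            part = local_partial(qi, k_shard, v_shard, scale)
            states[r] = part if states[r] is None else merge(states[r], part)
    # Final normalization by the accumulated softmax denominator.
    return [[x / z for x in acc] for acc, _, z in states]


random.seed(0)
d, s_q, s_kv = 4, 2, 12  # few query tokens, many visual (key-value) tokens
q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s_q)]
k = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s_kv)]
v = [[random.gauss(0, 1) for _ in range(d)] for _ in range(s_kv)]

ref = reference_attention(q, k, v)
out = query_passing_attention(q, k, v, num_workers=4)
max_err = max(abs(a - b) for ra, rb in zip(ref, out) for a, b in zip(ra, rb))
print("max abs difference vs. reference:", max_err)
```

In this toy setup the data that moves between workers is the query block plus a per-row accumulator, max, and denominator, all of which scale with the number of query tokens rather than with the number of visual tokens. That is the intuition behind exchanging queries instead of key-value blocks when the visual input is long.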

arXiv information

Authors	Tzu-Tao Chang, Shivaram Venkataraman
Published	2025-02-04 15:24:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
