PrefixKV: Adaptive Prefix KV Cache is What Vision Instruction-Following Models Need for Efficient Generation

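Summary

Large vision-language models (LVLMs) have recently gained rapid popularity thanks to their strong generation and reasoning capabilities over diverse multimodal inputs.
However, these models incur substantial computational and memory overhead during inference, which greatly hinders efficient deployment in practical scenarios.
The large key-value (KV) cache required by long input and output sequences is a notable contributor to the high inference cost.
For this reason, recent work has investigated ways to reduce the KV cache size to improve efficiency.
Although effective, these methods generally overlook the fact that the importance distribution of KV vectors differs from layer to layer, and they keep the same cache size for every layer during next-token prediction.
As a result, certain layers lose a significant amount of contextual information, which leads to a notable drop in performance.
To address this, we present PrefixKV.
It reframes the problem of determining the KV cache size for every layer as the task of searching for the optimal global prefix configuration.
With an adaptive, layer-wise KV retention recipe based on binary search, the maximum contextual information can be preserved in each layer, which facilitates generation.
Extensive experiments demonstrate that our method achieves state-of-the-art performance compared with other approaches.
It offers a superior trade-off between inference efficiency and generation quality, showing promising potential for practical applications.
Code is available at \url{https://github.com/THU-MIG/PrefixKV}.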
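
The abstract gives no algorithmic detail beyond "layer-wise KV retention based on binary search", so the following is only a minimal sketch of how such a search could look, not the authors' implementation. It assumes each layer supplies a list of non-negative importance scores for its cached KV entries, that a layer keeps the top-ranked prefix of its entries sorted by importance, and that a single shared threshold on cumulative importance is binary-searched until the total number of kept entries fits a global budget. The names `prefix_sizes`, `search_prefix_config`, `layer_scores`, and `budget_ratio` are hypothetical.

```python
from itertools import accumulate


def prefix_sizes(layer_scores, threshold):
    """Per layer, return the smallest prefix length (over scores sorted in
    descending order) whose cumulative importance reaches `threshold` of the
    layer's total importance."""
    sizes = []
    for scores in layer_scores:
        ranked = sorted(scores, reverse=True)
        total = sum(ranked)
        size = len(ranked)  # fall back to keeping everything
        for k, cum in enumerate(accumulate(ranked), start=1):
            if cum >= threshold * total:
                size = k
                break
        sizes.append(size)
    return sizes


def search_prefix_config(layer_scores, budget_ratio, iters=30):
    """Binary-search one shared threshold so that the total number of kept
    entries does not exceed `budget_ratio` of all cached entries.

    Assumption: prefix sizes grow monotonically with the threshold, which
    holds because each layer's cumulative importance is non-decreasing.
    """
    total_entries = sum(len(s) for s in layer_scores)
    budget = int(budget_ratio * total_entries)
    lo, hi = 0.0, 1.0
    best = prefix_sizes(layer_scores, lo)  # minimal configuration
    for _ in range(iters):
        mid = (lo + hi) / 2
        sizes = prefix_sizes(layer_scores, mid)
        if sum(sizes) <= budget:
            best, lo = sizes, mid  # feasible: try to retain more
        else:
            hi = mid  # over budget: lower the threshold
    return best


# Toy input: layer 0 has peaked importance, layer 1 is uniform.
config = search_prefix_config([[7, 1, 1, 1], [1, 1, 1, 1]], budget_ratio=0.5)
print(config)
```

In this toy input the layer with concentrated importance ends up with a shorter prefix than the uniform layer, which is the kind of per-layer difference that a fixed cache size for every layer cannot express. Note that the sketch always keeps at least one entry per non-empty layer, so a budget smaller than the number of layers is not handled.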

Abstract (original)

Recently, large vision-language models (LVLMs) have rapidly gained popularity for their strong generation and reasoning capabilities given diverse multimodal inputs. However, these models incur significant computational and memory overhead during inference, which greatly hinders the efficient deployment in practical scenarios. The extensive key-value (KV) cache, necessitated by the lengthy input and output sequences, notably contributes to the high inference cost. Based on this, recent works have investigated ways to reduce the KV cache size for higher efficiency. Although effective, they generally overlook the distinct importance distributions of KV vectors across layers and maintain the same cache size for each layer during the next token prediction. This results in the significant contextual information loss for certain layers, leading to notable performance decline. To address this, we present PrefixKV. It reframes the challenge of determining KV cache sizes for all layers into the task of searching for the optimal global prefix configuration. With an adaptive layer-wise KV retention recipe based on binary search, the maximum contextual information can thus be preserved in each layer, facilitating the generation. Extensive experiments demonstrate that our method achieves the state-of-the-art performance compared with others. It exhibits superior inference efficiency and generation quality trade-offs, showing promising potential for practical applications. Code is available at \url{https://github.com/THU-MIG/PrefixKV}.

arXiv information

Authors	Ao Wang,Hui Chen,Jianchao Tan,Kefeng Zhang,Xunliang Cai,Zijia Lin,Jungong Han,Guiguang Ding
Published	2024-12-04 15:48:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
