CacheGen: Fast Context Loading for Language Model Applications via KV Cache Streaming

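Abstract

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge or user-specific information.
Long contexts, however, are a challenge for responsive LLM systems, because nothing can be generated until the LLM has processed the entire context.
The context-processing delay can be reduced by reusing a context's KV cache across different inputs, but the KV cache contains large tensors, so fetching it over the network can add extra network delay.
CacheGen is a fast context-loading module for LLM systems.
First, CacheGen uses a custom tensor encoder that exploits the distributional properties of the KV cache to encode it into a more compact bitstream representation, with negligible encoding/decoding overhead.
This reduces the bandwidth needed to fetch the KV cache.
Second, to keep context-loading delay low and generation quality high, CacheGen adapts its streaming strategy to cope with changes in available bandwidth.
When available bandwidth drops, CacheGen may raise the compression level for part of the context or choose to recompute that part's KV cache on the fly.
CacheGen is tested on four popular LLMs of various sizes and four datasets (662 contexts in total).
Compared to recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3.2x, with negligible impact on LLM response quality in terms of accuracy or perplexity.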
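
The bandwidth adaptation described above can be illustrated with a small sketch. This is not CacheGen's actual algorithm: the function name `choose_option`, the per-level sizes, the quality scores, and the deadline are all hypothetical values chosen for illustration. It only shows the kind of per-chunk decision the abstract describes, namely picking a compression level that fits the available bandwidth, or falling back to recomputing the KV cache.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ChunkOption:
    name: str       # "level-N" for an encoded KV cache, or "recompute"
    delay_s: float  # estimated time until the chunk's KV cache is usable
    quality: float  # hypothetical quality score in [0, 1], higher is better


def choose_option(encoded_bytes: Dict[int, int],
                  quality: Dict[int, float],
                  bandwidth_bytes_per_s: float,
                  recompute_s: float,
                  deadline_s: float) -> ChunkOption:
    """Pick the best way to load one context chunk (illustrative only)."""
    # One option per compression level: delay is transfer time at current bandwidth.
    options = [
        ChunkOption(f"level-{lvl}", size / bandwidth_bytes_per_s, quality[lvl])
        for lvl, size in encoded_bytes.items()
    ]
    # Assumption: recomputing from the raw text loses no quality but costs compute time.
    options.append(ChunkOption("recompute", recompute_s, 1.0))

    feasible = [o for o in options if o.delay_s <= deadline_s]
    if feasible:
        # Highest quality first; break ties by the smaller delay.
        return max(feasible, key=lambda o: (o.quality, -o.delay_s))
    # Nothing meets the deadline: take whatever finishes soonest.
    return min(options, key=lambda o: o.delay_s)


# Hypothetical chunk: higher levels compress more and lose more quality.
sizes = {0: 40_000_000, 1: 20_000_000, 2: 10_000_000}
scores = {0: 0.99, 1: 0.98, 2: 0.95}

for bw in (1e8, 3e7, 1e6):  # bytes per second
    choice = choose_option(sizes, scores, bw, recompute_s=1.5, deadline_s=0.5)
    print(f"{bw:.0e} B/s -> {choice.name}")
# 1e+08 B/s -> level-0
# 3e+07 B/s -> level-2
# 1e+06 B/s -> recompute
```

As bandwidth falls, the sketch moves from the least-compressed encoding to a more aggressive one, and finally to recomputation once no encoded version can arrive faster than the model could rebuild the cache.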

Abstract (original)

As large language models (LLMs) take on complex tasks, their inputs are supplemented with longer contexts that incorporate domain knowledge or user-specific information. Yet using long contexts poses a challenge for responsive LLM systems, as nothing can be generated until the whole context is processed by the LLM. While the context-processing delay can be reduced by reusing the KV cache of a context across different inputs, fetching the KV cache, which contains large tensors, over the network can cause extra network delays. CacheGen is a fast context-loading module for LLM systems. First, CacheGen uses a custom tensor encoder, which embraces KV cache’s distributional properties, to encode a KV cache into more compact bitstream representations with negligible encoding/decoding overhead. This reduces the bandwidth demand to fetch the KV cache. Second, to maintain low context-loading delay and high generation quality, CacheGen adapts the streaming strategies to cope with changes in available bandwidth. When available bandwidth drops, CacheGen may raise the compression level for a part of the context or choose to recompute its KV cache on the fly. We test CacheGen on four popular LLMs of various sizes and four datasets (662 contexts in total). Compared to the recent systems that reuse the KV cache, CacheGen reduces the KV cache size by 3.7-4.3x and the total delay in fetching and processing contexts by 2.7-3.2x while having negligible impact on the LLM response quality in accuracy or perplexity.
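
To make the network cost concrete, here is a back-of-the-envelope estimate of KV cache size and transfer time. The model dimensions, context length, and link speed are hypothetical and do not come from the paper. The size formula is the standard one for a transformer KV cache (one key and one value tensor per layer), and the 3.7x factor is the lower end of the size reduction reported in the abstract.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_value: int = 2) -> int:
    """Size of an uncompressed KV cache: K and V tensors for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_value


def transfer_seconds(size_bytes: float, bandwidth_gbps: float) -> float:
    """Time to move size_bytes over a link of bandwidth_gbps gigabits per second."""
    return size_bytes * 8 / (bandwidth_gbps * 1e9)


# Hypothetical model: 32 layers, 32 KV heads of dimension 128, fp16 values,
# and an 8,000-token context fetched over a 10 Gbps link.
raw = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, num_tokens=8000)
print(f"raw size: {raw / 1e9:.2f} GB")                                  # 4.19 GB
print(f"raw transfer: {transfer_seconds(raw, 10):.2f} s")               # 3.36 s
print(f"after 3.7x reduction: {transfer_seconds(raw / 3.7, 10):.2f} s")  # 0.91 s
```

Under these assumptions the uncompressed cache takes several seconds to fetch, which is why shrinking it matters for response latency.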

arXiv information

Authors	Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang
Published	2024-03-14 17:58:52+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
