Marconi: Prefix Caching for the Era of Hybrid LLMs

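Summary

Hybrid models, which combine the language modeling capabilities of attention layers with the efficiency of recurrent layers (such as state space models), are gaining traction as a practical way to support long contexts in large language model serving.
However, the unique properties of these models complicate the use of complementary efficiency optimizations such as prefix caching, which skips redundant computation across requests.
Most notably, because recurrent layers update their state in place, cache entries cannot be rolled back for partial sequence overlaps; only exact-match cache hits are possible.
As a result, each sequence produces a flood of (large) cache entries, most of which offer minimal reuse opportunities.
We present Marconi, the first system to support efficient prefix caching with hybrid LLMs.
Key to Marconi are novel admission and eviction policies that assess potential cache entries more judiciously, based not only on recency but also on (1) forecasts of reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to their memory footprints.
Across diverse workloads and hybrid models, Marconi achieves up to 34.4× higher token hit rates (71.1% or 617 ms lower TTFT) than state-of-the-art prefix caching systems.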

Summary (original)

Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across requests. Most notably, their use of in-place state updates for recurrent layers precludes rolling back cache entries for partial sequence overlaps, and instead mandates only exact-match cache hits; the effect is a deluge of (large) cache entries per sequence, most of which yield minimal reuse opportunities. We present Marconi, the first system that supports efficient prefix caching with Hybrid LLMs. Key to Marconi are its novel admission and eviction policies that more judiciously assess potential cache entries based not only on recency, but also on (1) forecasts of their reuse likelihood across a taxonomy of different hit scenarios, and (2) the compute savings that hits deliver relative to memory footprints. Across diverse workloads and Hybrid models, Marconi achieves up to 34.4$\times$ higher token hit rates (71.1% or 617 ms lower TTFT) compared to state-of-the-art prefix caching systems.
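
The abstract describes an eviction policy that weighs recency against the compute savings a hit delivers relative to its memory footprint, but it does not give a formula. The following is a minimal sketch of one way such a score could be combined, not Marconi's actual implementation. The names `CacheEntry`, `pick_victim`, and the weight `alpha` are hypothetical, as are the min-max normalization and the linear blend.

```python
from dataclasses import dataclass


@dataclass
class CacheEntry:
    key: str
    last_access: float   # timestamp of the most recent hit or insertion
    flops_saved: float   # assumed estimate of compute skipped on a hit
    memory_bytes: int    # memory the entry occupies


def _normalize(values):
    """Min-max scale values into [0, 1]; all-equal inputs map to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_victim(entries, alpha=0.5):
    """Return the key of the entry with the lowest combined utility.

    alpha=1.0 reduces to plain LRU; alpha=0.0 ranks purely by
    compute saved per byte of cache memory (an assumed metric).
    """
    if not entries:
        raise ValueError("cache is empty")
    recency = _normalize([e.last_access for e in entries])
    efficiency = _normalize([e.flops_saved / e.memory_bytes for e in entries])
    scores = [alpha * r + (1.0 - alpha) * f for r, f in zip(recency, efficiency)]
    victim_index = min(range(len(entries)), key=scores.__getitem__)
    return entries[victim_index].key
```

With `alpha=1.0` the oldest entry is evicted regardless of how much compute it saves; lowering `alpha` protects old entries that save a lot of compute per byte, which is the trade-off the abstract points to.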

arXiv information

Authors	Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali
Published	2024-12-04 18:40:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
