Prompt Cache: Modular Attention Reuse for Low-Latency Inference

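Summary

We present Prompt Cache, an approach that accelerates large language model (LLM) inference by reusing attention states across different LLM prompts.
Many input prompts contain overlapping text segments, such as system messages, prompt templates, and documents provided for context.
Our key insight is that by precomputing the attention states of these frequently occurring text segments and storing them on the inference server, they can be reused efficiently whenever the segments appear in a user prompt.
Prompt Cache uses a schema to explicitly define such reusable text segments, which are called prompt modules.
The schema ensures positional accuracy when attention states are reused and gives users an interface for accessing cached states in their prompts.
We evaluate Prompt Cache across several LLMs using a prototype implementation.
We show that Prompt Cache significantly reduces time-to-first-token latency, especially for longer prompts such as document-based question answering and recommendations.
The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, while maintaining output accuracy and requiring no changes to model parameters.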

Summary (original)

We present Prompt Cache, an approach for accelerating inference for large language models (LLMs) by reusing attention states across different LLM prompts. Many input prompts have overlapping text segments, such as system messages, prompt templates, and documents provided for context. Our key insight is that by precomputing and storing the attention states of these frequently occurring text segments on the inference server, we can efficiently reuse them when these segments appear in user prompts. Prompt Cache employs a schema to explicitly define such reusable text segments, called prompt modules. The schema ensures positional accuracy during attention state reuse and provides users with an interface to access cached states in their prompt. Using a prototype implementation, we evaluate Prompt Cache across several LLMs. We show that Prompt Cache significantly reduces latency in time-to-first-token, especially for longer prompts such as document-based question answering and recommendations. The improvements range from 8x for GPU-based inference to 60x for CPU-based inference, all while maintaining output accuracy and without the need for model parameter modifications.
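
The reuse idea in the abstract can be illustrated with a toy sketch. This is not the authors' prototype: `PromptModule`, `ToyPromptCache`, and `toy_attention_states` are hypothetical names, and the "attention state" is a made-up integer standing in for the per-layer key/value tensors a real model would produce. The sketch assumes that each module is assigned a fixed start position, which is one simple way to read "positional accuracy", and that tokens are whitespace-separated words.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PromptModule:
    """Hypothetical reusable text segment with a fixed start position."""
    name: str
    text: str
    start_pos: int


def toy_attention_states(tokens, start_pos):
    """Stand-in for a model's key/value computation (illustrative only).

    Returns one (position, token, fake_state) tuple per token.
    """
    return [
        (start_pos + i, tok, sum(ord(c) for c in tok) * 31 + start_pos + i)
        for i, tok in enumerate(tokens)
    ]


@dataclass
class ToyPromptCache:
    """Stores precomputed states per module and counts computed tokens."""
    states: dict = field(default_factory=dict)
    tokens_computed: int = 0

    def _encode(self, text, start_pos):
        tokens = text.split()  # assumption: whitespace tokenization
        self.tokens_computed += len(tokens)
        return toy_attention_states(tokens, start_pos)

    def register(self, module):
        """Precompute and store the states of a module once."""
        self.states[module.name] = self._encode(module.text, module.start_pos)

    def build_prompt(self, module_names, user_text):
        """Reuse cached module states; compute only the new user text."""
        reused = []
        next_pos = 0
        for name in module_names:
            cached = self.states[name]
            reused.extend(cached)
            if cached:
                next_pos = max(next_pos, cached[-1][0] + 1)
        return reused + self._encode(user_text, next_pos)


cache = ToyPromptCache()
cache.register(PromptModule("system", "You are a helpful assistant", 0))
cache.register(PromptModule("doc", "Prompt Cache reuses attention states", 5))
states = cache.build_prompt(["system", "doc"], "What does it reuse ?")
print(len(states), cache.tokens_computed)
```

In this toy setup, building the prompt only computes states for the five tokens of new user text, while the ten module tokens are served from the cache. That is the source of the time-to-first-token savings the abstract describes, though the real speedups depend on the model and hardware.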

arXiv information

Authors	In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Published	2024-04-25 15:45:19+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
