Efficient Prompt Compression with Evaluator Heads for Long-Context Transformer Inference

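Summary

Applications with long-context inputs are crucial for making effective use of large language models (LLMs), but they also increase computational cost and degrade performance.
To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within the compressed prompt.
We identify specific attention heads in transformer-based LLMs that can select the tokens in a long input that matter most for inference, and we designate them evaluator heads.
Building on this finding, we develop EHPC, an evaluator head-based prompt compression method. It lets an LLM rapidly skim the input prompt using only the first few layers with evaluator heads during the pre-filling stage, and then passes only the important tokens to the model for inference.
EHPC achieves state-of-the-art results on two mainstream benchmarks: prompt compression and long-context inference acceleration.
As a result, it effectively reduces the complexity and cost associated with commercial API calls.
We further show that EHPC achieves competitive results compared with key-value cache-based acceleration methods, highlighting its potential to improve the efficiency of LLMs on long-context tasks.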

Summary (original)

Although applications involving long-context inputs are crucial for the effective utilization of large language models (LLMs), they also result in increased computational costs and reduced performance. To address this challenge, we propose an efficient, training-free prompt compression method that retains key information within compressed prompts. We identify specific attention heads in transformer-based LLMs, which we designate as evaluator heads, that are capable of selecting tokens in long inputs that are most significant for inference. Building on this discovery, we develop EHPC, an Evaluator Head-based Prompt Compression method, which enables LLMs to rapidly ‘skim through’ input prompts by leveraging only the first few layers with evaluator heads during the pre-filling stage, subsequently passing only the important tokens to the model for inference. EHPC achieves state-of-the-art results across two mainstream benchmarks: prompt compression and long-context inference acceleration. Consequently, it effectively reduces the complexity and costs associated with commercial API calls. We further demonstrate that EHPC attains competitive results compared to key-value cache-based acceleration methods, thereby highlighting its potential to enhance the efficiency of LLMs for long-context tasks.
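The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that attention weights from the evaluator heads are already available as nested lists indexed `attn[head][query][key]`, and that a token's importance is the sum of the attention it receives across those heads and query positions. The abstract does not specify the aggregation rule, so that choice, along with the names `score_tokens` and `compress_prompt` and the toy attention values, is hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def score_tokens(attn: Sequence[Sequence[Sequence[float]]]) -> List[float]:
    """Sum the attention each key token receives over all heads and queries.

    attn[h][q][k] is the weight from query position q to key position k
    for evaluator head h (an assumed layout, not the paper's API).
    """
    n_keys = len(attn[0][0])
    scores = [0.0] * n_keys
    for head in attn:
        for query_row in head:
            for k, weight in enumerate(query_row):
                scores[k] += weight
    return scores


def compress_prompt(tokens: Sequence[str],
                    attn: Sequence[Sequence[Sequence[float]]],
                    keep: int) -> List[str]:
    """Keep the `keep` highest-scoring tokens, preserving original order."""
    scores = score_tokens(attn)
    if len(scores) != len(tokens):
        raise ValueError("attention key length must match number of tokens")
    keep = max(0, min(keep, len(tokens)))
    # Rank positions by score, take the top `keep`, then restore prompt order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:keep])]


# Toy example: one hypothetical evaluator head, two query positions.
tokens = ["The", "capital", "of", "France", "is", "Paris"]
attn = [[
    [0.05, 0.30, 0.05, 0.25, 0.05, 0.30],
    [0.05, 0.20, 0.05, 0.30, 0.05, 0.35],
]]
compressed = compress_prompt(tokens, attn, keep=3)
print(compressed)  # ['capital', 'France', 'Paris']
```

In a real pipeline the attention weights would come from a forward pass through only the early layers of the model, and the kept tokens would then be passed to the full model (or a commercial API) as the compressed prompt.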

arXiv information

Authors	Weizhi Fei, Xueyan Niu, Guoqing Xie, Yingqing Liu, Bo Bai, Wei Han
Published	2025-01-22 15:33:17+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

