LongSpec: Long-Context Speculative Decoding with Efficient Drafting and Verification

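Abstract

Speculative decoding has become a promising technique for mitigating the high inference latency of autoregressive decoding in large language models (LLMs).
Despite this promise, applying speculative decoding effectively in LLMs still faces three key challenges: the growing memory demands of the draft model, the distribution shift between short-context training corpora and long-context inference, and inefficiencies in the attention implementation.
This work improves the performance of speculative decoding in long-context settings by addressing these challenges.
First, we propose a memory-efficient draft model with a constant-sized key-value (KV) cache.
Second, we introduce novel position indices for short training data, enabling seamless adaptation from short-context training to long-context inference.
Finally, we present an attention aggregation method that combines fast implementations for prefix computation with standard attention for tree mask handling, resolving the latency and memory inefficiencies of tree decoding.
Our approach achieves strong results on a range of long-context tasks, including repository-level code completion, long-context summarization, and o1-like long reasoning tasks, with significant reductions in latency.
The code is available at https://github.com/sail-sg/LongSpec.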
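
The attention aggregation idea can be illustrated with a standard identity: softmax attention over two disjoint key blocks can be computed per block and then merged exactly, as long as each block also reports the log-sum-exp of its scores. The sketch below is an assumption-laden illustration, not the LongSpec implementation: the abstract does not spell out the aggregation rule, and the names `partial_attention`, `merge`, and the four-node `tree_mask` are hypothetical. It uses plain Python so that it runs without dependencies.

```python
import math
import random

def partial_attention(q, keys, values, visible=None):
    """Attention of one query over one block of keys.

    Returns the softmax-weighted value vector and the log-sum-exp (LSE)
    of the visible scores, which the merge step needs.
    """
    scale = 1.0 / math.sqrt(len(q))
    idx = [j for j in range(len(keys)) if visible is None or visible[j]]
    scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in idx]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    out = [
        sum(w * values[j][c] for w, j in zip(weights, idx)) / denom
        for c in range(len(values[0]))
    ]
    return out, m + math.log(denom)

def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial results into attention over both key blocks."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    wa, wb = math.exp(lse_a - lse), math.exp(lse_b - lse)
    return [wa * a + wb * b for a, b in zip(out_a, out_b)]

random.seed(0)
dim, n_prefix, n_tree = 4, 16, 4

def rand_vec():
    return [random.gauss(0.0, 1.0) for _ in range(dim)]

prefix_k = [rand_vec() for _ in range(n_prefix)]
prefix_v = [rand_vec() for _ in range(n_prefix)]
tree_k = [rand_vec() for _ in range(n_tree)]
tree_v = [rand_vec() for _ in range(n_tree)]
queries = [rand_vec() for _ in range(n_tree)]

# Hypothetical draft tree: node 0 is the root, nodes 1 and 2 are its
# children, node 3 is a child of node 1. Each node sees its ancestors and itself.
tree_mask = [
    [True, False, False, False],
    [True, True, False, False],
    [True, False, True, False],
    [True, True, False, True],
]

max_err = 0.0
for q, row in zip(queries, tree_mask):
    # Prefix part: no mask needed, so a fast unmasked kernel could be used here.
    out_p, lse_p = partial_attention(q, prefix_k, prefix_v)
    # Tree part: small, but needs the custom tree mask.
    out_t, lse_t = partial_attention(q, tree_k, tree_v, row)
    merged = merge(out_p, lse_p, out_t, lse_t)
    # Reference: one masked attention over prefix and tree keys together.
    reference, _ = partial_attention(
        q, prefix_k + tree_k, prefix_v + tree_v, [True] * n_prefix + row
    )
    max_err = max(max_err, max(abs(a - b) for a, b in zip(merged, reference)))
```

The merged result matches the single masked attention up to floating-point error, which is why the long prefix and the short masked tree can be handled by different kernels.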

Abstract (original)

Speculative decoding has become a promising technique to mitigate the high inference latency of autoregressive decoding in Large Language Models (LLMs). Despite its promise, the effective application of speculative decoding in LLMs still confronts three key challenges: the increasing memory demands of the draft model, the distribution shift between the short-training corpora and long-context inference, and inefficiencies in attention implementation. In this work, we enhance the performance of speculative decoding in long-context settings by addressing these challenges. First, we propose a memory-efficient draft model with a constant-sized Key-Value (KV) cache. Second, we introduce novel position indices for short-training data, enabling seamless adaptation from short-context training to long-context inference. Finally, we present an innovative attention aggregation method that combines fast implementations for prefix computation with standard attention for tree mask handling, effectively resolving the latency and memory inefficiencies of tree decoding. Our approach achieves strong results on various long-context tasks, including repository-level code completion, long-context summarization, and o1-like long reasoning tasks, demonstrating significant improvements in latency reduction. The code is available at https://github.com/sail-sg/LongSpec.
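
For readers new to the technique, the draft-then-verify loop that speculative decoding builds on can be sketched as follows. This is a minimal greedy, single-chain version under stated assumptions: the models are toy functions (`toy_target`, `toy_draft`) invented for illustration, verification is by exact match rather than by sampling-based acceptance, and the target is called once per position for clarity, whereas a real system scores all drafted positions in one forward pass. It does not model LongSpec's tree drafting or its draft architecture.

```python
from typing import Callable, List

# Hypothetical model interface: maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]

def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    goal = len(prompt) + max_new
    while len(tokens) < goal:
        # 1. Draft: the cheap model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))

        # 2. Verify: keep drafted tokens while they match the target's choice.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: take the target's token
                break
            accepted.append(tok)
        else:
            # Every drafted token matched, so the target contributes one more.
            accepted.append(target(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:goal]

def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(max_new):
        tokens.append(model(tokens))
    return tokens

# Toy stand-ins for the two models (assumptions for illustration only).
def toy_target(prefix: List[int]) -> int:
    return (3 * prefix[-1] + len(prefix)) % 11

def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    # Disagree with the target whenever the prefix length is a multiple of 4.
    return guess if len(prefix) % 4 else (guess + 1) % 11

spec_out = speculative_decode(toy_target, toy_draft, [1, 2], max_new=20)
ref_out = greedy_decode(toy_target, [1, 2], max_new=20)
```

With greedy verification, `spec_out` is identical to `ref_out`: the draft model only changes how many target steps can be confirmed per round, not which tokens are produced.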

arXiv information

Authors	Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, Bo An
Published	2025-02-24 18:53:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
