SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

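Abstract

Large language models (LLMs) perform exceptionally well across a wide range of natural language processing tasks.
However, their substantial size poses considerable challenges, particularly in computational demands and inference speed, because of their quadratic complexity.
In this work, we identify a key pattern: certain seemingly meaningless separator tokens (i.e., punctuation marks) contribute disproportionately to attention scores compared with semantically meaningful tokens.
This observation suggests that the information in the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss.
Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens.
In addition, we implement efficient kernels for training acceleration.
Experimental results across training-free, training-from-scratch, and post-training settings demonstrate the effectiveness of SepLLM.
Notably, with the Llama-3-8B backbone, SepLLM reduces the KV cache by more than 50% on the GSM8K-CoT benchmark while maintaining comparable performance.
Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.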

Abstract (original)

Large Language Models (LLMs) have exhibited exceptional performance across a spectrum of natural language processing tasks. However, their substantial sizes pose considerable challenges, particularly in computational demands and inference speed, due to their quadratic complexity. In this work, we have identified a key pattern: certain seemingly meaningless separator tokens (i.e., punctuations) contribute disproportionately to attention scores compared to semantically meaningful tokens. This observation suggests that information of the segments between these separator tokens can be effectively condensed into the separator tokens themselves without significant information loss. Guided by this insight, we introduce SepLLM, a plug-and-play framework that accelerates inference by compressing these segments and eliminating redundant tokens. Additionally, we implement efficient kernels for training acceleration. Experimental results across training-free, training-from-scratch, and post-training settings demonstrate SepLLM’s effectiveness. Notably, using the Llama-3-8B backbone, SepLLM achieves over 50% reduction in KV cache on the GSM8K-CoT benchmark while maintaining comparable performance. Furthermore, in streaming settings, SepLLM effectively processes sequences of up to 4 million tokens or more while maintaining consistent language modeling capabilities.
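
The idea of condensing each segment into its separator can be illustrated with a small sketch of which token positions might stay in a KV cache. This is an illustration under stated assumptions, not the SepLLM implementation: the abstract says only that segments are compressed into separator tokens, so the separator set, the function names, and the `n_initial` and `n_local` parameters (keeping a few leading tokens and a short recent window) are hypothetical choices made here. The sketch does not model attention or training.

```python
# Hypothetical separator set; the abstract only says "punctuations".
SEPARATORS = {".", ",", "?", "!", ";", ":"}


def retained_positions(tokens, n_initial=1, n_local=2, separators=SEPARATORS):
    """Return sorted positions that a separator-based cache policy would keep.

    Assumed policy (not taken from the source): keep the first `n_initial`
    tokens, every separator token, and the last `n_local` tokens.
    """
    n = len(tokens)
    keep = set(range(min(n_initial, n)))                              # leading tokens
    keep.update(i for i, t in enumerate(tokens) if t in separators)   # separators
    keep.update(range(max(0, n - n_local), n))                        # recent window
    return sorted(keep)


def cache_reduction(tokens, **kwargs):
    """Fraction of positions dropped under the assumed policy."""
    if not tokens:
        return 0.0
    kept = retained_positions(tokens, **kwargs)
    return 1.0 - len(kept) / len(tokens)


if __name__ == "__main__":
    demo = ["The", "cat", "sat", ",", "then", "it", "slept", ".",
            "Dogs", "bark", "loudly"]
    print(retained_positions(demo))
    print(round(cache_reduction(demo), 3))
```

With this toy policy the cache grows with the number of separators and the window size rather than with the full sequence length, which is the intuition behind the reported KV cache reduction; the actual savings depend on the real method and data.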

arXiv information

Authors	Guoxuan Chen, Han Shi, Jiawei Li, Yihang Gao, Xiaozhe Ren, Yimeng Chen, Xin Jiang, Zhenguo Li, Weiyang Liu, Chao Huang
Published	2025-02-24 15:42:59+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
