LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

Abstract

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers (those focusing on recent or initial tokens) and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks, or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and achieves 53.3\% on the AIME24 math benchmark with the advanced o1-like long reasoning model QwQ-STILL.
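
The abstract does not spell out how lazy layers are scored, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes a layer's "laziness" can be measured as the share of attention that the last query token places on the first `n_sink` tokens plus the most recent `window` tokens, and that streaming attention keeps exactly those positions. The names `lazy_ratio`, `pick_lazy_layers`, `streaming_mask`, `n_sink`, and `window` are hypothetical.

```python
import numpy as np

def lazy_ratio(attn, n_sink=4, window=8):
    """Assumed laziness score for one layer.

    attn: attention weights of shape (heads, seq_len, seq_len), rows sum to 1.
    Returns the head-averaged attention mass that the last query token puts
    on the first `n_sink` keys plus the most recent `window` keys.
    """
    last_row = attn[:, -1, :]                       # (heads, seq_len)
    seq_len = last_row.shape[-1]
    keep = np.zeros(seq_len, dtype=bool)
    keep[:n_sink] = True                            # initial tokens
    keep[max(0, seq_len - window):] = True          # recent tokens
    return float(last_row[:, keep].sum(axis=-1).mean())

def pick_lazy_layers(per_layer_attn, fraction=0.5, n_sink=4, window=8):
    """Return indices of the `fraction` of layers with the highest score."""
    scores = [lazy_ratio(a, n_sink, window) for a in per_layer_attn]
    k = int(len(scores) * fraction)
    order = np.argsort(scores)[::-1]                # highest score first
    return sorted(int(i) for i in order[:k])

def streaming_mask(seq_len, n_sink=4, window=8):
    """Boolean mask, True where query i may attend to key j.

    Each query sees the first `n_sink` keys and the last `window` keys
    (including itself), subject to causality.
    """
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    recent = (i - j) < window
    sink = j < n_sink
    return causal & (recent | sink)
```

Under this sketch, a layer switched to streaming attention needs to cache at most `n_sink + window` key-value pairs per head rather than one per context token, which is where the memory saving for the lazy layers would come from.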

arXiv information

Authors	Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin
Published	2025-02-04 13:45:37+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
