Position-Aware Depth Decay Decoding ($D^3$): Boosting Large Language Model Inference Efficiency

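Summary

Because of their large parameter counts, the inference phase of Large Language Models (LLMs) is resource-intensive.
Unlike traditional model compression, which requires retraining, recent dynamic computation methods show that not every component is needed for inference, which makes a training-free pipeline possible.
This paper focuses on dynamic depth in LLM generation.
It proposes a token-position-aware layer-skipping framework that reduces operations by a factor of 1.5 while maintaining performance.
The authors first observe that tokens predicted later have lower perplexity and therefore require less computation.
They then propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which uses a power-law decay function, $\left\lfloor L \times (\alpha^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$.
Without any retraining, $D^3$ succeeds across a wide range of generation tasks for the first time.
Experiments on large language models (i.e., Llama) with $7 \sim 70$ billion parameters show that $D^3$ achieves an average 1.5x speedup over the full-inference pipeline while maintaining comparable performance, with almost no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.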

Summary (original)

Due to the large number of parameters, the inference phase of Large Language Models (LLMs) is resource-intensive. Unlike traditional model compression, which needs retraining, recent dynamic computation methods show that not all components are required for inference, enabling a training-free pipeline. In this paper, we focus on the dynamic depth of LLM generation. A token-position aware layer skipping framework is proposed to reduce operations by a factor of 1.5 while maintaining performance. We first observed that tokens predicted later have lower perplexity and thus require less computation. Then, we propose a training-free algorithm called Position-Aware Depth Decay Decoding ($D^3$), which leverages a power-law decay function, $\left\lfloor L \times (\alpha^i) \right\rfloor$, to determine the number of layers to retain when generating token $T_i$. Remarkably, without any retraining, $D^3$ achieves success across a wide range of generation tasks for the first time. Experiments on large language models (i.e., Llama) with $7 \sim 70$ billion parameters show that $D^3$ can achieve an average 1.5x speedup compared with the full-inference pipeline while maintaining comparable performance with nearly no performance drop ($<1\%$) on the GSM8K and BBH benchmarks.
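
The decay function in the abstract can be illustrated with a few lines of Python. This is only a sketch of the formula $\left\lfloor L \times (\alpha^i) \right\rfloor$, not the authors' implementation. The function names `layers_to_retain` and `depth_schedule` are hypothetical, the choice to start the token index $i$ at 0 is an assumption, and the abstract does not say which layers are skipped once the budget is known, so that step is left out.

```python
import math


def layers_to_retain(total_layers: int, alpha: float, position: int) -> int:
    """Return floor(L * alpha**i), the layer budget for the token at index i.

    total_layers is L, alpha is the decay base, position is i.
    Assumption: i starts at 0, so the first generated token uses all L layers.
    """
    if total_layers <= 0:
        raise ValueError("total_layers must be positive")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if position < 0:
        raise ValueError("position must be non-negative")
    return math.floor(total_layers * alpha ** position)


def depth_schedule(total_layers: int, alpha: float, num_tokens: int) -> list[int]:
    """Layer budgets for the first num_tokens generated tokens."""
    return [layers_to_retain(total_layers, alpha, i) for i in range(num_tokens)]


if __name__ == "__main__":
    # alpha = 0.5 is chosen only to make the arithmetic easy to follow;
    # it is not a value taken from the paper.
    print(depth_schedule(32, 0.5, 4))  # [32, 16, 8, 4]
```

With `alpha = 1.0` the schedule degenerates to full-depth inference for every token, which gives a convenient baseline when comparing against a decayed schedule.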

arXiv information

Authors	Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, Yequan Wang
Published	2025-03-11 15:15:54+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
