Accelerating Inference in Large Language Models with a Unified Layer Skipping Strategy

Summary

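Dynamic computation methods have recently achieved notable speedups for large language models (LLMs) by skipping the computation of some layers, guided by elaborate heuristics or additional predictors.
However, during decoding, existing approaches assign different computational budgets to different samples, so they cannot guarantee a stable and precise speedup.
Moreover, existing approaches generally skip multiple contiguous layers at the bottom or top of the layer stack, which drastically changes the model's layer-wise representations and therefore degrades performance.
We therefore propose a Unified Layer Skipping strategy, which chooses the number of layers to skip based solely on the target speedup ratio and then skips that many intermediate layers in a balanced manner.
Because the Unified Layer Skipping strategy is independent of the input samples, it naturally supports popular acceleration techniques such as batch decoding and KV caching, which makes it more practical for real-world applications.
Experimental results on two common tasks, machine translation and text summarization, show that for a given target speedup ratio the Unified Layer Skipping strategy significantly improves both inference performance and actual model throughput over existing dynamic approaches.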
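The layer-selection idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact selection rule, so the sketch assumes that every layer costs roughly the same (so the speedup is about the total layer count divided by the number of layers kept), that the first and last layers are always kept, and that the kept layers are spread evenly across the stack. The function name `select_kept_layers` and its parameters are hypothetical.

```python
from typing import List


def select_kept_layers(num_layers: int, target_speedup: float) -> List[int]:
    """Return the indices of the layers to execute (illustrative sketch).

    Assumptions (not taken from the paper): uniform per-layer cost,
    first and last layers always kept, kept layers evenly spaced.
    """
    if num_layers < 2:
        raise ValueError("num_layers must be at least 2")
    if target_speedup < 1.0:
        raise ValueError("target_speedup must be >= 1.0")

    # With uniform layer cost, speedup ~= num_layers / num_kept.
    num_kept = round(num_layers / target_speedup)
    num_kept = max(2, min(num_layers, num_kept))

    # Spread num_kept indices evenly over [0, num_layers - 1].
    # Integer arithmetic computes floor(x + 0.5) without float rounding issues.
    span = num_layers - 1
    gaps = num_kept - 1
    return [(2 * i * span + gaps) // (2 * gaps) for i in range(num_kept)]


# The selection depends only on the layer count and the target ratio,
# never on the input, so every sample in a batch uses the same layers.
print(select_kept_layers(8, 2.0))  # -> [0, 2, 5, 7]
```

Because the returned list is the same for every input, a fixed set of layers is executed at every decoding step, which is why such a scheme stays compatible with batch decoding and KV caching.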

Abstract (original)

Recently, dynamic computation methods have shown notable acceleration for Large Language Models (LLMs) by skipping several layers of computations through elaborate heuristics or additional predictors. However, in the decoding process of existing approaches, different samples are assigned different computational budgets, which cannot guarantee a stable and precise acceleration effect. Furthermore, existing approaches generally skip multiple contiguous layers at the bottom or top of the layers, leading to a drastic change in the model’s layer-wise representations, and thus a consequent performance degeneration. Therefore, we propose a Unified Layer Skipping strategy, which selects the number of layers to skip computation based solely on the target speedup ratio, and then skips the corresponding number of intermediate layer computations in a balanced manner. Since the Unified Layer Skipping strategy is independent of input samples, it naturally supports popular acceleration techniques such as batch decoding and KV caching, thus demonstrating more practicality for real-world applications. Experimental results on two common tasks, i.e., machine translation and text summarization, indicate that given a target speedup ratio, the Unified Layer Skipping strategy significantly enhances both the inference performance and the actual model throughput over existing dynamic approaches.

arXiv information

Authors	Yijin Liu, Fandong Meng, Jie Zhou
Published	2024-04-10 12:12:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
