LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding

Abstract

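We introduce LayerSkip, an end-to-end solution for speeding up inference in large language models (LLMs).
First, during training, we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, together with an early exit loss in which all transformer layers share the same exit.
Second, during inference, we show that this training recipe improves the accuracy of early exit at earlier layers without adding any auxiliary layers or modules to the model.
Third, we propose a novel self-speculative decoding solution that exits at an early layer and then verifies and corrects with the remaining layers of the model.
Our proposed self-speculative decoding approach has a smaller memory footprint than other speculative decoding approaches and benefits from compute and activations shared between the draft and verification stages.
We run experiments on different Llama model sizes across different types of training: pretraining from scratch, continual pretraining, finetuning on a specific data domain, and finetuning on a specific task.
We implement our inference solution and observe speedups of up to 2.16x on summarization of CNN/DM documents, 1.82x on coding, and 2.0x on the TOPv2 semantic parsing task.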
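
The depth-dependent layer dropout mentioned in the abstract can be illustrated with a minimal sketch. The abstract only states that earlier layers get lower dropout rates and later layers get higher ones, so the linear ramp, the function names `layer_dropout_rates` and `forward_with_layer_dropout`, and the `max_rate` parameter below are all assumptions made for illustration and are not taken from the paper.

```python
import random


def layer_dropout_rates(num_layers, max_rate):
    """Per-layer skip probability.

    Assumed schedule (not from the paper): a linear ramp from 0.0 at the
    first layer up to max_rate at the last layer.
    """
    if num_layers == 1:
        return [max_rate]
    return [max_rate * i / (num_layers - 1) for i in range(num_layers)]


def forward_with_layer_dropout(x, layers, rates, rng, training=True):
    """Run x through the layers, randomly skipping layers during training.

    A skipped layer leaves x unchanged, which mimics passing the residual
    stream straight through. At inference time no layer is skipped.
    """
    for layer, skip_prob in zip(layers, rates):
        if training and rng.random() < skip_prob:
            continue  # layer dropped for this forward pass
        x = layer(x)
    return x


# Toy usage: four "layers" that each add 1 to a number.
toy_layers = [lambda v: v + 1 for _ in range(4)]
toy_rates = layer_dropout_rates(num_layers=4, max_rate=0.3)
toy_output = forward_with_layer_dropout(0, toy_layers, toy_rates, random.Random(0))
```

Because later layers are skipped more often during training, the model is pushed to produce usable representations at shallower depths, which is what the early exit at inference time relies on.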

Abstract (original)

We present LayerSkip, an end-to-end solution to speed-up inference of large language models (LLMs). First, during training we apply layer dropout, with low dropout rates for earlier layers and higher dropout rates for later layers, and an early exit loss where all transformer layers share the same exit. Second, during inference, we show that this training recipe increases the accuracy of early exit at earlier layers, without adding any auxiliary layers or modules to the model. Third, we present a novel self-speculative decoding solution where we exit at early layers and verify and correct with remaining layers of the model. Our proposed self-speculative decoding approach has less memory footprint than other speculative decoding approaches and benefits from shared compute and activations of the draft and verification stages. We run experiments on different Llama model sizes on different types of training: pretraining from scratch, continual pretraining, finetuning on specific data domain, and finetuning on specific task. We implement our inference solution and show speedups of up to 2.16x on summarization for CNN/DM documents, 1.82x on coding, and 2.0x on TOPv2 semantic parsing task.
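
The draft-then-verify loop behind self-speculative decoding can be sketched in a few lines. This is a toy, greedy version written under stated assumptions: `draft_next` stands in for an early-exit prediction and `full_next` for a prediction using all layers, both are hypothetical callables that map a token list to the next token, and verification is done with sequential calls for clarity. A real implementation would verify the whole draft in one batched pass and reuse the early layers' compute and activations, which this sketch does not model.

```python
def self_speculative_decode(prompt, draft_next, full_next, num_new_tokens, draft_len):
    """Greedy draft-and-verify decoding (illustrative sketch only).

    draft_next / full_next: hypothetical callables, token list -> next token.
    The result matches plain greedy decoding with full_next alone.
    """
    tokens = list(prompt)
    target_len = len(prompt) + num_new_tokens
    while len(tokens) < target_len:
        # Draft stage: propose draft_len tokens with the cheap early-exit path.
        draft = []
        for _ in range(draft_len):
            draft.append(draft_next(tokens + draft))

        # Verify stage: keep draft tokens while they agree with the full model.
        accepted = []
        for tok in draft:
            expected = full_next(tokens + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # correct the first mismatch, drop the rest
                break
        else:
            # Every draft token was accepted; the full model adds one more token.
            accepted.append(full_next(tokens + accepted))

        tokens.extend(accepted)
    return tokens[:target_len]


# Toy usage: the "full model" counts upward modulo 10; the "draft" is
# sometimes wrong, so some drafted tokens get rejected and corrected.
def toy_full(tokens):
    return (tokens[-1] + 1) % 10


def toy_draft(tokens):
    if tokens[-1] % 3 == 0:
        return (tokens[-1] + 2) % 10  # deliberate draft error
    return (tokens[-1] + 1) % 10


decoded = self_speculative_decode([0], toy_draft, toy_full, num_new_tokens=8, draft_len=3)
```

The speedup in such a scheme depends on how often the draft agrees with the full model: each accepted draft token saves a full-depth decoding step, while each rejection costs only the wasted draft work.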

arXiv information

Authors	Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, Ahmed A Aly, Beidi Chen, Carole-Jean Wu
Published	2024-04-29 15:02:36+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
