FFSplit: Split Feed-Forward Network For Optimizing Accuracy-Efficiency Trade-off in Language Model Inference

Abstract

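The many parameters of pretrained language models improve their performance but also make them resource-intensive, which makes them hard to deploy on commodity hardware such as a single GPU.
Because such devices have limited memory and power, model compression techniques are often used to reduce both model size and inference latency.
This usually entails a trade-off between model accuracy and efficiency.
Optimizing this balance is therefore essential for deploying LLMs effectively on commodity hardware.
A significant part of the efficiency challenge is the feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ of total parameters and inference latency.
In this paper, we first observe that only a few neurons of the FFN module, known as heavy hitters, have a large output norm for any input token, while the other neurons are triggered sparsely by different tokens.
Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters.
By allocating more resources to the FFN part that contains the heavy hitters, we improve the efficiency-accuracy trade-off of existing compression methods.
In practice, our method reduces model size by 43.1\% and delivers a $1.25\sim1.56\times$ wall-clock speedup on different hardware with a negligible drop in accuracy.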
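
The splitting idea can be illustrated with a small sketch. The abstract does not specify how heavy hitters are scored or how many are kept, so everything below is an assumption made for illustration: a bias-free ReLU FFN, a per-neuron score equal to the mean of |h_i(x)| · ‖W2[:, i]‖ over a sample of tokens, and a hypothetical cutoff `k`. The helper names (`neuron_scores`, `split_ffn`) are not from the paper. Because the FFN output is a sum over hidden neurons, partitioning the neurons into two groups leaves the output unchanged, which the last lines check.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    # W is a list of rows; returns W @ x.
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def ffn(W1, W2, x):
    # Bias-free FFN (an assumption): y = W2 @ relu(W1 @ x).
    # W1 has shape (d_ff, d); W2 has shape (d, d_ff).
    return matvec(W2, relu(matvec(W1, x)))

def neuron_scores(W1, W2, tokens):
    # Assumed scoring rule: mean over tokens of |h_i(x)| * ||W2[:, i]||,
    # i.e. the norm of neuron i's contribution to the FFN output.
    d_ff = len(W1)
    col_norms = [math.sqrt(sum(row[i] ** 2 for row in W2)) for i in range(d_ff)]
    scores = [0.0] * d_ff
    for x in tokens:
        h = relu(matvec(W1, x))
        for i in range(d_ff):
            scores[i] += abs(h[i]) * col_norms[i] / len(tokens)
    return scores

def split_ffn(W1, W2, scores, k):
    # Partition neurons into the k highest-scoring ones and the rest.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    heavy, rest = sorted(order[:k]), sorted(order[k:])

    def take(idx):
        # Keep rows idx of W1 and columns idx of W2.
        return [W1[i] for i in idx], [[row[i] for i in idx] for row in W2]

    return take(heavy), take(rest), heavy

# Toy sizes and random weights, chosen only for the demonstration.
random.seed(0)
d, d_ff, k = 4, 16, 3
W1 = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d_ff)]
W2 = [[random.gauss(0, 1) for _ in range(d_ff)] for _ in range(d)]
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(32)]

scores = neuron_scores(W1, W2, tokens)
(hW1, hW2), (rW1, rW2), heavy = split_ffn(W1, W2, scores, k)

# The two sub-FFNs together reproduce the original output.
x = tokens[0]
full = ffn(W1, W2, x)
parts = [a + b for a, b in zip(ffn(hW1, hW2, x), ffn(rW1, rW2, x))]
max_diff = max(abs(a - b) for a, b in zip(full, parts))
```

Once the two parts exist as separate weight matrices, a compression method could in principle treat them differently, for example compressing the non-heavy-hitter part more aggressively. How the paper allocates resources between the parts is not described in the abstract.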

Abstract (original)

The large number of parameters in Pretrained Language Models enhance their performance, but also make them resource-intensive, making it challenging to deploy them on commodity hardware like a single GPU. Due to the memory and power limitations of these devices, model compression techniques are often used to decrease both the model’s size and its inference latency. This usually results in a trade-off between model accuracy and efficiency. Therefore, optimizing this balance is essential for effectively deploying LLMs on commodity hardware. A significant portion of the efficiency challenge is the Feed-forward network (FFN) component, which accounts for roughly $\frac{2}{3}$ total parameters and inference latency. In this paper, we first observe that only a few neurons of FFN module have large output norm for any input tokens, a.k.a. heavy hitters, while the others are sparsely triggered by different tokens. Based on this observation, we explicitly split the FFN into two parts according to the heavy hitters. We improve the efficiency-accuracy trade-off of existing compression methods by allocating more resource to FFN parts with heavy hitters. In practice, our method can reduce model size by 43.1\% and bring $1.25\sim1.56\times$ wall clock time speedup on different hardware with negligible accuracy drop.
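
The roughly $\frac{2}{3}$ parameter share can be checked with back-of-the-envelope arithmetic. The sketch below assumes a standard Transformer block with four $d \times d$ attention projections and a two-matrix FFN whose hidden width is $4d$, and it ignores biases, layer norms, and embeddings. The abstract states only the fraction, so this configuration and the name `block_param_counts` are assumptions; gated FFN variants with three matrices would give a different count.

```python
def block_param_counts(d_model, ffn_mult=4):
    # Attention: query, key, value, and output projections, each d x d.
    attn = 4 * d_model * d_model
    # FFN: up projection (d x ffn_mult*d) and down projection (ffn_mult*d x d).
    ffn = 2 * d_model * (ffn_mult * d_model)
    return attn, ffn

# d_model = 4096 is an arbitrary example width; the ratio does not depend on it.
attn_params, ffn_params = block_param_counts(4096)
ffn_share = ffn_params / (attn_params + ffn_params)
print(f"FFN share of block parameters: {ffn_share:.3f}")
```

Under these assumptions the FFN holds $8d^2$ of the block's $12d^2$ weights, which matches the fraction quoted in the abstract.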

arXiv information

Authors	Zirui Liu, Qingquan Song, Qiang Charles Xiao, Sathiya Keerthi Selvaraj, Rahul Mazumder, Aman Gupta, Xia Hu
Published	2024-01-08 17:29:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
