Adaptive Pruning for Large Language Models with Structural Importance Awareness

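Summary

Recent advances in large language models (LLMs) have significantly improved language understanding and generation capabilities.
However, their high computational and storage demands make LLMs difficult to deploy on resource-constrained edge devices.
To address this problem, we propose a new LLM pruning method, structurally-aware adaptive pruning (SAAP), which significantly reduces computational and memory costs while maintaining model performance.
First, we define an adaptive importance fusion metric that evaluates the importance of all coupled structures in an LLM by taking their homoscedastic uncertainty into account.
Next, we rank the importance of all modules to determine which specific layers should be pruned to meet particular performance requirements.
In addition, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs.
Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks: zero-shot classification and text generation.
Experimental results show that SAAP outperforms several state-of-the-art baseline methods, achieving accuracy gains of 2.17%, 2.37%, and 2.39% on LLaMA-7B, Vicuna-7B, and LLaMA-13B, respectively.
SAAP also improves token generation speed by 5%, demonstrating its practical advantages in resource-constrained scenarios.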

Summary (original)

The recent advancements in large language models (LLMs) have significantly improved language understanding and generation capabilities. However, it is difficult to deploy LLMs on resource-constrained edge devices due to their high computational and storage resource demands. To address this issue, we propose a novel LLM model pruning method, namely structurally-aware adaptive pruning (SAAP), to significantly reduce the computational and memory costs while maintaining model performance. We first define an adaptive importance fusion metric to evaluate the importance of all coupled structures in LLMs by considering their homoscedastic uncertainty. Then, we rank the importance of all modules to determine the specific layers that should be pruned to meet particular performance requirements. Furthermore, we develop a new group fine-tuning strategy to improve the inference efficiency of LLMs. Finally, we evaluate the proposed SAAP method on multiple LLMs across two common tasks, i.e., zero-shot classification and text generation. Experimental results show that our SAAP method outperforms several state-of-the-art baseline methods, achieving 2.17%, 2.37%, and 2.39% accuracy gains on LLaMA-7B, Vicuna-7B, and LLaMA-13B. Additionally, SAAP improves the token generation speed by 5%, showcasing its practical advantages in resource-constrained scenarios.
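
The abstract describes fusing several importance estimates per structure and then ranking modules to decide what to prune, but it does not give the formulas. The following is only a minimal sketch of that general idea, not the SAAP method itself. It assumes each module has a list of hypothetical importance scores, that each score type has a log-variance term `log_vars[k]` used as an uncertainty weight (`exp(-log_vars[k])`, a common homoscedastic-uncertainty weighting), and that a fixed `prune_ratio` picks how many of the lowest-ranked modules to remove. All names and numbers here are illustrative assumptions.

```python
import math


def fuse_importance(metrics, log_vars):
    """Fuse several importance estimates into one score.

    Assumption: metric k is weighted by exp(-log_vars[k]) (i.e. 1 / sigma_k^2),
    so metrics treated as noisier contribute less. Weights are normalized to
    sum to 1. This weighting is illustrative, not taken from the paper.
    """
    weights = [math.exp(-s) for s in log_vars]
    total = sum(weights)
    return sum(w / total * m for w, m in zip(weights, metrics))


def select_modules_to_prune(module_metrics, log_vars, prune_ratio):
    """Rank modules by fused importance and return the least important ones."""
    fused = {name: fuse_importance(m, log_vars) for name, m in module_metrics.items()}
    ranked = sorted(fused, key=fused.get)  # least important first
    n_prune = int(len(ranked) * prune_ratio)
    return ranked[:n_prune]


# Hypothetical per-module scores from two importance estimators.
module_metrics = {
    "layer_0": [0.90, 0.80],
    "layer_1": [0.20, 0.10],
    "layer_2": [0.60, 0.70],
    "layer_3": [0.05, 0.30],
}
log_vars = [0.0, 0.0]  # equal uncertainty, so the fusion is a plain average

to_prune = select_modules_to_prune(module_metrics, log_vars, prune_ratio=0.5)
print(to_prune)
```

With equal uncertainties the two lowest-scoring modules are selected. Raising one entry of `log_vars` shrinks that estimator's influence on the ranking, which is the role an uncertainty-aware fusion is meant to play in this sketch.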

arXiv information

Authors	Haotian Zheng, Jinke Ren, Yushan Sun, Ruichen Zhang, Wenbo Zhang, Zhen Li, Dusit Niyato, Shuguang Cui, Yatong Han
Published	2024-12-19 18:08:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
