FlexiGPT: Pruning and Extending Large Language Models with Low-Rank Weight Sharing

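Summary

The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a pressing need for techniques that allow efficient deployment on memory-constrained devices without sacrificing performance.
The paper presents an LLM pruning method that selectively prunes model blocks according to an importance score and substitutes each one with a low-parameter replacement.
Specifically, it proposes a principled metric for replacing each pruned block through a weight-sharing mechanism that combines unpruned counterparts from the model with block-specific low-rank adapters.
Learning of these replacement blocks is further aided by output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions.
Empirical evaluations show substantial gains over existing methods, with state-of-the-art performance on 5/6 benchmarks at a 30% compression rate and on 6/6 benchmarks at a 40% compression rate.
The authors also show that the approach can extend smaller models, improving performance on 6/6 benchmarks using only about 0.3% of the tokens of extended training at minimal additional parameter cost.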

Abstract (original)

The rapid proliferation of large language models (LLMs) in natural language processing (NLP) has created a critical need for techniques that enable efficient deployment on memory-constrained devices without compromising performance. We present a method to prune LLMs that selectively prunes model blocks based on an importance score and replaces them with a low-parameter replacement strategy. Specifically, we propose a principled metric to replace each pruned block using a weight-sharing mechanism that leverages unpruned counterparts from the model and block-specific low-rank adapters. Furthermore, we facilitate the learning of these replacement blocks with output feature normalization and an adapter initialization scheme built on low-rank SVD reconstructions. Empirical evaluations demonstrate substantial performance gains over existing methods, achieving state-of-the-art performance on 5/6 benchmarks for a compression rate of 30% and 6/6 benchmarks for a compression rate of 40%. We also demonstrate that our approach can extend smaller models, boosting performance on 6/6 benchmarks using only ~0.3% tokens of extended training with minimal additional parameter costs.
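The abstract does not spell out how the adapters are initialized from low-rank SVD reconstructions, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes that a pruned block's weight matrix is replaced by a shared weight taken from an unpruned block plus a low-rank correction `B @ A`, and that `B` and `A` are initialized from a truncated SVD of the difference between the two weights. The function name `init_low_rank_adapter`, the matrix shapes, the rank, and the random weights are all hypothetical and chosen purely for illustration.

```python
import numpy as np


def init_low_rank_adapter(w_pruned, w_shared, rank):
    """Hypothetical helper: factor the residual (w_pruned - w_shared) into
    B @ A with a truncated SVD, so that w_shared + B @ A approximates w_pruned."""
    residual = w_pruned - w_shared
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    sqrt_s = np.sqrt(s[:rank])
    b = u[:, :rank] * sqrt_s            # shape (d_out, rank)
    a = sqrt_s[:, None] * vt[:rank, :]  # shape (rank, d_in)
    return b, a


# Toy weights (assumption): the pruned block is a perturbed copy of the shared one.
rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 48, 8

w_shared = rng.standard_normal((d_out, d_in))
w_pruned = w_shared + 0.1 * rng.standard_normal((d_out, d_in))

b, a = init_low_rank_adapter(w_pruned, w_shared, rank)

# Reconstruction error with and without the low-rank correction (Frobenius norm).
err_shared = np.linalg.norm(w_pruned - w_shared)
err_adapted = np.linalg.norm(w_pruned - (w_shared + b @ a))

# Parameters stored for the replaced block: the adapter only, since w_shared is reused.
full_params = d_out * d_in
adapter_params = rank * (d_out + d_in)

print(f"error with shared weights only: {err_shared:.3f}")
print(f"error with shared weights + adapter: {err_adapted:.3f}")
print(f"parameters: full block {full_params}, adapter {adapter_params}")
```

Under these assumptions, the truncated SVD gives the best rank-`rank` approximation of the residual in the Frobenius norm, so the adapted reconstruction starts closer to the pruned block than the shared weights alone while storing far fewer parameters. The importance score and the output feature normalization mentioned in the abstract are not modeled here.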

arXiv information

Authors	James Seale Smith, Chi-Heng Lin, Shikhar Tuli, Haris Jeelani, Shangqian Gao, Yilin Shen, Hongxia Jin, Yen-Chang Hsu
Published	2025-01-31 17:38:07+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
