Delta-CoMe: Training-Free Delta-Compression with Mixed-Precision for Large Language Models

Abstract

Fine-tuning is a crucial process for adapting large language models (LLMs) to diverse applications. In certain scenarios, such as multi-tenant serving, deploying multiple LLMs becomes necessary to meet complex demands. Recent studies suggest decomposing a fine-tuned LLM into a base model and corresponding delta weights, which are then compressed using low-rank or low-bit approaches to reduce costs. In this work, we observe that existing low-rank and low-bit compression methods can significantly harm the model performance for task-specific fine-tuned LLMs (e.g., WizardMath for math problems). Motivated by the long-tail distribution of singular values in the delta weights, we propose a delta quantization approach using mixed-precision. This method employs higher-bit representation for singular vectors corresponding to larger singular values. We evaluate our approach on various fine-tuned LLMs, including math LLMs, code LLMs, chat LLMs, and even VLMs. Experimental results demonstrate that our approach performs comparably to full fine-tuned LLMs, surpassing both low-rank and low-bit baselines by a considerable margin. Additionally, we show that our method is compatible with various backbone LLMs, such as Llama-2, Llama-3, and Mistral, highlighting its generalizability.
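
The idea in the abstract (split a fine-tuned model into base weights plus a delta, then spend more bits on the singular vectors that carry the largest singular values) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The round-to-nearest uniform quantizer, the function names `quantize_uniform` and `compress_delta`, and the bit plan `[(2, 8), (14, 3), (32, 2)]` are all assumptions chosen for illustration, and the weights are synthetic.

```python
import numpy as np


def quantize_uniform(x: np.ndarray, bits: int) -> np.ndarray:
    """Simulated round-to-nearest uniform quantization (quantize, then dequantize).

    Assumption: a simple min/max quantizer stands in for whatever
    quantizer the paper actually uses.
    """
    if x.size == 0:
        return x.copy()
    lo, hi = float(x.min()), float(x.max())
    if hi == lo:
        return x.copy()
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    return np.round((x - lo) / scale) * scale + lo


def compress_delta(w_base: np.ndarray, w_ft: np.ndarray, bit_plan) -> np.ndarray:
    """Return a mixed-precision approximation of the delta weights.

    bit_plan is a list of (num_singular_vectors, bits) pairs, ordered from
    the largest singular values to the smallest. Singular vectors beyond
    the plan are dropped. The plan format is hypothetical.
    """
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    approx = np.zeros_like(delta)
    start = 0
    for count, bits in bit_plan:
        stop = min(start + count, s.size)
        if stop <= start:
            break
        # Quantize this group of left/right singular vectors at `bits` bits.
        u_q = quantize_uniform(u[:, start:stop], bits)
        vt_q = quantize_uniform(vt[start:stop, :], bits)
        # Scale each column of u_q by its singular value, then recombine.
        approx += (u_q * s[start:stop]) @ vt_q
        start = stop
    return approx


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    m, n = 64, 48
    w_base = rng.standard_normal((m, n))

    # Synthetic delta with a long-tail singular value spectrum.
    u0, _ = np.linalg.qr(rng.standard_normal((m, n)))
    v0, _ = np.linalg.qr(rng.standard_normal((n, n)))
    s0 = 1.0 / (1.0 + np.arange(n)) ** 2
    w_ft = w_base + (u0 * s0) @ v0.T
    delta = w_ft - w_base

    mixed = compress_delta(w_base, w_ft, [(2, 8), (14, 3), (32, 2)])
    low_bit = compress_delta(w_base, w_ft, [(n, 2)])

    print("mixed-precision error:", np.linalg.norm(delta - mixed))
    print("uniform 2-bit error:  ", np.linalg.norm(delta - low_bit))

    # At serving time the compressed delta is added back to the shared base.
    w_served = w_base + mixed
```

In a multi-tenant setting, only `w_base` would be stored once, with one compressed delta per fine-tuned model. The sketch keeps the quantized factors in floating point for readability; an actual deployment would store the integer codes and scales instead.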

arXiv information

Authors	Bowen Ping, Shuo Wang, Hanqing Wang, Xu Han, Yuzhuang Xu, Yukun Yan, Yun Chen, Baobao Chang, Zhiyuan Liu, Maosong Sun
Published	2024-11-20 07:42:38+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
