Memory-Efficient Fine-Tuning of Compressed Large Language Models via sub-4-bit Integer Quantization

Summary

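Parameter-efficient fine-tuning (PEFT) methods emerged to reduce the prohibitive cost of fully fine-tuning large language models (LLMs).
Even so, the sheer size of LLMs still impedes routine deployment.
To address this, we present Parameter-Efficient and Quantization-aware Adaptation (PEQA), a quantization-aware PEFT technique that compresses the model and accelerates inference.
PEQA works in two stages. First, the parameter matrix of each fully-connected layer is quantized into a matrix of low-bit integers and a scalar vector.
Then, for each downstream task, only the scalar vector is fine-tuned.
This strategy compresses the model considerably, which lowers inference latency at deployment and reduces the overall memory required.
It also makes fine-tuning fast and task switching efficient.
PEQA therefore provides the benefits of quantization while inheriting the advantages of PEFT.
We compare PEQA with competitive baselines in comprehensive experiments ranging from natural language understanding to generation benchmarks.
The experiments use large language models of up to 65 billion parameters and demonstrate PEQA's scalability, its task-specific adaptation performance, and its ability to follow instructions, even in extremely low-bit settings.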
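The two-stage process can be sketched in plain Python. This is a toy illustration, not the authors' implementation. It assumes the "scalar vector" is one quantization scale per weight row, uses asymmetric round-to-nearest quantization with a per-row zero point, and fits the scales with full-batch gradient descent on a squared-error loss. The function names (`quantize_row`, `predict`, `mean_loss`, `tune_scales`) and the toy data are hypothetical.

```python
def quantize_row(row, bits):
    """Stage 1 (assumed scheme): map one float row to integer codes, a scale, a zero point."""
    levels = 2 ** bits - 1
    lo, hi = min(row), max(row)
    scale = (hi - lo) / levels if hi > lo else 1.0  # constant rows are not handled carefully
    zero = round(-lo / scale)
    codes = [min(levels, max(0, round(w / scale) + zero)) for w in row]
    return codes, scale, zero


def predict(codes, scales, zeros, x):
    """y_i = s_i * sum_j (q_ij - z_i) * x_j; the integer codes stay frozen."""
    return [s * sum((q - z) * xj for q, xj in zip(row, x))
            for row, s, z in zip(codes, scales, zeros)]


def mean_loss(codes, scales, zeros, data):
    """Mean of 0.5 * squared error over all (input, target) pairs."""
    total = 0.0
    for x, target in data:
        y = predict(codes, scales, zeros, x)
        total += 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(y, target))
    return total / len(data)


def tune_scales(codes, scales, zeros, data, lr=0.05, steps=100):
    """Stage 2: gradient descent on the scales only; codes and zero points never change."""
    scales = list(scales)
    for _ in range(steps):
        grads = [0.0] * len(scales)
        for x, target in data:
            for i, row in enumerate(codes):
                u = sum((q - zeros[i]) * xj for q, xj in zip(row, x))
                grads[i] += (scales[i] * u - target[i]) * u / len(data)
        scales = [s - lr * g for s, g in zip(scales, grads)]
    return scales


# Toy 2x4 "layer" quantized to 3 bits.
weights = [[0.5, -0.3, 0.1, 0.9], [-0.7, 0.2, 0.4, -0.1]]
quantized = [quantize_row(row, bits=3) for row in weights]
codes = [q[0] for q in quantized]
scales = [q[1] for q in quantized]
zeros = [q[2] for q in quantized]

# Toy "downstream task": targets come from the original weights scaled by 1.5.
inputs = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
data = [(x, [1.5 * sum(w * xj for w, xj in zip(row, x)) for row in weights])
        for x in inputs]

tuned = tune_scales(codes, scales, zeros, data)
print("loss before:", mean_loss(codes, scales, zeros, data))
print("loss after: ", mean_loss(codes, tuned, zeros, data))
```

Because only `scales` changes between tasks, switching tasks in this sketch means swapping one short list of floats while the integer codes are shared, which mirrors the task-switching argument in the summary.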

Abstract (original)

Parameter-efficient fine-tuning (PEFT) methods have emerged to mitigate the prohibitive cost of full fine-tuning large language models (LLMs). Nonetheless, the enormous size of LLMs impedes routine deployment. To address the issue, we present Parameter-Efficient and Quantization-aware Adaptation (PEQA), a novel quantization-aware PEFT technique that facilitates model compression and accelerates inference. PEQA operates through a dual-stage process: initially, the parameter matrix of each fully-connected layer undergoes quantization into a matrix of low-bit integers and a scalar vector; subsequently, fine-tuning occurs on the scalar vector for each downstream task. Such a strategy compresses the size of the model considerably, leading to a lower inference latency upon deployment and a reduction in the overall memory required. At the same time, fast fine-tuning and efficient task switching become possible. In this way, PEQA offers the benefits of quantization, while inheriting the advantages of PEFT. We compare PEQA with competitive baselines in comprehensive experiments ranging from natural language understanding to generation benchmarks. This is done using large language models of up to 65 billion parameters, demonstrating PEQA’s scalability, task-specific adaptation performance, and ability to follow instructions, even in extremely low-bit settings.
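To give a rough sense of the compression the abstract refers to, the following back-of-the-envelope calculation estimates weight storage for a 65-billion-parameter model at different bit widths. It is only an approximation under stated assumptions: 1 GB is taken as 10^9 bytes, a 16-bit baseline is assumed, and the optional scale overhead uses a hypothetical layout of one 16-bit scale per `weights_per_scale` weights. The helper name `weight_gigabytes` is illustrative and does not come from the paper.

```python
def weight_gigabytes(num_params, bits_per_weight, weights_per_scale=None, scale_bits=16):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    total_bits = num_params * bits_per_weight
    if weights_per_scale is not None:
        # Assumed layout: one scale shared by `weights_per_scale` weights.
        total_bits += (num_params / weights_per_scale) * scale_bits
    return total_bits / 8 / 1e9


NUM_PARAMS = 65_000_000_000  # "up to 65 billion parameters"

for bits in (16, 4, 3):
    print(f"{bits:>2}-bit weights: {weight_gigabytes(NUM_PARAMS, bits):.3f} GB")
```

Under these assumptions the integer weights shrink by a factor of 4 at 4 bits relative to 16 bits, before accounting for scales and activations.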

arXiv information

Authors	Jeonghoon Kim, Jung Hyun Lee, Sungdong Kim, Joonsuk Park, Kang Min Yoo, Se Jung Kwon, Dongsoo Lee
Published	2023-05-23 15:20:01+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
