Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

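Summary

Hybrid LLM architectures that combine attention and state space models (SSMs) achieve state-of-the-art accuracy and runtime performance.
Recent work has shown that applying compression and distillation to attention-only models yields smaller, more accurate models at a fraction of the training cost.
This work explores how effective compression is for hybrid architectures.
It introduces a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities.
It also demonstrates that such SSM pruning is necessary to achieve better accuracy and inference speed than traditional approaches.
The compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique.
Using this approach, the Nemotron-H 8B hybrid model is compressed down to 4B parameters with up to 40x fewer training tokens.
The resulting model surpasses the accuracy of similarly sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.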

Abstract (original)

Hybrid LLM architectures that combine Attention and State Space Models (SSMs) achieve state-of-the-art accuracy and runtime performance. Recent work has demonstrated that applying compression and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. We introduce a novel group-aware pruning strategy that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. Furthermore, we demonstrate the necessity of such SSM pruning to achieve improved accuracy and inference speed compared to traditional approaches. Our compression recipe combines SSM, FFN, embedding dimension, and layer pruning, followed by knowledge distillation-based retraining, similar to the MINITRON technique. Using this approach, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to 40x fewer training tokens. The resulting model surpasses the accuracy of similarly-sized models while achieving 2x faster inference, significantly advancing the Pareto frontier.
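
The abstract does not spell out how group-aware pruning selects what to keep, so the following is only a minimal sketch under stated assumptions. It assumes that SSM heads are laid out in equally sized contiguous groups, that a per-head importance score is already available, and that "group-aware" means keeping the same number of heads in every group so the group layout stays regular. The function name `group_aware_head_selection` and the scores are hypothetical and are not taken from the paper.

```python
from typing import List


def group_aware_head_selection(
    scores: List[float], num_groups: int, keep_per_group: int
) -> List[int]:
    """Return indices of heads to keep, the same count from every group.

    Assumption (not from the paper): heads are stored group by group, so
    group g owns indices [g * heads_per_group, (g + 1) * heads_per_group).
    """
    if num_groups <= 0 or len(scores) % num_groups != 0:
        raise ValueError("heads must divide evenly into groups")
    heads_per_group = len(scores) // num_groups
    if not 0 < keep_per_group <= heads_per_group:
        raise ValueError("keep_per_group must be in 1..heads_per_group")

    kept: List[int] = []
    for g in range(num_groups):
        start = g * heads_per_group
        group = range(start, start + heads_per_group)
        # Rank heads only against the other heads of the same group.
        top = sorted(group, key=lambda i: scores[i], reverse=True)[:keep_per_group]
        # Keep the original ordering inside the group.
        kept.extend(sorted(top))
    return kept


# Hypothetical importance scores for 8 heads in 2 groups of 4.
scores = [0.9, 0.8, 0.7, 0.1, 0.2, 0.3, 0.05, 0.6]
print(group_aware_head_selection(scores, num_groups=2, keep_per_group=2))
```

With these scores, a global top-4 selection would keep three heads from the first group and only one from the second, leaving the groups unequal in size. The per-group selection keeps two heads in each group, which is the kind of structural regularity the abstract refers to.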

arXiv information

Authors	Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Marcin Chochowski, Yashaswi Karnati, Raviraj Joshi, Ameya Sunil Mahabaleshwarkar, Zijia Chen, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Published	2025-04-15 17:26:29+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
