Sirius: Contextual Sparsity with Correction for Efficient LLMs

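Summary

As large language models (LLMs) flourish, inference efficiency is becoming increasingly important.
Various approximation methods have been proposed to reduce inference-time cost.
Contextual sparsity (CS) is appealing because it is training-free and can reach higher compression ratios, seemingly without degrading quality.
However, a comprehensive evaluation of contextual sparsity methods on a range of complex generation tasks shows that, although CS succeeds on prompt-understanding tasks, it significantly degrades model performance on reasoning, deduction, and knowledge-based tasks.
Despite the gap in end-to-end accuracy, sparse models often share the general problem-solving logic of the original model, and only a few token corrections are needed to recover the original model's performance.
This paper introduces Sirius, an efficient correction mechanism that significantly recovers the quality of CS models on reasoning tasks while maintaining their efficiency gains.
Sirius is evaluated on 6 models across 8 difficult generation tasks in reasoning, math, and coding, and shows consistent effectiveness and efficiency.
The authors also carefully develop a system implementation of Sirius and show that it achieves roughly a 20% reduction in latency for an 8B model on-chip and a 35% reduction for a 70B model with offloading.
The implementation of Sirius is open-sourced at https://github.com/Infini-AI-Lab/Sirius.git.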
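
The observation that only a few token corrections are needed can be illustrated with a toy sketch. The abstract does not say how Sirius detects or corrects tokens, so the chunked draft-and-verify loop below is an assumed, generic scheme and not the authors' algorithm. `full_model`, `sparse_model`, `generate_with_correction`, and the `chunk` parameter are all hypothetical names, and the "models" are simple arithmetic rules rather than real LLMs.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # maps a context to the next token


def full_model(context: List[Token]) -> Token:
    """Toy stand-in for the dense model: a deterministic next-token rule."""
    return (sum(context) + len(context)) % 10


def sparse_model(context: List[Token]) -> Token:
    """Toy stand-in for a contextually sparse model (hypothetical).

    It agrees with full_model except when the context length is a
    multiple of 5, mimicking an occasional wrong token.
    """
    token = full_model(context)
    if len(context) % 5 == 0:
        return (token + 1) % 10
    return token


def generate_with_correction(
    prompt: List[Token], num_tokens: int, chunk: int = 4
) -> Tuple[List[Token], int]:
    """Draft with the sparse model, then check each chunk with the full model.

    Assumed scheme: at the first mismatch in a chunk, the full model's token
    replaces the drafted one and the rest of the chunk is discarded.
    Returns the generated tokens and the number of corrections made.
    """
    out = list(prompt)
    corrections = 0
    target = len(prompt) + num_tokens
    while len(out) < target:
        # Draft up to `chunk` tokens cheaply with the sparse model.
        draft = list(out)
        for _ in range(min(chunk, target - len(out))):
            draft.append(sparse_model(draft))
        # Verify the drafted positions; a real system would batch this
        # into one parallel pass of the full model.
        for i in range(len(out), len(draft)):
            expected = full_model(draft[:i])
            if draft[i] != expected:
                draft = draft[:i] + [expected]  # correct and drop the rest
                corrections += 1
                break
        out = draft
    return out[len(prompt):], corrections


def generate_full(prompt: List[Token], num_tokens: int) -> List[Token]:
    """Reference: plain greedy generation with the full model only."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(full_model(out))
    return out[len(prompt):]
```

With the prompt `[1, 2, 3]` and 20 generated tokens, this toy loop reproduces the full model's output exactly while correcting only 4 of the 20 tokens. The real trade-off depends on how often the sparse model errs and how expensive each full-model check is, which the toy rules here do not capture.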

Summary (original)

With the blossom of large language models (LLMs), inference efficiency becomes increasingly important. Various approximation methods are proposed to reduce the cost at inference time. Contextual Sparsity (CS) is appealing for its training-free nature and its ability to reach a higher compression ratio seemingly without quality degradation. However, after a comprehensive evaluation of contextual sparsity methods on various complex generation tasks, we find that although CS succeeds in prompt-understanding tasks, CS significantly degrades the model performance for reasoning, deduction, and knowledge-based tasks. Despite the gap in end-to-end accuracy, we observed that sparse models often share general problem-solving logic and require only a few token corrections to recover the original model performance. This paper introduces Sirius, an efficient correction mechanism, which significantly recovers CS models quality on reasoning tasks while maintaining its efficiency gain. Sirius is evaluated on 6 models with 8 difficult generation tasks in reasoning, math, and coding and shows consistent effectiveness and efficiency. Also, we carefully develop a system implementation for Sirius and show that Sirius achieves roughly 20% reduction in latency for 8B model on-chip and 35% reduction for 70B model offloading. We open-source our implementation of Sirius at https://github.com/Infini-AI-Lab/Sirius.git.

arXiv information

Authors	Yang Zhou, Zhuoming Chen, Zhaozhuo Xu, Victoria Lin, Beidi Chen
Published	2024-09-05 18:38:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
