Prompt-prompted Mixture of Experts for Efficient LLM Generation

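Summary

Transformer-based large language models (LLMs) have been applied across many fields thanks to their remarkable utility, but deploying them carries a considerable computational cost.
Methods such as pruning or constructing a mixture of experts (MoE) aim to exploit sparsity in the transformer feedforward (FF) blocks to increase speed and reduce memory requirements.
However, these techniques often require training or are restricted to specific types of architectures, which can make them very costly and inflexible in practice.
To address this, we introduce GRIFFIN, a novel training-free MoE that selects unique FF experts at the sequence level for efficient generation across a wide range of LLMs with different non-ReLU activation functions.
This is possible because of a key observation: many trained LLMs naturally produce highly structured FF activation patterns within a sequence, a phenomenon we call flocking.
Despite the simplicity of our method, GRIFFIN uses 50% of the FF parameters and maintains the original model's performance with little to no degradation on a variety of classification and generation tasks, while also improving latency (e.g., a 1.25$\times$ speed-up for Llama 2 13B on an NVIDIA L40).
Code is available at https://github.com/hdong920/GRIFFIN.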

Summary (original)

With the development of transformer-based large language models (LLMs), they have been applied to many fields due to their remarkable utility, but this comes at a considerable computational cost at deployment. Fortunately, some methods such as pruning or constructing a mixture of experts (MoE) aim at exploiting sparsity in transformer feedforward (FF) blocks to gain boosts in speed and reduction in memory requirements. However, these techniques can be very costly and inflexible in practice, as they often require training or are restricted to specific types of architectures. To address this, we introduce GRIFFIN, a novel training-free MoE that selects unique FF experts at the sequence level for efficient generation across a plethora of LLMs with different non-ReLU activation functions. This is possible due to a critical observation that many trained LLMs naturally produce highly structured FF activation patterns within a sequence, which we call flocking. Despite our method’s simplicity, we show with 50% of the FF parameters, GRIFFIN maintains the original model’s performance with little to no degradation on a variety of classification and generation tasks, all while improving latency (e.g. 1.25$\times$ speed-up in Llama 2 13B on an NVIDIA L40). Code is available at https://github.com/hdong920/GRIFFIN.
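The abstract does not spell out how the sequence-level experts are chosen, so the following is only a minimal sketch of one plausible reading: score each FF neuron by its relative activation over the prompt tokens, keep the top 50%, and reuse that subset for every generated token. The function names (`select_ff_neurons`, `ff_forward`), the per-token normalization, and the SiLU-gated FF layout are assumptions made for illustration and are not taken from the GRIFFIN code.

```python
import math


def select_ff_neurons(prompt_acts, keep_frac=0.5):
    """Pick the FF neurons to keep for a whole sequence (hypothetical rule).

    prompt_acts: one row per prompt token, each row a list of FF
    activations (one value per neuron).
    Returns the sorted indices of the kept neurons.
    """
    n_neurons = len(prompt_acts[0])
    scores = [0.0] * n_neurons
    for row in prompt_acts:
        # Assumption: normalize per token so one large-norm token
        # does not dominate the neuron scores.
        norm = math.sqrt(sum(a * a for a in row)) or 1.0
        for j, a in enumerate(row):
            scores[j] += (a / norm) ** 2
    k = max(1, int(keep_frac * n_neurons))
    top = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)[:k]
    return sorted(top)


def silu(z):
    """SiLU activation, a non-ReLU function: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def ff_forward(x, w_gate, w_up, w_down, keep=None):
    """Gated FF block for one token, computed only over the kept neurons.

    x: list of length d_model.
    w_gate, w_up: d_model x d_ff nested lists.
    w_down: d_ff x d_model nested list.
    keep: neuron indices to use; None means all neurons (the full model).
    """
    keep = range(len(w_down)) if keep is None else keep
    out = [0.0] * len(w_down[0])
    for j in keep:
        gate = sum(xi * row[j] for xi, row in zip(x, w_gate))
        up = sum(xi * row[j] for xi, row in zip(x, w_up))
        act = silu(gate) * up
        for c, w in enumerate(w_down[j]):
            out[c] += act * w
    return out
```

In this sketch the selection runs once per sequence on the prompt activations, and generation then calls `ff_forward` with the returned indices, so the skipped neurons cost no multiply-adds. No training or weight update is involved, which matches the training-free claim in the abstract; everything else about the scoring rule should be checked against the linked repository.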

arXiv information

Authors	Harry Dong, Beidi Chen, Yuejie Chi
Published	2024-04-05 14:31:21+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
