When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

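Abstract

Autoregressive large language models (LLMs) achieve strong performance on language tasks but face two significant bottlenecks: (1) the quadratic complexity of the attention module as the number of tokens grows, and (2) limited efficiency during generation, caused by the sequential nature of autoregressive decoding.
Linear attention and speculative decoding offer potential solutions, but their applicability to autoregressive LLMs and their potential synergy remain uncertain.
We conduct the first comprehensive study of the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding.
We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs.
Extensive experiments and ablation studies covering seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs.
Notably, our approach reduces perplexity on the LLaMA model by up to 6.67 and achieves up to a 2$\times$ speedup during generation compared with prior linear attention methods.
Code and models are available at https://github.com/GATECH-EIC/Linearized-LLM.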

Abstract (original)

Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2$\times$ speedup during generation compared to prior linear attention methods. Codes and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
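To make the quadratic-versus-linear contrast in the abstract concrete, here is a minimal sketch of generic kernelized causal linear attention in plain Python. It is not the augmented technique the paper introduces. The `elu(x) + 1` feature map and all function names are assumptions chosen for illustration. The linear version keeps a fixed-size running state per step, while the reference version recomputes weights over every previous token.

```python
import math


def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention
    # (an illustrative assumption, not taken from the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values):
    """Causal attention with O(1) work per token with respect to sequence length."""
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    norm = [0.0] * d_k                         # running sum of phi(k)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        phi_k = feature_map(k)
        # Fold the current token into the running state before reading it,
        # so each position attends to itself and everything before it.
        for i in range(d_k):
            norm[i] += phi_k[i]
            for j in range(d_v):
                state[i][j] += phi_k[i] * v[j]
        phi_q = feature_map(q)
        denom = sum(phi_q[i] * norm[i] for i in range(d_k))
        outputs.append(
            [sum(phi_q[i] * state[i][j] for i in range(d_k)) / denom
             for j in range(d_v)]
        )
    return outputs


def causal_quadratic_reference(queries, keys, values):
    """The same kernel attention computed token-pair by token-pair (O(n^2))."""
    d_v = len(values[0])
    outputs = []
    for t, q in enumerate(queries):
        phi_q = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(phi_q, feature_map(keys[s])))
            for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [sum(weights[s] * values[s][j] for s in range(t + 1)) / total
             for j in range(d_v)]
        )
    return outputs
```

Both functions should return the same outputs up to floating-point error. The difference is cost: the reference loop touches every earlier token at each step, whereas the linear version only updates and reads a `d_k` by `d_v` state, which is what makes per-token generation cost independent of context length.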

arXiv information

Authors	Haoran You, Yichao Fu, Zheng Wang, Amir Yazdanbakhsh, Yingyan Lin
Published	2024-06-11 15:34:43+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
