Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Summary

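Large language models have achieved remarkable success in recent years, largely thanks to self-attention mechanisms.
However, conventional Softmax attention suffers from numerical instability and degraded performance as the number of tokens at inference time grows.
This paper addresses these problems by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm.
We identify the latter as essential for maintaining model performance.
By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic scale factor for different token lengths based on invariance entropy, we obtain a new attention mechanism that outperforms conventional Softmax attention across a range of inference lengths.
To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a fine-tuning-free re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, letting the model focus more effectively on relevant tokens without retraining.
Combined with the proposed attention mechanism, this approach shows considerable promise for handling longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability.
Our code is available at https://github.com/iminfine/freeatten.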
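
The decomposition described in the summary can be checked numerically. The sketch below is an illustration only, not the authors' implementation: it shows that Softmax equals an exponential followed by $l_1$ normalisation, and then swaps the exponential for Softplus. The helper names and the `scale` parameter are assumptions introduced here; the summary does not give the form of the length-dependent scale factor, so `scale` is left as a plain placeholder.

```python
import math


def softmax(scores):
    """Standard softmax over a list of attention scores."""
    m = max(scores)  # subtracting the max avoids overflow and leaves the result unchanged
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l1_normalize(values):
    """Divide each entry by the l1-norm of the vector."""
    total = sum(abs(v) for v in values)
    return [v / total for v in values]


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def softplus_attention_weights(scores, scale=1.0):
    # `scale` is a hypothetical stand-in for the paper's dynamic,
    # length-dependent factor, whose exact form is not stated in the summary.
    return l1_normalize([softplus(scale * s) for s in scores])


scores = [2.0, 0.5, -1.0, 0.0]

# Softmax is exactly "exp, then l1-normalise".
softmax_weights = softmax(scores)
decomposed = l1_normalize([math.exp(s) for s in scores])

# Replacing exp with Softplus keeps the weights positive and summing to 1.
softplus_weights = softplus_attention_weights(scores)
```

Because Softplus grows linearly rather than exponentially for large inputs, the resulting weights are less peaked than the Softmax weights for the same scores.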

Summary (original)

Large language models have achieved remarkable success in recent years, primarily due to the implementation of self-attention mechanisms. However, traditional Softmax attention suffers from numerical instability and reduced performance as the length of inference tokens increases. This paper addresses these issues by decomposing the Softmax operation into a non-linear transformation and the $l_1$-norm. We identify the latter as essential for maintaining model performance. By replacing the non-linear transformation with the Softplus activation function and introducing a dynamic scale factor for different token lengths based on invariance entropy, we create a novel attention mechanism with performance better than conventional Softmax attention across various inference lengths. To further improve the length extrapolation ability of the proposed attention mechanism, we introduce a fine-tuning-free re-weighting mechanism that amplifies significant attention weights while diminishing weaker ones, enabling the model to concentrate more effectively on relevant tokens without requiring retraining. When combined with our proposed attention mechanism, this approach demonstrates significant promise in managing longer sequences, maintaining nearly constant validation loss even at 16$\times$ the training token length while ensuring numerical stability. Our code is available at: https://github.com/iminfine/freeatten.
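
The re-weighting idea (amplify significant attention weights, diminish weaker ones, no retraining) can be illustrated with a toy example. The abstract does not state the actual re-weighting formula, so the power-and-renormalise rule below is an assumption chosen purely for illustration, and `reweight` and `power` are hypothetical names.

```python
def reweight(weights, power=2.0):
    """Sharpen a normalised weight vector.

    Assumed rule (not taken from the paper): raise each weight to
    `power` > 1, then l1-normalise again so the weights sum to 1.
    """
    sharpened = [w ** power for w in weights]
    total = sum(sharpened)
    return [s / total for s in sharpened]


weights = [0.5, 0.3, 0.15, 0.05]  # already-normalised attention weights
sharp = reweight(weights)
```

With `power` greater than 1, the largest weight gains mass and the smallest loses it, while the ordering of the tokens is unchanged. Because the rule acts only on the attention weights at inference time, it requires no change to the trained parameters.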

arXiv information

Authors	Bo Gao, Michael W. Spratling
Published	2025-01-27 11:58:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
