Information Entropy Invariance: Enhancing Length Extrapolation in Attention Mechanisms

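Abstract

Improving the length extrapolation ability of large language models (LLMs) remains a key challenge in natural language processing.
Many recent efforts modify the scaled dot-product attention mechanism, often introducing a scaled temperature without rigorous theoretical justification.
To close this gap, we introduce a new approach based on the invariance of information entropy.
We propose two new scaled temperatures to strengthen length extrapolation.
First, InfoScale is a training-free method designed for dot-product attention; it keeps attention focused on the original tokens during length extrapolation by keeping the information entropy consistent.
Second, we theoretically analyze the effect of scaling (CosScale) on cosine attention.
Experiments show that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, outperforming seven existing methods.
Our analysis shows that a significantly larger CosScale approximates windowed attention, and it highlights attention score dilution as a key challenge in long-range context handling.
The code and data are available at https://github.com/HT-NEKO/InfoScale.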
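
The entropy argument in the abstract can be made concrete with a small numerical sketch. It measures the Shannon entropy of softmax attention over random keys and shows that the entropy grows with sequence length (attention mass is diluted), and that a length-dependent temperature slows that growth. The log-length factor used here is a common heuristic swapped in for illustration; it is an assumption and not the InfoScale formula, which the abstract does not state. The names `attention_entropy`, `mean_entropy`, and `train_len` are hypothetical.

```python
import numpy as np


def attention_entropy(scores):
    """Shannon entropy (in nats) of softmax(scores) along the last axis."""
    z = scores - scores.max(axis=-1, keepdims=True)  # stabilise the softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)


def mean_entropy(n, d=64, train_len=512, length_scaled=False, seed=0):
    """Mean attention entropy of 32 random queries over n random keys.

    Assumption: with length_scaled=True the usual 1/sqrt(d) temperature is
    multiplied by log(n) / log(train_len). This is an illustrative heuristic,
    not the paper's InfoScale.
    """
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((32, d))
    k = rng.standard_normal((n, d))
    temperature = 1.0 / np.sqrt(d)
    if length_scaled:
        temperature *= np.log(n) / np.log(train_len)
    return float(attention_entropy(q @ k.T * temperature).mean())


if __name__ == "__main__":
    for n in (512, 2048, 8192):
        print(n, mean_entropy(n), mean_entropy(n, length_scaled=True))
```

In this toy setting the heuristic only reduces the entropy growth; it does not hold the entropy exactly constant, which is the property an entropy-invariant temperature is meant to provide.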

Abstract (original)

Improving the length extrapolation capabilities of Large Language Models (LLMs) remains a critical challenge in natural language processing. Many recent efforts have focused on modifying the scaled dot-product attention mechanism, and often introduce scaled temperatures without rigorous theoretical justification. To fill this gap, we introduce a novel approach based on information entropy invariance. We propose two new scaled temperatures to enhance length extrapolation. First, a training-free method InfoScale is designed for dot-product attention, and preserves focus on original tokens during length extrapolation by ensuring information entropy remains consistent. Second, we theoretically analyze the impact of scaling (CosScale) on cosine attention. Experimental data demonstrates that combining InfoScale and CosScale achieves state-of-the-art performance on the GAU-α model with a context window extended to 64 times the training length, and outperforms seven existing methods. Our analysis reveals that significantly increasing CosScale approximates windowed attention, and highlights the significance of attention score dilution as a key challenge in long-range context handling. The code and data are available at https://github.com/HT-NEKO/InfoScale.
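
The remark that a large CosScale behaves like windowed attention can be probed with a minimal sketch of cosine attention, in which queries and keys are L2-normalised and the cosine similarities are multiplied by a scale before the softmax. The functions `cosine_attention_weights` and `effective_span`, and the scale values below, are hypothetical and chosen only for illustration. The sketch shows just one part of the claim: a larger scale concentrates attention on fewer keys. Whether those keys form a local window depends on the positional encoding, which this sketch leaves out.

```python
import numpy as np


def cosine_attention_weights(q, k, scale):
    """Softmax over scale * cosine_similarity(q, k), one row per query."""
    qn = q / np.linalg.norm(q, axis=-1, keepdims=True)
    kn = k / np.linalg.norm(k, axis=-1, keepdims=True)
    logits = scale * (qn @ kn.T)
    logits -= logits.max(axis=-1, keepdims=True)  # stabilise the softmax
    w = np.exp(logits)
    return w / w.sum(axis=-1, keepdims=True)


def effective_span(weights):
    """exp(entropy): roughly how many keys receive noticeable attention."""
    h = -(weights * np.log(weights + 1e-12)).sum(axis=-1)
    return np.exp(h)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.standard_normal((8, 32))
    k = rng.standard_normal((1024, 32))
    for scale in (1.0, 10.0, 100.0):  # illustrative values only
        span = effective_span(cosine_attention_weights(q, k, scale)).mean()
        print(scale, span)
```

With a small scale the weights are close to uniform over all 1024 keys, which is the dilution effect the abstract describes; with a large scale the effective span shrinks sharply.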

arXiv information

Authors	Kewei Li, Yanwen Kong, Yiping Xu, Lan Huang, Ruochi Zhang, Fengfeng Zhou
Published	2025-01-15 04:32:41+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
