Why Does the Effective Context Length of LLMs Fall Short?

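Summary

Advances in distributed training and efficient attention mechanisms have greatly expanded the context window sizes of large language models (LLMs).
However, recent work shows that the effective context length of open-source LLMs often falls short, typically not exceeding half of the training length.
This work attributes the limitation to the left-skewed frequency distribution of relative positions formed during the pretraining and post-training stages of LLMs, which impedes their ability to gather distant information effectively.
To address this, the authors introduce ShifTed Rotary position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, improving performance within the existing training length.
Experiments show that, without additional training, STRING improves the performance of recent large models such as Llama3.1 70B and Qwen2 72B by more than 10 points on the popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs.
Compared with commercial models, Llama 3.1 70B with STRING outperforms GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.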
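
The skew in relative-position frequencies mentioned above can be illustrated with a small counting exercise. The sketch below is not taken from the paper; it assumes a single sequence of `seq_len` tokens under causal attention, where each query position `i` attends to every key position `j <= i`, and counts how often each distance `i - j` occurs. The function name `relative_position_counts` is made up for this illustration.

```python
from collections import Counter


def relative_position_counts(seq_len: int) -> Counter:
    """Count how often each relative distance i - j occurs under causal attention."""
    counts = Counter()
    for i in range(seq_len):        # query position
        for j in range(i + 1):      # key positions visible to a causal model
            counts[i - j] += 1
    return counts


counts = relative_position_counts(8)
for distance in range(8):
    # Each distance d occurs seq_len - d times, so large distances are rare.
    print(distance, counts[distance])
```

Even in this idealized setting, the largest distance is seen only once per sequence while distance 0 is seen `seq_len` times. A training corpus containing many documents shorter than the context window would presumably make long distances rarer still.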

Abstract (original)

Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of large language models (LLMs). However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed in LLMs' pretraining and post-training stages, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama3.1 70B and Qwen2 72B, by over 10 points on popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.
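
The abstract says only that STRING shifts well-trained positions to overwrite ineffective ones at inference time; it does not give the exact rule. The toy sketch below is therefore an assumption, not the authors' implementation: it builds a causal relative-position matrix and replaces every distance of at least `shift` with that distance minus `shift`, so that rarely trained large distances reuse the indices of frequently trained small ones. The function name and the `shift` parameter are hypothetical, and the paper's actual rule may differ, for example in how nearby tokens are handled.

```python
def shifted_relative_positions(seq_len: int, shift: int) -> list[list[int]]:
    """Toy remap (assumed rule): distances >= shift are reduced by shift."""
    matrix = []
    for i in range(seq_len):            # query position
        row = []
        for j in range(i + 1):          # causal: only keys at or before i
            distance = i - j
            # Far-away keys reuse a smaller, more frequently trained index.
            row.append(distance - shift if distance >= shift else distance)
        matrix.append(row)
    return matrix


toy = shifted_relative_positions(6, 3)
for row in toy:
    print(row)
```

With `seq_len = 6` and `shift = 3`, no remapped index exceeds 2, whereas the unshifted matrix would use distances up to 5. No retraining is involved, which matches the abstract's claim that the gains come without additional training.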

arXiv information

Authors	Chenxin An, Jun Zhang, Ming Zhong, Lei Li, Shansan Gong, Yao Luo, Jingjing Xu, Lingpeng Kong
Published	2024-10-24 13:51:50+00:00
arXiv site	arxiv_id (pdf)

Sources and services used

arxiv.jp, Google
