LongHeads: Multi-Head Attention is Secretly a Long Context Processor

Summary

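Large language models (LLMs) perform impressively in many domains, but they often struggle to process long inputs effectively and efficiently because of limited length generalization and the quadratic computational cost of attention.
Many approaches try to mitigate this by restricting the attention window to the pre-trained length.
However, these methods introduce new problems, such as ignoring the middle of the context or requiring additional training.
To address these problems, we propose LongHeads, a training-free framework that strengthens the long-context ability of LLMs by unlocking the untapped potential of multi-head attention.
Rather than letting each head attend to the full sentence, which generalizes poorly to longer sequences because of out-of-distribution (OOD) issues, we let each head work within an in-distribution length by selecting and attending to important context chunks.
To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and key representations and efficiently distributes context chunks to different heads.
In this way, each head is guaranteed to process its attended tokens within the trained length, while different heads in different layers collectively process longer contexts.
LongHeads runs in linear time and fits seamlessly with many LLMs that use relative positional encoding.
LongHeads achieves 100% accuracy at a length of 128k on the passkey retrieval task, which verifies its effectiveness in extending the usable context window of existing models.
The code is released at https://github.com/LuLuLuyi/LongHeads.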
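
The chunk selection idea in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository above for that); it only shows one plausible reading of "select important chunks per head from query-key correlation". The function name `select_chunks_per_head`, the use of the mean key as a chunk representation, and the parameters `chunk_size` and `top_k` are all assumptions made for illustration.

```python
import numpy as np


def select_chunks_per_head(query, keys, chunk_size, top_k):
    """Pick the top_k highest-scoring context chunks for each attention head.

    query: (num_heads, head_dim) query vector of the current token, per head.
    keys:  (num_heads, seq_len, head_dim) cached key vectors, per head.
    Returns an int array of shape (num_heads, k) with chunk indices in
    ascending order, where k = min(top_k, number of chunks).
    """
    num_heads, seq_len, head_dim = keys.shape
    # Assumption: a ragged tail shorter than chunk_size is ignored for simplicity.
    num_chunks = seq_len // chunk_size
    usable = keys[:, : num_chunks * chunk_size]
    chunks = usable.reshape(num_heads, num_chunks, chunk_size, head_dim)

    # Assumption: each chunk is summarised by the mean of its key vectors.
    chunk_repr = chunks.mean(axis=2)  # (num_heads, num_chunks, head_dim)

    # Query-key correlation: dot product between each head's query and
    # each chunk representation of that same head.
    scores = np.einsum("hd,hcd->hc", query, chunk_repr)  # (num_heads, num_chunks)

    k = min(top_k, num_chunks)
    top = np.argsort(-scores, axis=1)[:, :k]  # best k chunks per head
    return np.sort(top, axis=1)  # restore left-to-right chunk order
```

If `top_k * chunk_size` is kept at or below the pre-trained context length, each head attends to an in-distribution number of tokens, while different heads may pick different chunks and so cover more of the long input between them.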

Summary (original)

Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention’s quadratic computational demands. Many sought to mitigate this by restricting the attention window within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLM’s long context ability by unlocking multi-head attention’s untapped potential. Instead of allowing each head to attend to the full sentence, which struggles with generalizing to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head ensures it can effectively process attended tokens within the trained length, while different heads in different layers can collectively process longer contexts. LongHeads works efficiently in linear time, fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on passkey retrieval task, verifying LongHeads’s efficacy in extending the usable context window for existing models. We release our code at https://github.com/LuLuLuyi/LongHeads .

arXiv information

Authors	Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang
Published	2024-03-25 11:50:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
