BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

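Summary

In recent years, the input context size of large language models (LLMs) has grown dramatically.
However, existing evaluation methods have not kept pace and cannot comprehensively assess how efficiently models handle long contexts.
To bridge this gap, we introduce the BABILong benchmark, designed to test a language model's ability to reason across facts distributed in extremely long documents.
BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets.
These tasks are challenging on their own, and become even harder when the required facts are scattered across long natural text.
Our evaluations show that popular LLMs effectively use only 10-20% of the context, and that their performance declines sharply as reasoning complexity increases.
Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length.
Among context extension methods, recurrent memory transformers perform best, enabling the processing of lengths up to 11 million tokens.
The BABILong benchmark can be extended to any length to support the evaluation of upcoming models with increased capabilities, and splits are provided up to 1 million token lengths.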

Summary (original)

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models’ ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.

arXiv information

Authors	Yuri Kuratov, Aydar Bulatov, Petr Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, Mikhail Burtsev
Published	2024-06-14 16:00:29+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
