Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?

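Summary

As the context limits of large language models (LLMs) grow, the range of possible applications and downstream functions broadens.
In many real-world tasks, decisions depend on details scattered across collections of often disparate documents that contain mostly irrelevant information.
Long-context LLMs appear well suited to this kind of complex information retrieval and reasoning, which has traditionally been costly and time-consuming.
However, although long-context models have advanced rapidly in recent years, our understanding of how effectively LLMs use their context has not kept pace.
To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window.
Strikingly, we find that many models are remarkably thread-safe: they can follow multiple threads simultaneously without a significant loss in performance.
Still, for many models the effective context limit is significantly shorter than the supported context length, and accuracy decreases as the context window grows.
Our study also highlights an important point: token counts from different tokenizers should not be compared directly, because they often correspond to substantially different numbers of characters.
We release our code and long-context experimental data.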
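
The point about token counts can be illustrated with a small sketch. The two tokenizers below are hypothetical toys written for this example (they are not the tokenizers used in the paper or by any real model); they only show that the same text can map to very different token counts, so a context limit quoted in tokens implies a different number of characters under each tokenizer.

```python
def whitespace_tokenize(text):
    """Toy tokenizer (assumption): one token per whitespace-separated word."""
    return text.split()


def chunk_tokenize(text, size=3):
    """Toy tokenizer (assumption): fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def chars_per_token(text, tokenize):
    """Average number of characters covered by one token for this text."""
    tokens = tokenize(text)
    return len(text) / len(tokens)


def approx_chars_for_budget(token_budget, text, tokenize):
    """Rough character capacity of a token budget, estimated from a sample text."""
    return token_budget * chars_per_token(text, tokenize)


if __name__ == "__main__":
    sample = "long context models read many documents"
    for name, tok in [("whitespace", whitespace_tokenize), ("chunk", chunk_tokenize)]:
        print(name, len(tok(sample)), "tokens,",
              chars_per_token(sample, tok), "chars/token")
```

Under these toy assumptions, the same token budget corresponds to roughly twice as many characters with the word-level tokenizer as with the three-character one, which is why comparing context lengths in characters is safer than comparing raw token counts.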

Summary (original)

As the context limits of Large Language Models (LLMs) increase, the range of possible applications and downstream functions broadens. In many real-world tasks, decisions depend on details scattered across collections of often disparate documents containing mostly irrelevant information. Long-context LLMs appear well-suited to this form of complex information retrieval and reasoning, which has traditionally proven costly and time-consuming. However, although the development of longer context models has seen rapid gains in recent years, our understanding of how effectively LLMs use their context has not kept pace. To address this, we conduct a set of retrieval experiments designed to evaluate the capabilities of 17 leading LLMs, such as their ability to follow threads of information through the context window. Strikingly, we find that many models are remarkably threadsafe: capable of simultaneously following multiple threads without significant loss in performance. Still, for many models, we find the effective context limit is significantly shorter than the supported context length, with accuracy decreasing as the context window grows. Our study also highlights the important point that token counts from different tokenizers should not be directly compared — they often correspond to substantially different numbers of written characters. We release our code and long-context experimental data.

arXiv information

Authors	Jonathan Roberts, Kai Han, Samuel Albanie
Published	2025-04-23 07:50:49+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
