NoLiMa: Long-Context Evaluation Beyond Literal Matching

Summary

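Recent large language models (LLMs) support long contexts of 128K to 1M tokens.
A common way to evaluate this capability is the needle-in-a-haystack (NIAH) test, in which a model must retrieve a "needle" (relevant information) from a "haystack" (a long, irrelevant context).
Extensions of this approach add more distractors, fact chaining, and in-context reasoning.
In these benchmarks, however, models can exploit existing literal matches between the needle and the haystack to simplify the task.
To address this, we introduce NoLiMa, a benchmark that extends NIAH with a carefully designed needle set in which questions and needles have minimal lexical overlap, so models must infer latent associations to locate the needle within the haystack.
We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens.
They perform well in short contexts (<1K), but performance degrades significantly as context length increases. At 32K, for example, 10 models fall below 50% of their strong short-context baselines. Even GPT-4o, one of the best-performing exceptions, drops from a near-perfect baseline of 99.3% to 69.7%. Our analysis suggests that these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, which makes relevant information harder to retrieve.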
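
The "50% of baseline" criterion is a ratio between a model's long-context score and its own short-context score. The following is a minimal sketch of that arithmetic using only the two GPT-4o figures quoted in the abstract; the function name `retained_fraction` is a hypothetical helper, not something defined by the benchmark.

```python
def retained_fraction(baseline_pct: float, long_context_pct: float) -> float:
    """Fraction of the short-context baseline score that survives at a longer context."""
    return long_context_pct / baseline_pct


# Figures quoted in the abstract for GPT-4o: 99.3% baseline, 69.7% after the drop.
gpt4o = retained_fraction(99.3, 69.7)

# A model counts as "below 50% of its baseline" when this fraction is under 0.5.
below_half = gpt4o < 0.5

print(f"GPT-4o retains {gpt4o:.1%} of its baseline; below half: {below_half}")
```

With these numbers GPT-4o keeps roughly 70% of its baseline, which is why the abstract lists it among the exceptions rather than among the 10 models that fall below half.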

Abstract (original)

Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a ‘needle’ (relevant information) from a ‘haystack’ (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
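
To make "minimal lexical overlap" concrete, here is a rough sketch that scores how many content words a question shares with a needle. Everything in it is an assumption for illustration: the Jaccard measure, the small stop-word list, and the example needle and questions are hypothetical and are not taken from the NoLiMa needle set or the authors' construction procedure.

```python
import re

# Small illustrative stop-word list (an assumption, not from the paper).
STOP_WORDS = {"the", "a", "an", "to", "of", "in", "her", "his",
              "has", "been", "who", "which", "is"}


def content_words(text: str) -> set[str]:
    """Lowercase the text and keep alphanumeric tokens that are not stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def lexical_overlap(question: str, needle: str) -> float:
    """Jaccard overlap between the content words of a question and a needle."""
    q, n = content_words(question), content_words(needle)
    if not q or not n:
        return 0.0
    return len(q & n) / len(q | n)


# Hypothetical needle and two hypothetical questions about it.
needle = "Mara keeps her boat moored beside the Golden Gate Bridge."
literal_question = "Who keeps a boat beside the Golden Gate Bridge?"
latent_question = "Which character has been to San Francisco?"

print(lexical_overlap(literal_question, needle))  # 0.75: answerable by literal matching
print(lexical_overlap(latent_question, needle))   # 0.0: needs the bridge-to-city association
```

The second question shares no content words with the needle, so a model has to know that the Golden Gate Bridge is in San Francisco to find it. That is the kind of latent association the abstract describes.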

arXiv information

Authors	Ali Modarressi, Hanieh Deilamsalehy, Franck Dernoncourt, Trung Bui, Ryan A. Rossi, Seunghyun Yoon, Hinrich Schütze
Published	2025-02-07 18:49:46+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
