Minerva: A Programmable Memory Test Benchmark for Language Models

Summary

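How effectively can LLM-based AI assistants use their memory (context) to perform various tasks?
Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lacking in actionable insights.
In this paper, we present a framework for automatically generating a comprehensive set of tests that evaluate a model's ability to use its memory effectively.
Our framework extends the range of capability tests beyond the commonly explored search tasks (passkey, key-value, needle in a haystack), which are the dominant focus of the literature.
Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, and comparing information in context memory, as well as performing basic operations when inputs are structured into distinct blocks that simulate real-world data.
In addition, we design composite tests to investigate a model's ability to maintain state while operating on memory.
Our benchmark enables an interpretable, detailed assessment of the memory capabilities of LLMs.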
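
To make the idea of programmatically generated memory tests concrete, the following is a minimal sketch, not the paper's implementation. It assumes one atomic task (key-value search) and one composite task (tracking a counter through a sequence of updates). The function names, prompt wording, and exact-match scoring are all hypothetical choices made for illustration.

```python
import random
import string


def make_kv_search_test(num_pairs: int, seed: int) -> dict:
    """Hypothetical atomic test: find the value stored under one key."""
    rng = random.Random(seed)  # seeded so the same test can be regenerated
    keys = rng.sample(range(10_000, 100_000), num_pairs)  # unique 5-digit keys
    pairs = {
        str(k): "".join(rng.choices(string.ascii_lowercase, k=8)) for k in keys
    }
    target = rng.choice(list(pairs))
    context = "\n".join(f"{k}: {v}" for k, v in pairs.items())
    prompt = f"{context}\n\nWhat value is stored under key {target}?"
    return {"prompt": prompt, "answer": pairs[target]}


def make_state_tracking_test(num_steps: int, seed: int) -> dict:
    """Hypothetical composite test: maintain a counter across many updates."""
    rng = random.Random(seed)
    total = rng.randint(0, 20)
    lines = [f"The counter starts at {total}."]
    for _ in range(num_steps):
        delta = rng.randint(1, 9)
        if rng.random() < 0.5:
            total += delta
            lines.append(f"Add {delta}.")
        else:
            total -= delta
            lines.append(f"Subtract {delta}.")
    prompt = "\n".join(lines) + "\n\nWhat is the final value of the counter?"
    return {"prompt": prompt, "answer": str(total)}


def score(model_output: str, answer: str) -> bool:
    """Exact-match scoring after trimming surrounding whitespace."""
    return model_output.strip() == answer
```

Because each generator takes a size parameter and a seed, fresh test instances of any length can be produced on demand, which is one way a benchmark can avoid being static. A failure on one generator and not the other also points to a specific capability rather than an aggregate score.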

Summary (original)

How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights, failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models’ abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, and performing basic operations when inputs are structured into distinct blocks, simulating real-world data. Additionally, we design composite tests to investigate the models’ ability to maintain state while operating on memory. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.

arXiv information

Authors	Menglin Xia, Victor Ruehle, Saravan Rajmohan, Reza Shokri
Published	2025-02-05 16:53:45+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
