MiniLongBench: The Low-cost Long Context Understanding Benchmark for Large Language Models

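Summary

Long context understanding (LCU) is an important area of exploration for current large language models (LLMs).
However, because long-text data is inherently lengthy, existing LCU benchmarks for LLMs often incur prohibitively high evaluation costs, such as testing time and inference expenses.
Through extensive experiments, the authors find that existing LCU benchmarks exhibit significant redundancy, which makes evaluation inefficient.
The paper proposes a concise data compression method tailored to long-text data with sparse information characteristics.
Pruning the well-known LCU benchmark LongBench with this method yields MiniLongBench.
The resulting benchmark contains only 237 test samples across six major task categories and 21 distinct tasks.
In an empirical analysis of over 60 LLMs, MiniLongBench reduces the average evaluation cost to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results.
As a low-cost benchmark, MiniLongBench therefore has great potential to substantially advance future research on the LCU capabilities of LLMs.
For the code, data, and tutorial, see https://github.com/MilkThink-Lab/MiniLongBench.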

Abstract (original)

Long Context Understanding (LCU) is a critical area for exploration in current large language models (LLMs). However, due to the inherently lengthy nature of long-text data, existing LCU benchmarks for LLMs often result in prohibitively high evaluation costs, like testing time and inference expenses. Through extensive experimentation, we discover that existing LCU benchmarks exhibit significant redundancy, which means the inefficiency in evaluation. In this paper, we propose a concise data compression method tailored for long-text data with sparse information characteristics. By pruning the well-known LCU benchmark LongBench, we create MiniLongBench. This benchmark includes only 237 test samples across six major task categories and 21 distinct tasks. Through empirical analysis of over 60 LLMs, MiniLongBench achieves an average evaluation cost reduced to only 4.5% of the original while maintaining an average rank correlation coefficient of 0.97 with LongBench results. Therefore, our MiniLongBench, as a low-cost benchmark, holds great potential to substantially drive future research into the LCU capabilities of LLMs. See https://github.com/MilkThink-Lab/MiniLongBench for our code, data and tutorial.
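The abstract reports an average rank correlation coefficient of 0.97 between MiniLongBench and LongBench results. As a rough illustration of what such a figure measures, the sketch below computes a rank correlation between two lists of per-model scores. It assumes Spearman's coefficient (the abstract does not say which rank correlation is used), and the scores and function names are hypothetical, not taken from the paper or its repository.

```python
def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores for five models (illustrative values only).
full_scores = [41.2, 38.5, 47.9, 30.1, 44.0]  # on the full benchmark
mini_scores = [40.0, 41.1, 46.5, 29.8, 45.2]  # on the pruned benchmark

rho = spearman(full_scores, mini_scores)
print(f"rank correlation: {rho:.2f}")
```

A value close to 1 means the pruned benchmark orders the models almost the same way as the full one, which is the property a compressed benchmark needs in order to stand in for the original.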

arXiv information

Authors	Zhongzhan Huang, Guoming Ling, Shanshan Zhong, Hefeng Wu, Liang Lin
Published	2025-05-26 13:21:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
