Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Abstract

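The large language model (LLM) community has recently shown growing interest in improving how well LLMs handle extremely long documents.
As more long-text techniques and model architectures appear, evaluating models' long-text capabilities precisely and in detail has become increasingly important.
Existing long-text benchmarks such as L-Eval and LongBench build their test sets from open-source datasets and focus mainly on QA and summarization tasks.
These datasets mix test samples of widely varying lengths (from 2k to 32k+ tokens), which makes it hard to assess a model's capability within any particular length range.
They also do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to support.
This paper introduces Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs.
Ada-LEval comprises two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long-context capabilities.
Both subsets support fine-grained control over test-case length and can easily produce text samples of up to 128k tokens.
The authors use Ada-LEval to evaluate 4 state-of-the-art closed-source API models and 6 open-source models.
The results expose the limitations of current LLMs, especially in ultra-long-context settings.
The code is available at https://github.com/open-compass/Ada-LEval.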
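The abstract does not spell out how test-case length is controlled, so the following is only a minimal sketch of what "length-adaptable" could mean for a BestAnswer-style task: one correct answer is kept fixed, and distractor answers are appended until a token budget is reached. The function names, the dictionary layout, and the whitespace-based token count are all assumptions made for illustration; they are not taken from the Ada-LEval repository.

```python
import random


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for model tokens.
    # A real benchmark would use the tokenizer of the model under test.
    return len(text.split())


def build_case(question, best_answer, distractors, target_tokens, seed=0):
    """Hypothetical helper: grow a test case toward a target length.

    Distractor answers are added in order until adding the next one
    would push the approximate token count past target_tokens.
    """
    candidates = [best_answer]
    used = count_tokens(question) + count_tokens(best_answer)
    for distractor in distractors:
        cost = count_tokens(distractor)
        if used + cost > target_tokens:
            break
        candidates.append(distractor)
        used += cost

    # Shuffle so the position of the correct answer carries no signal.
    random.Random(seed).shuffle(candidates)
    return {
        "question": question,
        "candidates": candidates,
        "label": candidates.index(best_answer),  # index of the correct answer
        "approx_tokens": used,
    }


if __name__ == "__main__":
    pool = [f"distractor number {i}" for i in range(100)]
    for budget in (20, 50):
        case = build_case("how to sort a list", "use sorted", pool, budget)
        print(budget, len(case["candidates"]), case["approx_tokens"])
```

Because the same question can be rebuilt at any budget, every sample in a test set can share one length setting, which avoids the mixed-length problem the abstract attributes to earlier benchmarks.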

Abstract (original)

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs’ capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models’ long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets include test samples of varying lengths (from 2k to 32k+) entangled together, making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultralong settings (100k+ tokens) that the latest LLMs claim to achieve. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs’ long context capabilities. These benchmarks support intricate manipulation of the length of test cases, and can easily produce text samples up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval.

arXiv information

Authors	Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen
Published	2024-04-10 07:40:56+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
