NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens

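Abstract

Recent advances in large language models (LLMs) have pushed the boundaries of natural language processing, particularly in long-context understanding.
However, evaluating the long-context abilities of these models remains a challenge because of the limitations of current benchmarks.
To address this gap, we introduce NovelQA, a benchmark tailored to evaluating LLMs on complex, extended narratives.
Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs.
This paper details the design and construction of NovelQA, focusing on its comprehensive manual annotation process and the variety of question types aimed at evaluating nuanced comprehension.
The evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses.
Notably, the models struggle with multi-hop reasoning, detail-oriented questions, and handling extremely long inputs, with average lengths exceeding 200,000 tokens.
The results highlight the need for substantial advances in LLMs to enhance their long-context comprehension and to contribute effectively to computational literary analysis.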

Abstract (original)

Recent advancements in Large Language Models (LLMs) have pushed the boundaries of natural language processing, especially in long-context understanding. However, the evaluation of these models’ long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark tailored for evaluating LLMs with complex, extended narratives. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper details the design and construction of NovelQA, focusing on its comprehensive manual annotation process and the variety of question types aimed at evaluating nuanced comprehension. Our evaluation of long-context LLMs on NovelQA reveals significant insights into their strengths and weaknesses. Notably, the models struggle with multi-hop reasoning, detail-oriented questions, and handling extremely long inputs, with average lengths exceeding 200,000 tokens. Results highlight the need for substantial advancements in LLMs to enhance their long-context comprehension and contribute effectively to computational literary analysis.

arXiv information

Authors	Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, Cheng Deng, Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang
Published	2025-04-23 12:52:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
