Literary Evidence Retrieval via Long-Context Language Models

Abstract

How well do modern long-context language models understand literary fiction? We explore this question via the task of literary evidence retrieval, repurposing the RELiC dataset of Thai et al. (2022) to construct a benchmark where the entire text of a primary source (e.g., The Great Gatsby) is provided to an LLM alongside literary criticism with a missing quotation from that work. This setting, in which the model must generate the missing quotation, mirrors the human process of literary analysis by requiring models to perform both global narrative reasoning and close textual examination. We curate a high-quality subset of 292 examples through extensive filtering and human verification. Our experiments show that recent reasoning models, such as Gemini Pro 2.5, can exceed human expert performance (62.5% vs. 50% accuracy). In contrast, the best open-weight model achieves only 29.1% accuracy, highlighting a wide gap in interpretive reasoning between open-weight and closed-weight models. Despite their speed and apparent accuracy, even the strongest models struggle with nuanced literary signals and overgeneration, signaling open challenges for applying LLMs to literary analysis. We release our dataset and evaluation code to encourage future work in this direction.

arXiv information

Authors	Katherine Thai, Mohit Iyyer
Published	2025-06-03 17:19:45+00:00

