KRISTEVA: Close Reading as a Novel Task for Benchmarking Interpretive Reasoning

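Abstract

In college-level English courses, tens of millions of essays are written and graded every year. Students are asked to analyze literary and cultural texts through a process known as close reading. Although close reading is regarded as a foundation of critical thinking and is widely adopted as a required element of university coursework, it has never been evaluated on large language models (LLMs), and multi-discipline benchmarks such as MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning. With KRISTEVA, we propose three progressively more difficult sets of tasks that approximate different elements of the close reading process, in order to test how well LLMs can understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results show that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% to 69.7%), their performance trails that of experienced human evaluators on 10 of the 11 tasks.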

Abstract (original)

Each year, tens of millions of essays are written and graded in college-level English courses. Students are asked to analyze literary and cultural texts through a process known as close reading, in which they gather textual details to formulate evidence-based arguments. Despite being viewed as a basis for critical thinking and widely adopted as a required element of university coursework, close reading has never been evaluated on large language models (LLMs), and multi-discipline benchmarks like MMLU do not include literature as a subject. To fill this gap, we present KRISTEVA, the first close reading benchmark for evaluating interpretive reasoning, consisting of 1331 multiple-choice questions adapted from classroom data. With KRISTEVA, we propose three progressively more difficult sets of tasks to approximate different elements of the close reading process, which we use to test how well LLMs may seem to understand and reason about literary works: 1) extracting stylistic features, 2) retrieving relevant contextual information from parametric knowledge, and 3) multi-hop reasoning between style and external contexts. Our baseline results find that, while state-of-the-art LLMs possess some college-level close reading competency (accuracy 49.7% – 69.7%), their performances still trail those of experienced human evaluators on 10 out of our 11 tasks.

arXiv information

Authors	Peiqi Sui, Juan Diego Rodriguez, Philippe Laban, Dean Murphy, Joseph P. Dexter, Richard Jean So, Samuel Baker, Pramit Chaudhuri
Published	2025-06-03 15:11:26+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
