Do LLMs Give Psychometrically Plausible Responses in Educational Assessments?

Summary

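Knowing how test takers answer items in educational assessments is essential for developing tests, evaluating item quality, and improving test validity.
However, this process usually requires extensive pilot studies with human participants.
If large language models (LLMs) show human-like response behavior on test items, they could be used as pilot participants to accelerate test development.
This paper evaluates the human-likeness, or psychometric plausibility, of responses from 18 instruction-tuned LLMs on two publicly available datasets of multiple-choice test items covering three subjects: reading, U.S. history, and economics.
The methodology builds on two theoretical frameworks from psychometrics that are commonly used in educational assessment: classical test theory and item response theory.
The results show that larger models are excessively confident, but that their response distributions can become more human-like when calibrated with temperature scaling.
In addition, LLMs tend to correlate better with humans on reading comprehension items than on items from the other subjects.
However, the correlations are not very strong overall, which indicates that LLMs should not be used to pilot educational assessments in a zero-shot setting.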

Abstract (original)

Knowing how test takers answer items in educational assessments is essential for test development, to evaluate item quality, and to improve test validity. However, this process usually requires extensive pilot studies with human participants. If large language models (LLMs) exhibit human-like response behavior to test items, this could open up the possibility of using them as pilot participants to accelerate test development. In this paper, we evaluate the human-likeness or psychometric plausibility of responses from 18 instruction-tuned LLMs with two publicly available datasets of multiple-choice test items across three subjects: reading, U.S. history, and economics. Our methodology builds on two theoretical frameworks from psychometrics which are commonly used in educational assessment, classical test theory and item response theory. The results show that while larger models are excessively confident, their response distributions can be more human-like when calibrated with temperature scaling. In addition, we find that LLMs tend to correlate better with humans in reading comprehension items compared to other subjects. However, the correlations are not very strong overall, indicating that LLMs should not be used for piloting educational assessments in a zero-shot setting.
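The abstract names temperature scaling and classical test theory without showing how they might fit together. The following is a minimal sketch, not the authors' implementation: it assumes each item comes with one logit per answer option, treats the model's probability for the correct option as an analogue of the classical item difficulty (proportion of correct responses), and correlates it with human proportions. All data values, the names `item_logits`, `correct_index`, `human_p_correct`, and the helper `llm_p_correct` are hypothetical and chosen only for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert per-option logits to probabilities; higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Hypothetical data: three four-option items (made-up numbers, not from the paper).
item_logits = [
    [4.0, 1.0, 0.5, 0.2],
    [2.0, 1.8, 0.3, 0.1],
    [3.0, 0.2, 2.9, 0.1],
]
correct_index = [0, 0, 2]            # index of the keyed (correct) option per item
human_p_correct = [0.85, 0.55, 0.40]  # assumed proportion of humans answering correctly


def llm_p_correct(temperature):
    """Model probability of the correct option per item at a given temperature."""
    return [
        softmax_with_temperature(logits, temperature)[key]
        for logits, key in zip(item_logits, correct_index)
    ]


if __name__ == "__main__":
    for t in (0.5, 1.0, 2.0):
        probs = llm_p_correct(t)
        print(f"T={t}: p_correct={probs}, r={pearson(probs, human_p_correct)}")
```

In this sketch the temperature is simply swept over a few values; in practice it would be fitted on held-out data, and an item response theory analysis would need a separate model fit that is not shown here.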

arXiv information

Authors	Andreas Säuberli, Diego Frassinelli, Barbara Plank
Published	2025-06-11 14:41:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
