Automatic Generation and Evaluation of Reading Comprehension Test Items with Large Language Models

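Summary

Reading comprehension tests are used for a variety of purposes, ranging from education to assessing the comprehensibility of simplified texts.
However, creating such tests manually and ensuring their quality is difficult and time-consuming.
In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items.
To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric called text informativity, which is based on guessability and answerability.
We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4.
Our results suggest that both models can generate items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them.
In this scenario, the evaluation results obtained with GPT-4 were the most similar to those of human annotators.
Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.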

Summary (original)

Reading comprehension tests are used in a variety of applications, ranging from education to assessing the comprehensibility of simplified texts. However, creating such tests manually and ensuring their quality is difficult and time-consuming. In this paper, we explore how large language models (LLMs) can be used to generate and evaluate multiple-choice reading comprehension items. To this end, we compiled a dataset of German reading comprehension items and developed a new protocol for human and automatic evaluation, including a metric we call text informativity, which is based on guessability and answerability. We then used this protocol and the dataset to evaluate the quality of items generated by Llama 2 and GPT-4. Our results suggest that both models are capable of generating items of acceptable quality in a zero-shot setting, but GPT-4 clearly outperforms Llama 2. We also show that LLMs can be used for automatic evaluation by eliciting item responses from them. In this scenario, evaluation results with GPT-4 were the most similar to human annotators. Overall, zero-shot generation with LLMs is a promising approach for generating and evaluating reading comprehension test items, in particular for languages without large amounts of available data.
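
The abstract says only that text informativity is based on guessability and answerability; it does not give a formula. As a minimal sketch, assume guessability is the share of correct responses given without seeing the text, answerability is the share of correct responses given with the text, and text informativity is their difference. The `ItemResponses` structure and all function names below are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class ItemResponses:
    """Responses to one multiple-choice item (hypothetical structure)."""
    correct_option: int
    answers_without_text: list[int]  # options chosen without seeing the text
    answers_with_text: list[int]     # options chosen after reading the text


def accuracy(answers: list[int], correct_option: int) -> float:
    """Share of responses that picked the correct option."""
    if not answers:
        raise ValueError("need at least one response")
    return sum(a == correct_option for a in answers) / len(answers)


def guessability(item: ItemResponses) -> float:
    # Assumption: accuracy when respondents cannot see the text.
    return accuracy(item.answers_without_text, item.correct_option)


def answerability(item: ItemResponses) -> float:
    # Assumption: accuracy when respondents can see the text.
    return accuracy(item.answers_with_text, item.correct_option)


def text_informativity(item: ItemResponses) -> float:
    # Assumption: how much access to the text improves accuracy.
    return answerability(item) - guessability(item)


item = ItemResponses(
    correct_option=2,
    answers_without_text=[0, 2, 1, 3],
    answers_with_text=[2, 2, 2, 1],
)
print(guessability(item), answerability(item), text_informativity(item))
```

Under this reading, an item that respondents can answer correctly without the text scores low, since the text adds little. The responses could come from human annotators or from an LLM prompted to answer the item, in line with the automatic evaluation described in the abstract.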

arXiv information

Authors	Andreas Säuberli, Simon Clematide
Published	2024-04-11 13:11:21+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
