Comparative Analysis of GPT-4 and Human Graders in Evaluating Praise Given to Students in Synthetic Dialogues

Abstract

Research suggests that providing specific and timely feedback to human tutors enhances their performance. However, it presents challenges due to the time-consuming nature of assessing tutor performance by human evaluators. Large language models, such as the AI-chatbot ChatGPT, hold potential for offering constructive feedback to tutors in practical settings. Nevertheless, the accuracy of AI-generated feedback remains uncertain, with scant research investigating the ability of models like ChatGPT to deliver effective feedback. In this work-in-progress, we evaluate 30 dialogues generated by GPT-4 in a tutor-student setting. We use two different prompting approaches, the zero-shot chain of thought and the few-shot chain of thought, to identify specific components of effective praise based on five criteria. These approaches are then compared to the results of human graders for accuracy. Our goal is to assess the extent to which GPT-4 can accurately identify each praise criterion. We found that both zero-shot and few-shot chain of thought approaches yield comparable results. GPT-4 performs moderately well in identifying instances when the tutor offers specific and immediate praise. However, GPT-4 underperforms in identifying the tutor’s ability to deliver sincere praise, particularly in the zero-shot prompting scenario where examples of sincere tutor praise statements were not provided. Future work will focus on enhancing prompt engineering, developing a more general tutoring rubric, and evaluating our method using real-life tutoring dialogues.
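
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The prompt wording, the function names `build_prompt` and `per_criterion_accuracy`, and the toy labels are all assumptions made for illustration. Only three of the five criteria (specific, immediate, sincere) are named in the abstract, so only those appear here. Zero-shot and few-shot chain of thought differ only in whether worked examples are included in the prompt, and accuracy is taken to be the fraction of dialogues where the model label matches the human label.

```python
from typing import Dict, List, Sequence, Tuple

# Common zero-shot chain-of-thought cue; the paper's exact wording is not given
# in the abstract, so this phrasing is an assumption.
COT_CUE = "Let's think step by step."


def build_prompt(
    dialogue: str,
    criterion: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Build a grading prompt for one praise criterion.

    With no examples this is a zero-shot chain-of-thought prompt; with
    (dialogue, worked reasoning) pairs it becomes a few-shot one.
    """
    parts = [
        f"Does the tutor's praise in the dialogue meet the criterion "
        f"'{criterion}'? Reason first, then answer yes or no."
    ]
    for ex_dialogue, ex_reasoning in examples:
        parts.append(
            f"Example dialogue:\n{ex_dialogue}\n"
            f"Reasoning and answer:\n{ex_reasoning}"
        )
    parts.append(f"Dialogue:\n{dialogue}\n{COT_CUE}")
    return "\n\n".join(parts)


def per_criterion_accuracy(
    model_labels: Dict[str, List[int]],
    human_labels: Dict[str, List[int]],
) -> Dict[str, float]:
    """Return, per criterion, the share of dialogues where model == human."""
    accuracy = {}
    for criterion, human in human_labels.items():
        model = model_labels[criterion]
        if not human or len(model) != len(human):
            raise ValueError(f"label lists for '{criterion}' must match and be non-empty")
        matches = sum(m == h for m, h in zip(model, human))
        accuracy[criterion] = matches / len(human)
    return accuracy


# Toy labels for four hypothetical dialogues (1 = criterion met, 0 = not met).
# These are invented for illustration and are NOT results from the paper.
TOY_HUMAN = {"specific": [1, 1, 0, 1], "immediate": [1, 0, 0, 1], "sincere": [1, 1, 1, 0]}
TOY_MODEL = {"specific": [1, 1, 0, 0], "immediate": [1, 0, 0, 1], "sincere": [0, 1, 0, 0]}
```

Calling `per_criterion_accuracy(TOY_MODEL, TOY_HUMAN)` gives one accuracy value per criterion, which is the kind of per-criterion comparison the abstract reports for each prompting approach.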

arXiv information

Authors	Dollaya Hirunyasiri, Danielle R. Thomas, Jionghao Lin, Kenneth R. Koedinger, Vincent Aleven
Published	2023-07-05 04:14:01+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google