Benchmarking Generative AI for Scoring Medical Student Interviews in Objective Structured Clinical Examinations (OSCEs)

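Summary

Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students' communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias.
This study investigated the potential of large language models (LLMs) to automate OSCE evaluation using the Master Interview Rating Scale (MIRS).
We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE transcripts on all 28 MIRS items under zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting conditions.
The models were benchmarked against a dataset of 10 OSCE cases for which 174 expert consensus scores were available.
Model performance was measured using three accuracy metrics (exact, off-by-one, and thresholded).
Averaged across all MIRS items and OSCE cases, the LLMs achieved low exact accuracy (0.27 to 0.44), moderate to high off-by-one accuracy (0.67 to 0.87), and thresholded accuracy (0.75 to 0.88).
Setting the temperature parameter to zero ensured high intra-rater reliability (α = 0.98 for GPT-4o).
CoT, few-shot, and multi-step techniques proved valuable when tailored to specific assessment items.
Performance was consistent across MIRS items, independent of encounter phase and communication domain.
We demonstrated the feasibility of AI-assisted OSCE evaluation and provided a benchmark of multiple LLMs across multiple prompting techniques.
Our work provides a baseline performance assessment for LLMs and lays a foundation for future research on the automated assessment of clinical communication skills.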

Abstract (original)

Objective Structured Clinical Examinations (OSCEs) are widely used to assess medical students’ communication skills, but scoring interview-based assessments is time-consuming and potentially subject to human bias. This study explored the potential of large language models (LLMs) to automate OSCE evaluations using the Master Interview Rating Scale (MIRS). We compared the performance of four state-of-the-art LLMs (GPT-4o, Claude 3.5, Llama 3.1, and Gemini 1.5 Pro) in evaluating OSCE transcripts across all 28 items of the MIRS under the conditions of zero-shot, chain-of-thought (CoT), few-shot, and multi-step prompting. The models were benchmarked against a dataset of 10 OSCE cases with 174 expert consensus scores available. Model performance was measured using three accuracy metrics (exact, off-by-one, thresholded). Averaging across all MIRS items and OSCE cases, LLMs performed with low exact accuracy (0.27 to 0.44), and moderate to high off-by-one accuracy (0.67 to 0.87) and thresholded accuracy (0.75 to 0.88). A zero temperature parameter ensured high intra-rater reliability (α = 0.98 for GPT-4o). CoT, few-shot, and multi-step techniques proved valuable when tailored to specific assessment items. The performance was consistent across MIRS items, independent of encounter phases and communication domains. We demonstrated the feasibility of AI-assisted OSCE evaluation and provided benchmarking of multiple LLMs across multiple prompt techniques. Our work provides a baseline performance assessment for LLMs that lays a foundation for future research into automated assessment of clinical communication skills.
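
The abstract names three accuracy metrics (exact, off-by-one, thresholded) without defining them. The sketch below shows one plausible reading and is not the authors' implementation. It assumes integer item scores, takes "off-by-one" to mean the model score is within one point of the expert score, and takes "thresholded" to mean both scores fall on the same side of a cutoff. The cutoff of 3, the function names, and the example scores are all hypothetical.

```python
from typing import Sequence


def _check(pred: Sequence[int], gold: Sequence[int]) -> None:
    # Both score lists must be non-empty and aligned item by item.
    if len(pred) != len(gold) or len(gold) == 0:
        raise ValueError("pred and gold must be non-empty and the same length")


def exact_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of items where the model score equals the expert score."""
    _check(pred, gold)
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def off_by_one_accuracy(pred: Sequence[int], gold: Sequence[int]) -> float:
    """Fraction of items where the model score is within one point of the expert score."""
    _check(pred, gold)
    return sum(abs(p - g) <= 1 for p, g in zip(pred, gold)) / len(gold)


def thresholded_accuracy(
    pred: Sequence[int], gold: Sequence[int], threshold: int = 3
) -> float:
    """Fraction of items where both scores fall on the same side of `threshold`.

    The cutoff value is an assumption made for illustration only.
    """
    _check(pred, gold)
    return sum((p >= threshold) == (g >= threshold) for p, g in zip(pred, gold)) / len(gold)


# Hypothetical scores for five rating-scale items (not data from the paper).
gold = [5, 3, 1, 4, 2]  # expert consensus scores
pred = [5, 4, 3, 4, 1]  # model-assigned scores

print(exact_accuracy(pred, gold))        # 0.4
print(off_by_one_accuracy(pred, gold))   # 0.8
print(thresholded_accuracy(pred, gold))  # 0.8
```

Under these definitions exact accuracy can never exceed off-by-one accuracy, which matches the ordering of the ranges reported above (0.27 to 0.44 versus 0.67 to 0.87).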

arXiv information

Authors	Jadon Geathers, Yann Hicke, Colleen Chan, Niroop Rajashekar, Justin Sewell, Susannah Cornes, Rene F. Kizilcec, Dennis Shung
Published	2025-05-15 17:09:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
