MedCaseReasoning: Evaluating and learning diagnostic reasoning from clinical case reports

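Abstract

Doctors and patients alike increasingly use large language models (LLMs) to diagnose clinical cases.
Unlike math or coding, where the final answer objectively determines correctness, medical diagnosis requires both the outcome and the reasoning process to be accurate.
Widely used medical benchmarks such as MedQA and MMLU currently assess only final-answer accuracy and overlook the quality and faithfulness of the clinical reasoning process.
To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning.
The dataset contains 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports.
We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in both their diagnoses and their reasoning. For example, DeepSeek-R1, the top-performing open-source model, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall).
However, we demonstrate that fine-tuning LLMs on reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall, by average relative gains of 29% and 41%, respectively.
The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.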

Abstract (original)

Doctors and patients alike increasingly use Large Language Models (LLMs) to diagnose clinical cases. However, unlike domains such as math or coding, where correctness can be objectively defined by the final answer, medical diagnosis requires both the outcome and the reasoning process to be accurate. Currently, widely used medical benchmarks like MedQA and MMLU assess only accuracy in the final answer, overlooking the quality and faithfulness of the clinical reasoning process. To address this limitation, we introduce MedCaseReasoning, the first open-access dataset for evaluating LLMs on their ability to align with clinician-authored diagnostic reasoning. The dataset includes 14,489 diagnostic question-and-answer cases, each paired with detailed reasoning statements derived from open-access medical case reports. We evaluate state-of-the-art reasoning LLMs on MedCaseReasoning and find significant shortcomings in their diagnoses and reasoning: for instance, the top-performing open-source model, DeepSeek-R1, achieves only 48% 10-shot diagnostic accuracy and mentions only 64% of the clinician reasoning statements (recall). However, we demonstrate that fine-tuning LLMs on the reasoning traces derived from MedCaseReasoning significantly improves diagnostic accuracy and clinical reasoning recall by an average relative gain of 29% and 41%, respectively. The open-source dataset, code, and models are available at https://github.com/kevinwu23/Stanford-MedCaseReasoning.
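The abstract reports two kinds of numbers: reasoning recall (the share of clinician reasoning statements that a model's reasoning mentions) and average relative gain after fine-tuning. The following is a minimal sketch of how such quantities are commonly computed, not the authors' evaluation code. The function names and the example values are hypothetical, and the sketch assumes that the decision of whether each clinician statement is mentioned has already been made and is supplied as a list of booleans.

```python
from typing import Sequence


def reasoning_recall(statement_mentioned: Sequence[bool]) -> float:
    """Fraction of clinician reasoning statements that the model's reasoning mentions."""
    if not statement_mentioned:
        raise ValueError("need at least one clinician reasoning statement")
    return sum(statement_mentioned) / len(statement_mentioned)


def relative_gain(before: float, after: float) -> float:
    """Improvement of `after` over `before`, expressed as a fraction of `before`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


# Made-up illustration values, not results from the paper.
mentioned = [True, True, False, True]   # 3 of 4 clinician statements are covered
recall = reasoning_recall(mentioned)
gain = relative_gain(0.40, 0.50)        # a metric moving from 0.40 to 0.50
```

Under this reading, a relative gain is measured against the baseline score, so it is not the same as a gain in percentage points.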

arXiv information

Authors	Kevin Wu, Eric Wu, Rahul Thapa, Kevin Wei, Angela Zhang, Arvind Suresh, Jacqueline J. Tao, Min Woo Sun, Alejandro Lozano, James Zou
Published	2025-05-20 15:56:52+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
