Disentangling Reasoning and Knowledge in Medical Large Language Models

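Summary

Medical reasoning in large language models (LLMs) aims to emulate clinicians' diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix factual recall with reasoning.
To address this, we separate 11 biomedical QA benchmarks into reasoning-focused and knowledge-focused subsets using a PubMedBERT classifier that reaches 81% accuracy, comparable to human performance.
Our analysis shows that only 32.8% of the questions require complex reasoning.
We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3) and find a consistent gap between knowledge and reasoning performance.
For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning.
In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models are more robust.
To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples.
It achieves the strongest performance among similarly sized models.
Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.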
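
The knowledge versus reasoning gap can be illustrated with a small sketch. It assumes each benchmark question already carries a subset label (for instance, one assigned by a classifier) and a flag saying whether the model answered correctly. The record layout, the function names, and the toy records are hypothetical and are not taken from the paper; only the 60.5 and 47.1 figures for m1 come from the abstract.

```python
from collections import defaultdict


def accuracy_by_subset(records):
    """Return accuracy in percent for each subset label.

    Each record is a dict with two hypothetical keys:
      "subset":  "knowledge" or "reasoning"
      "correct": True if the model answered the question correctly
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for record in records:
        totals[record["subset"]] += 1
        hits[record["subset"]] += int(record["correct"])
    return {name: 100.0 * hits[name] / totals[name] for name in totals}


def knowledge_reasoning_gap(scores):
    """Difference in percentage points between knowledge and reasoning accuracy."""
    return scores["knowledge"] - scores["reasoning"]


# Toy records invented for illustration only (not data from the paper).
toy_records = [
    {"subset": "knowledge", "correct": True},
    {"subset": "knowledge", "correct": True},
    {"subset": "knowledge", "correct": False},
    {"subset": "knowledge", "correct": True},
    {"subset": "reasoning", "correct": True},
    {"subset": "reasoning", "correct": False},
]
toy_scores = accuracy_by_subset(toy_records)
toy_gap = knowledge_reasoning_gap(toy_scores)

# The m1 scores quoted in the abstract give a gap of about 13.4 points.
m1_gap = round(knowledge_reasoning_gap({"knowledge": 60.5, "reasoning": 47.1}), 1)
print(toy_scores, toy_gap, m1_gap)
```

Reporting the two subset scores side by side, rather than one pooled accuracy, is what makes the gap visible: a pooled number would be dominated by the larger knowledge subset.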

Summary (original)

Medical reasoning in large language models (LLMs) aims to emulate clinicians’ diagnostic thinking, but current benchmarks such as MedQA-USMLE, MedMCQA, and PubMedQA often mix reasoning with factual recall. We address this by separating 11 biomedical QA benchmarks into reasoning- and knowledge-focused subsets using a PubMedBERT classifier that reaches 81 percent accuracy, comparable to human performance. Our analysis shows that only 32.8 percent of questions require complex reasoning. We evaluate biomedical models (HuatuoGPT-o1, MedReason, m1) and general-domain models (DeepSeek-R1, o4-mini, Qwen3), finding consistent gaps between knowledge and reasoning performance. For example, m1 scores 60.5 on knowledge but only 47.1 on reasoning. In adversarial tests where models are misled with incorrect initial reasoning, biomedical models degrade sharply, while larger or RL-trained general models show more robustness. To address this, we train BioMed-R1 using fine-tuning and reinforcement learning on reasoning-heavy examples. It achieves the strongest performance among similarly sized models. Further gains may come from incorporating clinical case reports and training with adversarial and backtracking scenarios.
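
The adversarial setting described in the abstract, where a model is misled with incorrect initial reasoning, might be sketched as follows. The prompt layout, function names, and example question are assumptions made for illustration; the abstract does not specify how the misleading reasoning is injected or how the degradation is scored.

```python
def build_adversarial_prompt(question, options, misleading_reasoning):
    """Format a multiple-choice question and seed it with flawed initial reasoning.

    `options` maps answer letters to answer text. The layout below is a
    hypothetical template, not the one used in the paper.
    """
    lines = [f"Question: {question}"]
    for letter, text in sorted(options.items()):
        lines.append(f"{letter}. {text}")
    lines.append("Reasoning so far: " + misleading_reasoning)
    lines.append("Continue the reasoning and give the final answer letter.")
    return "\n".join(lines)


def accuracy(flags):
    """Percentage of True values in a list of per-question correctness flags."""
    return 100.0 * sum(flags) / len(flags)


def robustness_drop(clean_flags, adversarial_flags):
    """Accuracy lost (in percentage points) when the misleading reasoning is added."""
    return accuracy(clean_flags) - accuracy(adversarial_flags)


# Illustrative example; the misleading reasoning points at the wrong option.
prompt = build_adversarial_prompt(
    question="Which vitamin deficiency causes scurvy?",
    options={"A": "Vitamin C", "B": "Vitamin D"},
    misleading_reasoning="Scurvy is mainly a bone disease, so vitamin D seems likely.",
)
print(prompt)

# Invented correctness flags for four questions, with and without the misleading prefix.
drop = robustness_drop([True, True, True, False], [True, False, False, False])
print(drop)
```

A model that backtracks from the flawed start would show a small drop; one that follows the injected reasoning would show a large one.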

arXiv information

Authors	Rahul Thapa,Qingyang Wu,Kevin Wu,Harrison Zhang,Angela Zhang,Eric Wu,Haotian Ye,Suhana Bedi,Nevin Aresh,Joseph Boen,Shriya Reddy,Ben Athiwaratkun,Shuaiwen Leon Song,James Zou
Published	2025-05-16 17:16:27+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
