Beyond Distillation: Pushing the Limits of Medical LLM Reasoning with Minimalist Rule-Based RL

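Abstract

Effective reasoning is needed to improve performance on complex tasks and to enable interpretable decision making in large language models (LLMs), especially for clinical applications.
However, this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o).
In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data.
AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines.
On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet.
To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision?
(ii) How do dataset quantity and diversity affect reasoning?
(iii) How does question difficulty shape the emergence and generalization of reasoning?
Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative multiple-choice QA data is effective at inducing reasoning without CoT supervision.
We also observe divergent trends across benchmarks, which underscores the limitations of current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.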

Abstract (original)

Improving performance on complex tasks and enabling interpretable decision making in large language models (LLMs), especially for clinical applications, requires effective reasoning. Yet this remains challenging without supervised fine-tuning (SFT) on costly chain-of-thought (CoT) data distilled from closed-source models (e.g., GPT-4o). In this work, we present AlphaMed, the first medical LLM to show that reasoning capability can emerge purely through reinforcement learning (RL), using minimalist rule-based rewards on public multiple-choice QA datasets, without relying on SFT or distilled CoT data. AlphaMed achieves state-of-the-art results on six medical QA benchmarks, outperforming models trained with conventional SFT+RL pipelines. On challenging benchmarks (e.g., MedXpert), AlphaMed even surpasses larger or closed-source models such as DeepSeek-V3-671B and Claude-3.5-Sonnet. To understand the factors behind this success, we conduct a comprehensive data-centric analysis guided by three questions: (i) Can minimalist rule-based RL incentivize reasoning without distilled CoT supervision? (ii) How do dataset quantity and diversity impact reasoning? (iii) How does question difficulty shape the emergence and generalization of reasoning? Our findings show that dataset informativeness is a key driver of reasoning performance, and that minimalist RL on informative, multiple-choice QA data is effective at inducing reasoning without CoT supervision. We also observe divergent trends across benchmarks, underscoring limitations in current evaluation and the need for more challenging, reasoning-oriented medical QA benchmarks.
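The abstract does not spell out the reward rule, so the following is only a minimal sketch of what a rule-based reward on multiple-choice QA could look like, not AlphaMed's actual implementation. It assumes, hypothetically, that the model ends its output with a line such as `Answer: B`, that options are labeled A to J, and that the reward is binary. The names `extract_choice` and `rule_based_reward` are illustrative.

```python
import re
from typing import Optional

# Assumed output format (hypothetical): "... Answer: <letter>".
# "answer" is matched case-insensitively; the option letter must be an
# uppercase A-J followed by a word boundary, so "Answer: Because" is ignored.
ANSWER_PATTERN = re.compile(r"(?i:answer)\s*:\s*\(?([A-J])\b")


def extract_choice(completion: str) -> Optional[str]:
    """Return the last option letter stated in the completion, or None."""
    matches = ANSWER_PATTERN.findall(completion)
    return matches[-1] if matches else None


def rule_based_reward(completion: str, gold: str) -> float:
    """Binary reward: 1.0 if the extracted letter equals the gold label, else 0.0."""
    predicted = extract_choice(completion)
    return 1.0 if predicted == gold.strip().upper() else 0.0
```

Under these assumptions the reward needs only the gold option letter from the dataset, with no chain-of-thought annotation; taking the last stated letter lets the model revise its answer during reasoning.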

arXiv information

Authors	Che Liu, Haozhe Wang, Jiazhen Pan, Zhongwei Wan, Yong Dai, Fangzhen Lin, Wenjia Bai, Daniel Rueckert, Rossella Arcucci
Published	2025-05-23 14:27:37+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
