MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

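Summary

The emergence of various medical large language models (LLMs) has highlighted the need for unified evaluation standards, because evaluating LLMs manually is time-consuming and labor-intensive.
To address this, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain comprising 40,041 questions drawn from authentic examination exercises and medical reports across diverse branches of medicine.
Specifically, the benchmark has four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinical cases covering examinations, diagnoses, and treatments.
By replicating the educational progression and clinical practice experience of doctors in Mainland China, MedBench serves as a credible benchmark for assessing knowledge mastery and reasoning ability in medical language models.
We perform extensive experiments and an in-depth analysis from diverse perspectives, which yield the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision.
(2) Several general-domain LLMs, surprisingly, possess considerable medical knowledge.
These findings clarify both the capabilities and the limitations of LLMs in the context of MedBench, with the ultimate goal of aiding the medical research community.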

Abstract (original)

The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.

arXiv information

Authors	Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He
Published	2023-12-20 07:01:49+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
