Benchmarking Large Language Models on CMExam — A Comprehensive Chinese Medical Exam Dataset

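Summary

Recent advances in large language models (LLMs) have transformed the field of question answering (QA).
However, evaluating LLMs in the medical domain remains difficult because standardized, comprehensive datasets are lacking.
To address this gap, we introduce CMExam, a dataset sourced from the Chinese National Medical Licensing Examination.
CMExam consists of more than 60,000 multiple-choice questions for standardized, objective evaluation, together with solution explanations for open-ended evaluation of model reasoning.
To support in-depth analysis of LLMs, we invited medical professionals to label five additional per-question annotations: disease group, clinical department, medical discipline, area of competency, and question difficulty level.
Alongside the dataset, we ran thorough experiments on CMExam with representative LLMs and QA algorithms.
GPT-4 achieved the best accuracy, 61.5%, with a weighted F1 score of 0.616.
This leaves a large gap to human accuracy, which stood at 71.6%.
On the explanation task, LLMs could generate relevant reasoning and improved after fine-tuning, but they still fall short of the desired standard, which indicates ample room for improvement.
To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations.
The experiments and findings also offer valuable insight into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines.
The dataset and related code are available at https://github.com/williamliujl/CMExam.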

Abstract (original)

Recent advancements in large language models (LLMs) have transformed the field of question answering (QA). However, evaluating LLMs in the medical field is challenging due to the lack of standardized and comprehensive datasets. To address this gap, we introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels. Alongside the dataset, we further conducted thorough experiments with representative LLMs and QA algorithms on CMExam. The results show that GPT-4 had the best accuracy of 61.5% and a weighted F1 score of 0.616. These results highlight a great disparity when compared to human accuracy, which stood at 71.6%. For explanation tasks, while LLMs could generate relevant reasoning and demonstrate improved performance after finetuning, they fall short of a desired standard, indicating ample room for improvement. To the best of our knowledge, CMExam is the first Chinese medical exam dataset to provide comprehensive medical annotations. The experiments and findings of LLM evaluation also provide valuable insights into the challenges and potential solutions in developing Chinese medical QA systems and LLM evaluation pipelines. The dataset and relevant code are available at https://github.com/williamliujl/CMExam.
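
The abstract reports two headline numbers for the multiple-choice task: accuracy and weighted F1. As a rough illustration of how such metrics are commonly computed, here is a minimal sketch in plain Python. It assumes each answer is a single option string and that "weighted F1" means per-class F1 averaged by the number of gold answers in each class; the abstract does not spell out the exact definition, so this may differ from the authors' evaluation code. The function names and the toy data are hypothetical.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of questions whose predicted option equals the gold option."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def weighted_f1(gold, pred):
    """Per-class F1, averaged with weights proportional to gold class counts.

    Assumption: this support-weighted average is what "weighted F1" refers to.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    support = Counter(gold)  # number of gold answers per option label
    total = 0.0
    for label, n_gold in support.items():
        true_pos = sum(g == label and p == label for g, p in zip(gold, pred))
        n_pred = sum(p == label for p in pred)
        precision = true_pos / n_pred if n_pred else 0.0
        recall = true_pos / n_gold
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        total += f1 * n_gold
    return total / len(gold)


# Toy data (made up for illustration, not taken from CMExam).
gold_answers = ["A", "B", "C", "A", "D"]
model_answers = ["A", "B", "A", "A", "C"]

print(f"accuracy    = {accuracy(gold_answers, model_answers):.3f}")
print(f"weighted F1 = {weighted_f1(gold_answers, model_answers):.3f}")
```

On the toy data, three of five answers match, so accuracy is 0.6, while weighted F1 is lower because options C and D are never predicted correctly. The two metrics can diverge in this way whenever answer options are unevenly distributed.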

arXiv information

Authors	Junling Liu, Peilin Zhou, Yining Hua, Dading Chong, Zhongyu Tian, Andrew Liu, Helin Wang, Chenyu You, Zhenhua Guo, Lei Zhu, Michael Lingzhi Li
Published	2023-06-05 16:48:41+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
