An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models

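Summary

Large language models (LLMs) have been highly successful at interacting with humans.
However, recent studies show that these models often hallucinate, which leads to overconfident and incorrect judgments.
This limits their use in the medical domain, where tasks demand the highest accuracy.
This paper introduces an automatic evaluation framework that assesses the practical capabilities of LLMs acting as virtual doctors in multi-turn consultations.
The consultation tasks are designed so that an LLM must recognize what it does not know, ask the patient for the missing medical information, and finally make a diagnosis.
To evaluate LLM performance on these tasks, the authors propose a benchmark built by reformulating medical multiple-choice questions from the United States Medical Licensing Examination (USMLE), develop comprehensive evaluation metrics, and apply them to three constructed test sets.
A medical consultation training set is also constructed to improve the consultation ability of LLMs.
The experimental results show that fine-tuning on this training set can reduce hallucinations and improve LLM performance on the proposed benchmark.
Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.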

Summary (original)

Large language models (LLMs) have achieved significant success in interacting with humans. However, recent studies have revealed that these models often suffer from hallucinations, leading to overly confident but incorrect judgments. This limits their application in the medical domain, where tasks require the utmost accuracy. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations. Consultation tasks are designed to require LLMs to be aware of what they do not know, to inquire about missing medical information from patients, and to ultimately make diagnoses. To evaluate the performance of LLMs for these tasks, a benchmark is proposed by reformulating medical multiple-choice questions from the United States Medical Licensing Examination (USMLE), and comprehensive evaluation metrics are developed and evaluated on three constructed test sets. A medical consultation training set is further constructed to improve the consultation ability of LLMs. The results of the experiments show that fine-tuning with the training set can alleviate hallucinations and improve LLMs' performance on the proposed benchmark. Extensive experiments and ablation studies are conducted to validate the effectiveness and robustness of the proposed framework.

arXiv information

Authors	Yusheng Liao, Yutong Meng, Hongcheng Liu, Yanfeng Wang, Yu Wang
Published	2023-09-05 09:24:48+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
