MedEval: A Multi-Level, Multi-Task, and Multi-Domain Medical Benchmark for Language Model Evaluation

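Summary

Curated datasets for healthcare are often limited because they require manual annotation by experts.
In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark intended to facilitate the development of language models for healthcare.
MedEval is comprehensive: it consists of data from several healthcare systems and spans 35 human body regions across 8 examination modalities.
With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, which allows fine-grained use of the data and supports a wide range of tasks.
In addition, we systematically evaluated 10 generic and domain-specific language models under zero-shot and fine-tuning settings, ranging from domain-adapted healthcare baselines to general-purpose state-of-the-art large language models (e.g., ChatGPT).
Our evaluations show that the effectiveness of these two categories of language models varies across tasks, and from this we note the importance of instruction tuning for few-shot use of large language models.
Our investigation paves the way toward benchmarking language models for healthcare and offers insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.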

Abstract (original)

Curated datasets for healthcare are often limited due to the need of human annotations from experts. In this paper, we present MedEval, a multi-level, multi-task, and multi-domain medical benchmark to facilitate the development of language models for healthcare. MedEval is comprehensive and consists of data from several healthcare systems and spans 35 human body regions from 8 examination modalities. With 22,779 collected sentences and 21,228 reports, we provide expert annotations at multiple levels, offering a granular potential usage of the data and supporting a wide range of tasks. Moreover, we systematically evaluated 10 generic and domain-specific language models under zero-shot and finetuning settings, from domain-adapted baselines in healthcare to general-purposed state-of-the-art large language models (e.g., ChatGPT). Our evaluations reveal varying effectiveness of the two categories of language models across different tasks, from which we notice the importance of instruction tuning for few-shot usage of large language models. Our investigation paves the way toward benchmarking language models for healthcare and provides valuable insights into the strengths and limitations of adopting large language models in medical domains, informing their practical applications and future advancements.

arXiv information

Authors	Zexue He, Yu Wang, An Yan, Yao Liu, Eric Y. Chang, Amilcare Gentili, Julian McAuley, Chun-Nan Hsu
Published	2023-10-27 16:00:49+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
