Is larger always better? Evaluating and prompting large language models for non-generative medical tasks

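Summary

The use of large language models (LLMs) in medicine is growing, but their ability to process both structured electronic health record (EHR) data and unstructured clinical notes has not been studied well.
This study benchmarks a range of models, including GPT-based LLMs, BERT-based models, and traditional clinical predictive models, on non-generative medical tasks using well-known datasets.
Using the MIMIC dataset (ICU patient records) and the TJH dataset (early COVID-19 EHR data), the authors evaluated 14 language models (9 GPT-based and 5 BERT-based) and 7 traditional predictive models.
The tasks were mortality and readmission prediction, disease hierarchy reconstruction, and biomedical sentence matching, and both zero-shot and fine-tuned performance were compared.
The results showed that, with well-designed prompting strategies, LLMs have robust zero-shot predictive ability on structured EHR data and frequently surpass traditional models.
On unstructured medical text, however, LLMs did not outperform fine-tuned BERT models, which excelled in both supervised and unsupervised tasks.
LLMs are therefore effective for zero-shot learning on structured data, while fine-tuned BERT models are better suited to unstructured text. This underscores the importance of choosing a model based on the specific task requirements and data characteristics in order to optimize the application of NLP technology in healthcare.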
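To make "zero-shot prediction on structured EHR data" concrete, here is a minimal sketch of how one structured record might be serialized into a text prompt and how a numeric answer might be read back. It is not the authors' prompt design, which this page does not reproduce. The feature names, units, prompt wording, and both helper functions are hypothetical, and the call to an actual LLM is deliberately left out.

```python
import re
from typing import Mapping, Optional


def build_zero_shot_prompt(
    record: Mapping[str, float],
    units: Mapping[str, str],
    task: str,
) -> str:
    """Serialize one structured EHR record into a zero-shot prompt.

    Hypothetical helper: the wording is an assumption, not the paper's prompt.
    """
    lines = []
    for name, value in record.items():
        unit = units.get(name, "")
        # Attach the unit when one is known so the model sees "112 bpm", not "112".
        lines.append(f"- {name}: {value} {unit}".rstrip())
    features = "\n".join(lines)
    return (
        "You are an experienced clinician.\n"
        f"Task: estimate the probability of {task} for the patient below.\n"
        "Patient measurements:\n"
        f"{features}\n"
        "Answer with a single number between 0 and 1."
    )


def parse_probability(reply: str) -> Optional[float]:
    """Extract the first number from a model reply; reject values outside [0, 1]."""
    match = re.search(r"\d*\.?\d+", reply)
    if match is None:
        return None
    value = float(match.group())
    return value if 0.0 <= value <= 1.0 else None


if __name__ == "__main__":
    # Hypothetical feature names and values, for illustration only.
    record = {"heart_rate": 112, "lactate": 4.1}
    units = {"heart_rate": "bpm", "lactate": "mmol/L"}
    print(build_zero_shot_prompt(record, units, "in-hospital mortality"))
    print(parse_probability("Probability: 0.82"))
```

Keeping the serialization and the answer parsing in separate functions means the same record can be sent to different models, and malformed replies surface as `None` instead of silently becoming a prediction.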

Abstract (original)

The use of Large Language Models (LLMs) in medicine is growing, but their ability to handle both structured Electronic Health Record (EHR) data and unstructured clinical notes is not well-studied. This study benchmarks various models, including GPT-based LLMs, BERT-based models, and traditional clinical predictive models, for non-generative medical tasks utilizing renowned datasets. We assessed 14 language models (9 GPT-based and 5 BERT-based) and 7 traditional predictive models using the MIMIC dataset (ICU patient records) and the TJH dataset (early COVID-19 EHR data), focusing on tasks such as mortality and readmission prediction, disease hierarchy reconstruction, and biomedical sentence matching, comparing both zero-shot and finetuned performance. Results indicated that LLMs exhibited robust zero-shot predictive capabilities on structured EHR data when using well-designed prompting strategies, frequently surpassing traditional models. However, for unstructured medical texts, LLMs did not outperform finetuned BERT models, which excelled in both supervised and unsupervised tasks. Consequently, while LLMs are effective for zero-shot learning on structured data, finetuned BERT models are more suitable for unstructured texts, underscoring the importance of selecting models based on specific task requirements and data characteristics to optimize the application of NLP technology in healthcare.

arXiv information

Authors	Yinghao Zhu, Junyi Gao, Zixiang Wang, Weibin Liao, Xiaochen Zheng, Lifang Liang, Yasha Wang, Chengwei Pan, Ewen M. Harrison, Liantao Ma
Published	2024-07-26 06:09:10+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
