Is Open-Source There Yet? A Comparative Study on Commercial and Open-Source LLMs in Their Ability to Label Chest X-Ray Reports

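Summary

Introduction: Rapid advances in large language models (LLMs) have produced numerous new open-source models alongside commercial ones.
Recent publications have explored applying GPT-4 to extract information of interest from radiology reports, but GPT-4 has not yet been compared with leading open-source models in a real-world setting.
Materials and methods: Two different, independent datasets were used.
The first consists of 540 chest X-ray reports created at Massachusetts General Hospital between July 2019 and July 2021. The second consists of 500 chest X-ray reports from the ImaGenome dataset.
The commercial models GPT-3.5 Turbo and GPT-4 from OpenAI were then compared with the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B, CheXbert and CheXpert-labeler on their ability to accurately label the presence of multiple findings in X-ray text reports under different prompting techniques.
Results: On the ImaGenome dataset, the best-performing open-source model was Llama2-70B, with micro F1-scores of 0.972 and 0.970 for zero-shot and few-shot prompts, respectively.
GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively.
On the institutional dataset, the best-performing open-source model was QWEN1.5-72B, with micro F1-scores of 0.952 and 0.965 for zero-shot and few-shot prompting, respectively.
GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively.
Conclusion: The paper shows that GPT-4 is superior to open-source models in zero-shot report labeling, but that few-shot prompting can bring open-source models on par with GPT-4.
Open-source models could therefore be a performant and privacy-preserving alternative to GPT-4 for radiology report classification.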
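
The results above are reported as micro F1-scores. As a minimal sketch of what that metric computes for multi-label report labeling, the snippet below pools true positives, false positives and false negatives across all reports and all findings before computing F1. The function name `micro_f1` and the toy label matrices are illustrative assumptions; they are not taken from the paper, and the authors' actual evaluation code may differ.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 for binary multi-label data.

    Each row is one report; each column is one finding (1 = present, 0 = absent).
    Counts are pooled over every (report, finding) pair before computing F1.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there are no positives at all.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 3 reports, 3 findings each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 1]]

# Pooled counts: TP = 4, FP = 1, FN = 1, so F1 = 8 / 10.
print(round(micro_f1(y_true, y_pred), 3))  # 0.8
```

Because counts are pooled, frequent findings weigh more heavily in the score than rare ones, which is the usual distinction between micro- and macro-averaging.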

Summary (original)

Introduction: With the rapid advances in large language models (LLMs), there have been numerous new open source as well as commercial models. While recent publications have explored GPT-4 in its application to extracting information of interest from radiology reports, there has not been a real-world comparison of GPT-4 to different leading open-source models.
Materials and Methods: Two different and independent datasets were used. The first dataset consists of 540 chest x-ray reports that were created at the Massachusetts General Hospital between July 2019 and July 2021. The second dataset consists of 500 chest x-ray reports from the ImaGenome dataset. We then compared the commercial models GPT-3.5 Turbo and GPT-4 from OpenAI to the open-source models Mistral-7B, Mixtral-8x7B, Llama2-13B, Llama2-70B, QWEN1.5-72B and CheXbert and CheXpert-labeler in their ability to accurately label the presence of multiple findings in x-ray text reports using different prompting techniques.
Results: On the ImaGenome dataset, the best performing open-source model was Llama2-70B with micro F1-scores of 0.972 and 0.970 for zero- and few-shot prompts, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.984, respectively. On the institutional dataset, the best performing open-source model was QWEN1.5-72B with micro F1-scores of 0.952 and 0.965 for zero- and few-shot prompting, respectively. GPT-4 achieved micro F1-scores of 0.975 and 0.973, respectively.
Conclusion: In this paper, we show that while GPT-4 is superior to open-source models in zero-shot report labeling, the implementation of few-shot prompting can bring open-source models on par with GPT-4. This shows that open-source models could be a performant and privacy preserving alternative to GPT-4 for the task of radiology report classification.

arXiv information

Authors	Felix J. Dorfner,Liv Jürgensen,Leonhard Donle,Fares Al Mohamad,Tobias R. Bodenmann,Mason C. Cleveland,Felix Busch,Lisa C. Adams,James Sato,Thomas Schultz,Albert E. Kim,Jameson Merkow,Keno K. Bressem,Christopher P. Bridge
Published	2024-02-19 17:23:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
