Hidden Flaws Behind Expert-Level Accuracy of GPT-4 Vision in Medicine

Summary

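Recent studies have shown that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians on medical challenge tasks.
However, these evaluations focused mainly on multiple-choice accuracy alone.
Our study extends this scope with a comprehensive analysis of GPT-4V's rationales for image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges, an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals.
The evaluation confirmed that GPT-4V outperforms human physicians in multiple-choice accuracy (88.0% vs. 77.0%, p=0.034).
GPT-4V also performs well on cases that physicians answered incorrectly, with over 80% accuracy.
However, we found that GPT-4V frequently presents flawed rationales even when its final choice is correct (27.3%), most prominently in image comprehension (21.6%).
Despite GPT-4V's high accuracy on multiple-choice questions, our findings underscore the need for further in-depth evaluation of its rationales before such models are integrated into clinical workflows.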

Summary (original)

Recent studies indicate that Generative Pre-trained Transformer 4 with Vision (GPT-4V) outperforms human physicians in medical challenge tasks. However, these evaluations primarily focused on the accuracy of multi-choice questions alone. Our study extends the current scope by conducting a comprehensive analysis of GPT-4V’s rationales of image comprehension, recall of medical knowledge, and step-by-step multimodal reasoning when solving New England Journal of Medicine (NEJM) Image Challenges – an imaging quiz designed to test the knowledge and diagnostic capabilities of medical professionals. Evaluation results confirmed that GPT-4V outperforms human physicians regarding multi-choice accuracy (88.0% vs. 77.0%, p=0.034). GPT-4V also performs well in cases where physicians incorrectly answer, with over 80% accuracy. However, we discovered that GPT-4V frequently presents flawed rationales in cases where it makes the correct final choices (27.3%), most prominent in image comprehension (21.6%). Regardless of GPT-4V’s high accuracy in multi-choice questions, our findings emphasize the necessity for further in-depth evaluations of its rationales before integrating such models into clinical workflows.
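
The distinction the abstract draws between answer accuracy and rationale quality can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `CaseEvaluation` record, its field names, the `summarize` helper, and the four toy cases are hypothetical and are not the authors' data or evaluation code. The sketch also assumes that the flawed-rationale rate is computed over correctly answered cases only, which is one reading of the abstract.

```python
from dataclasses import dataclass


@dataclass
class CaseEvaluation:
    """One quiz case as judged by a reviewer (hypothetical schema)."""
    answer_correct: bool          # final multiple-choice answer was right
    image_comprehension_ok: bool  # rationale described the image correctly
    knowledge_recall_ok: bool     # rationale recalled medical facts correctly
    reasoning_ok: bool            # step-by-step reasoning was sound

    def rationale_flawed(self) -> bool:
        # A rationale counts as flawed if any one of the three parts is wrong.
        return not (
            self.image_comprehension_ok
            and self.knowledge_recall_ok
            and self.reasoning_ok
        )


def summarize(cases):
    """Return accuracy plus flawed-rationale rates among correct answers."""
    correct = [c for c in cases if c.answer_correct]
    n_total, n_correct = len(cases), len(correct)
    return {
        "accuracy": n_correct / n_total if n_total else 0.0,
        "flawed_given_correct": (
            sum(c.rationale_flawed() for c in correct) / n_correct
            if n_correct else 0.0
        ),
        "image_flawed_given_correct": (
            sum(not c.image_comprehension_ok for c in correct) / n_correct
            if n_correct else 0.0
        ),
    }


# Toy data only: three correct answers (one with a wrong image reading)
# and one incorrect answer.
toy_cases = [
    CaseEvaluation(True, True, True, True),
    CaseEvaluation(True, False, True, True),
    CaseEvaluation(True, True, True, True),
    CaseEvaluation(False, True, True, False),
]
summary = summarize(toy_cases)
print(summary)
```

In this toy set the accuracy is 3 of 4, while 1 of the 3 correct answers rests on a flawed rationale, which is the kind of gap that accuracy alone would hide.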

arXiv information

Authors	Qiao Jin, Fangyuan Chen, Yiliang Zhou, Ziyang Xu, Justin M. Cheung, Robert Chen, Ronald M. Summers, Justin F. Rousseau, Peiyun Ni, Marc J Landsman, Sally L. Baxter, Subhi J. Al’Aref, Yijia Li, Michael F. Chiang, Yifan Peng, Zhiyong Lu
Published	2024-01-16 14:41:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
