Beyond the Hype: A dispassionate look at vision-language models in medical scenario

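Summary

Recent advances in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a wide range of tasks and attracted significant attention in the AI community.
However, their performance and reliability in specialized domains such as medicine have not yet been adequately assessed.
In particular, most evaluations concentrate too heavily on simple visual question answering (VQA) over multi-modality data and overlook the finer-grained characteristics of LVLMs.
This study introduces RadVUQA, a new Radiological Visual Understanding and Question Answering benchmark for comprehensively evaluating existing LVLMs.
RadVUQA examines LVLMs along five main dimensions:
1) Anatomical understanding: the model's ability to visually identify biological structures.
2) Multimodal comprehension: the ability to interpret linguistic and visual instructions and produce the desired outcome.
3) Quantitative and spatial reasoning: the model's spatial awareness and its proficiency in combining quantitative analysis with visual and linguistic information.
4) Physiological knowledge: the model's ability to comprehend the functions and mechanisms of organs and systems.
5) Robustness: the model's capabilities on unharmonised and synthetic data.
The results show that both general-purpose LVLMs and medical-specific LVLMs have critical deficiencies, notably weak multimodal comprehension and weak quantitative reasoning.
The findings reveal a large gap between existing LVLMs and clinicians and highlight the urgent need for more robust and intelligent LVLMs.
The code and dataset will be made available after the paper is accepted.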
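
Because RadVUQA reports on five separate dimensions, results are most informative when broken down per dimension instead of pooled into one score. The following is a minimal sketch of such a breakdown, not the authors' code, which had not been released at the time of the abstract. The dimension labels, the `accuracy_by_dimension` function, the record layout, and the exact-match scoring rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical labels for the five dimensions named in the abstract.
DIMENSIONS = (
    "anatomical_understanding",
    "multimodal_comprehension",
    "quantitative_spatial_reasoning",
    "physiological_knowledge",
    "robustness",
)


def accuracy_by_dimension(records):
    """Return accuracy per dimension.

    `records` is an iterable of (dimension, predicted, reference) tuples.
    Scoring is a case-insensitive exact match after trimming whitespace,
    which is an assumption and not the benchmark's documented metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dimension, predicted, reference in records:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        total[dimension] += 1
        if predicted.strip().lower() == reference.strip().lower():
            correct[dimension] += 1
    # Dimensions with no records are omitted to avoid dividing by zero.
    return {d: correct[d] / total[d] for d in DIMENSIONS if total[d]}
```

Keeping the scores separate makes weaknesses such as those the abstract reports for multimodal comprehension and quantitative reasoning visible, where a single pooled accuracy could hide them.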

Summary (original)

Recent advancements in Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across diverse tasks, garnering significant attention in AI communities. However, their performance and reliability in specialized domains such as medicine remain insufficiently assessed. In particular, most assessments over-concentrate in evaluating VLMs based on simple Visual Question Answering (VQA) on multi-modality data, while ignoring the in-depth characteristic of LVLMs. In this study, we introduce RadVUQA, a novel Radiological Visual Understanding and Question Answering benchmark, to comprehensively evaluate existing LVLMs. RadVUQA mainly validates LVLMs across five dimensions: 1) Anatomical understanding, assessing the models’ ability to visually identify biological structures; 2) Multimodal comprehension, which involves the capability of interpreting linguistic and visual instructions to produce desired outcomes; 3) Quantitative and spatial reasoning, evaluating the models’ spatial awareness and proficiency in combining quantitative analysis with visual and linguistic information; 4) Physiological knowledge, measuring the models’ capability to comprehend functions and mechanisms of organs and systems; and 5) Robustness, which assesses the models’ capabilities against unharmonised and synthetic data. The results indicate that both generalized LVLMs and medical-specific LVLMs have critical deficiencies with weak multimodal comprehension and quantitative reasoning capabilities. Our findings reveal the large gap between existing LVLMs and clinicians, highlighting the urgent need for more robust and intelligent LVLMs. The code and dataset will be available after the acceptance of this paper.

arXiv information

Authors	Yang Nan, Huichi Zhou, Xiaodan Xing, Guang Yang
Published	2024-08-16 12:32:44+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
