PathMMU: A Massive Multimodal Expert-Level Benchmark for Understanding and Reasoning in Pathology

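Summary

The emergence of large multimodal models (LMMs) has unlocked remarkable potential in AI, particularly in pathology.
However, the lack of specialized, high-quality benchmarks has hindered their development and accurate evaluation.
To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for LMMs.
It comprises 33,573 multimodal multiple-choice questions and 21,599 images from various sources, and each question is accompanied by an explanation of the correct answer.
The construction of PathMMU leverages the robust capabilities of GPT-4V, using approximately 30,000 collected image-caption pairs to generate Q&As.
Importantly, to maximize PathMMU's authority, we invite six pathologists to scrutinize each question in PathMMU's validation and test sets under strict standards, while also establishing an expert-level performance benchmark for PathMMU.
We conduct extensive evaluations, including zero-shot assessments of 14 open-source and 3 closed-source LMMs and of their robustness to image corruption.
We also fine-tune representative LMMs to assess their adaptability to PathMMU.
The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark: the top-performing LMM, GPT-4V, achieves only 51.7% zero-shot performance, significantly lower than the 71.4% demonstrated by human pathologists.
After fine-tuning, even open-source LMMs can surpass GPT-4V with a performance of over 60%, but they still fall short of the expertise shown by pathologists.
We hope that PathMMU will offer valuable insights and foster the development of more specialized, next-generation LLMs for pathology.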

Summary (original)

The emergence of large multimodal models has unlocked remarkable potential in AI, particularly in pathology. However, the lack of specialized, high-quality benchmark impeded their development and precise evaluation. To address this, we introduce PathMMU, the largest and highest-quality expert-validated pathology benchmark for LMMs. It comprises 33,573 multimodal multi-choice questions and 21,599 images from various sources, and an explanation for the correct answer accompanies each question. The construction of PathMMU capitalizes on the robust capabilities of GPT-4V, utilizing approximately 30,000 gathered image-caption pairs to generate Q&As. Significantly, to maximize PathMMU’s authority, we invite six pathologists to scrutinize each question under strict standards in PathMMU’s validation and test sets, while simultaneously setting an expert-level performance benchmark for PathMMU. We conduct extensive evaluations, including zero-shot assessments of 14 open-sourced and three closed-sourced LMMs and their robustness to image corruption. We also fine-tune representative LMMs to assess their adaptability to PathMMU. The empirical findings indicate that advanced LMMs struggle with the challenging PathMMU benchmark, with the top-performing LMM, GPT-4V, achieving only a 51.7% zero-shot performance, significantly lower than the 71.4% demonstrated by human pathologists. After fine-tuning, even open-sourced LMMs can surpass GPT-4V with a performance of over 60%, but still fall short of the expertise shown by pathologists. We hope that the PathMMU will offer valuable insights and foster the development of more specialized, next-generation LLMs for pathology.

arXiv information

Authors	Yuxuan Sun, Hao Wu, Chenglu Zhu, Sunyi Zheng, Qizi Chen, Kai Zhang, Yunlong Zhang, Xiaoxiao Lan, Mengyue Zheng, Jingxiong Li, Xinheng Lyu, Tao Lin, Lin Yang
Published	2024-01-29 17:59:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
