MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Summary

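We introduce MedXpertQA, a highly challenging and comprehensive benchmark for evaluating expert-level medical knowledge and advanced reasoning.
MedXpertQA contains 4,460 questions spanning 17 specialties and 11 body systems.
It comprises two subsets: Text, for text-only evaluation, and MM, for multimodal evaluation.
Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, which sets it apart from traditional medical multimodal benchmarks built from simple QA pairs generated from image captions.
To address the insufficient difficulty of existing benchmarks such as MedQA, MedXpertQA applies rigorous filtering and augmentation, and it incorporates specialty board questions to improve clinical relevance and comprehensiveness.
We perform data synthesis to mitigate the risk of data leakage and conduct multiple rounds of expert review to ensure accuracy and reliability.
We evaluate 16 leading models on MedXpertQA.
Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code.
To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.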

Abstract (original)

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 16 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.

arXiv information

Authors	Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, Bowen Zhou
Published	2025-01-30 14:07:56+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
