BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text

Summary

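Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models released at an accelerating pace.
However, evaluation of LLMs in clinical settings remains limited.
Most existing benchmarks rely on medical exam-style questions or PubMed-derived text and fail to capture the complexity of real-world electronic health record (EHR) data.
Others focus narrowly on specific application scenarios, which limits their generalizability to broader clinical use.
To address this gap, we present BRIDGE, a comprehensive multilingual benchmark of 87 tasks drawn from real-world clinical data sources across nine languages.
We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies.
Across 13,572 experiments in total, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties.
Notably, open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform newer general-purpose models.
BRIDGE and its leaderboard serve as a foundational resource and a unique reference for developing and evaluating new LLMs for real-world clinical text understanding.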
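
The abstract gives three counts (87 tasks, 52 models, 13,572 experiments) but does not say how many inference strategies were used. As a rough sanity check, the sketch below assumes that one experiment corresponds to one (task, model, strategy) combination; that definition is an assumption for illustration and is not stated in the abstract. Under it, the total divides evenly, which would be consistent with three strategies.

```python
# Counts stated in the abstract.
NUM_TASKS = 87
NUM_MODELS = 52
TOTAL_EXPERIMENTS = 13_572

# Assumption (not from the abstract): every model is run on every task
# once per inference strategy, so one experiment is one
# (task, model, strategy) combination.
runs_per_strategy = NUM_TASKS * NUM_MODELS

# If the assumption holds, the total should be an exact multiple.
implied_strategies, remainder = divmod(TOTAL_EXPERIMENTS, runs_per_strategy)

print(runs_per_strategy)   # 4524
print(implied_strategies)  # 3
print(remainder)           # 0
```

A zero remainder only shows that the numbers are compatible with this reading; the paper itself remains the authority on how experiments were counted.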

Summary (original)

Large language models (LLMs) hold great promise for medical applications and are evolving rapidly, with new models being released at an accelerated pace. However, current evaluations of LLMs in clinical contexts remain limited. Most existing benchmarks rely on medical exam-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record (EHR) data. Others focus narrowly on specific application scenarios, limiting their generalizability across broader clinical use. To address this gap, we present BRIDGE, a comprehensive multilingual benchmark comprising 87 tasks sourced from real-world clinical data sources across nine languages. We systematically evaluated 52 state-of-the-art LLMs (including DeepSeek-R1, GPT-4o, Gemini, and Llama 4) under various inference strategies. With a total of 13,572 experiments, our results reveal substantial performance variation across model sizes, languages, natural language processing tasks, and clinical specialties. Notably, we demonstrate that open-source LLMs can achieve performance comparable to proprietary models, while medically fine-tuned LLMs based on older architectures often underperform versus updated general-purpose models. The BRIDGE and its corresponding leaderboard serve as a foundational resource and a unique reference for the development and evaluation of new LLMs in real-world clinical text understanding.

arXiv information

Authors	Jiageng Wu, Bowen Gu, Ren Zhou, Kevin Xie, Doug Snyder, Yixing Jiang, Valentina Carducci, Richard Wyss, Rishi J Desai, Emily Alsentzer, Leo Anthony Celi, Adam Rodman, Sebastian Schneeweiss, Jonathan H. Chen, Santiago Romero-Brufau, Kueiyu Joshua Lin, Jie Yang
Published	2025-04-28 04:13:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
