HealthBench: Evaluating Large Language Models Towards Improved Human Health

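Summary

We introduce HealthBench, an open-source benchmark that measures the performance and safety of large language models in healthcare.
HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional.
Responses are evaluated using conversation-specific rubrics created by 262 physicians.
Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication).
Performance on HealthBench over the last two years reflects steady initial progress (GPT-3.5 Turbo scores 16%, GPT-4o 32%) followed by more rapid recent improvement (o3 scores 60%).
Smaller models have improved especially: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper.
We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%.
We hope that HealthBench grounds progress towards model development and applications that benefit human health.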

Summary (original)

We present HealthBench, an open-source benchmark measuring the performance and safety of large language models in healthcare. HealthBench consists of 5,000 multi-turn conversations between a model and an individual user or healthcare professional. Responses are evaluated using conversation-specific rubrics created by 262 physicians. Unlike previous multiple-choice or short-answer benchmarks, HealthBench enables realistic, open-ended evaluation through 48,562 unique rubric criteria spanning several health contexts (e.g., emergencies, transforming clinical data, global health) and behavioral dimensions (e.g., accuracy, instruction following, communication). HealthBench performance over the last two years reflects steady initial progress (compare GPT-3.5 Turbo’s 16% to GPT-4o’s 32%) and more rapid recent improvements (o3 scores 60%). Smaller models have especially improved: GPT-4.1 nano outperforms GPT-4o and is 25 times cheaper. We additionally release two HealthBench variations: HealthBench Consensus, which includes 34 particularly important dimensions of model behavior validated via physician consensus, and HealthBench Hard, where the current top score is 32%. We hope that HealthBench grounds progress towards model development and applications that benefit human health.
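The abstract says responses are graded against conversation-specific rubric criteria but does not say how per-criterion judgments are combined into a percentage. The following is a minimal sketch of one plausible aggregation, assuming each criterion carries a point value (positive for desired behavior, negative for undesired) and a met/not-met judgment. The `Criterion` class, its fields, the `rubric_score` function, and the example criteria are all hypothetical and are not taken from the source.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One rubric criterion. All fields are hypothetical, for illustration only."""
    description: str
    points: int  # assumed: positive for desired behavior, negative for undesired
    met: bool    # assumed: whether a grader judged the response to meet it


def rubric_score(criteria: list[Criterion]) -> float:
    """Earned points divided by the maximum achievable points, clipped to [0, 1]."""
    max_points = sum(c.points for c in criteria if c.points > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positively weighted criterion")
    earned = sum(c.points for c in criteria if c.met)
    return min(1.0, max(0.0, earned / max_points))


# Invented example rubric for a single response.
example = [
    Criterion("Advises seeking emergency care", 10, True),
    Criterion("Asks how long symptoms have lasted", 5, False),
    Criterion("Recommends a prescription dose without context", -5, False),
]
score = rubric_score(example)  # 10 earned out of 15 achievable
```

Under this assumed scheme, a benchmark-level figure such as the percentages quoted in the abstract would be an average of per-conversation scores; that averaging step is likewise an assumption here.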

arXiv information

Authors	Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, Karan Singhal
Published	2025-05-13 17:53:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
