AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments

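Summary

Diagnosing and managing a patient is a complex, sequential decision-making process in which physicians must obtain information, such as deciding which tests to perform, and act on it.
Recent advances in artificial intelligence (AI) and large language models (LLMs) are expected to have a major impact on clinical care.
However, current evaluation schemes rely too heavily on static medical question-answering benchmarks and fall short on the interactive decision-making that real clinical work requires.
Here we present AgentClinic, a multimodal benchmark that evaluates the ability of LLMs to operate as agents in simulated clinical environments.
In our benchmark, the doctor agent must uncover the patient's diagnosis through dialogue and active data collection.
We present two open benchmarks: AgentClinic-NEJM, a multimodal image and dialogue environment, and AgentClinic-MedQA, a dialogue-only environment.
We embed cognitive and implicit biases in both patient and doctor agents to emulate realistic interactions between biased agents.
We find that introducing bias leads to large reductions in the diagnostic accuracy of doctor agents, as well as reduced compliance, confidence, and willingness to attend follow-up consultations in patient agents.
Evaluating a suite of state-of-the-art LLMs, we find that several models that excel on benchmarks such as MedQA perform poorly on AgentClinic-MedQA.
We find that the LLM used for the patient agent is an important factor in performance on the AgentClinic benchmark.
We show that both too few and too many interactions reduce the diagnostic accuracy of doctor agents.
The code and data for this work are publicly available at https://AgentClinic.github.io.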
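
The interaction described above (a doctor agent questioning a patient agent under a limited number of turns, then committing to a diagnosis) can be pictured with a small toy loop. This is only an illustrative sketch, not the AgentClinic code: the class names `ScriptedPatient` and `ScriptedDoctor`, the function `run_case`, the `max_turns` parameter, and the example questions and diagnosis are all hypothetical, and scripted lookups stand in for the LLM agents. Test results, images, and biases are not modelled.

```python
from __future__ import annotations

from dataclasses import dataclass, field

PREFIX = "DIAGNOSIS: "


@dataclass
class ScriptedPatient:
    """Hypothetical stand-in for an LLM patient agent: answers from a fixed script."""
    answers: dict[str, str]

    def reply(self, question: str) -> str:
        return self.answers.get(question, "I'm not sure.")


@dataclass
class ScriptedDoctor:
    """Hypothetical stand-in for an LLM doctor agent: asks fixed questions, then diagnoses."""
    questions: list[str]
    rule: dict[str, str]  # maps a patient answer to a diagnosis
    notes: list[str] = field(default_factory=list)

    def next_action(self, turn: int) -> str:
        # Ask the next scripted question, or commit to a diagnosis when out of questions.
        if turn < len(self.questions):
            return self.questions[turn]
        return PREFIX + self.diagnose()

    def observe(self, answer: str) -> None:
        self.notes.append(answer)

    def diagnose(self) -> str:
        # Return the first diagnosis supported by a collected answer.
        for note in self.notes:
            if note in self.rule:
                return self.rule[note]
        return "unknown"


def run_case(doctor: ScriptedDoctor, patient: ScriptedPatient, max_turns: int) -> str:
    """Run one simulated consultation with a cap on the number of doctor turns."""
    for turn in range(max_turns):
        action = doctor.next_action(turn)
        if action.startswith(PREFIX):
            return action[len(PREFIX):]
        doctor.observe(patient.reply(action))
    # Turn budget exhausted: force a diagnosis from what was gathered so far.
    return doctor.diagnose()


def make_doctor() -> ScriptedDoctor:
    return ScriptedDoctor(
        questions=["Any fever?", "Any chest pain?"],
        rule={"chest pain on exertion": "angina"},  # made-up toy rule
    )


patient = ScriptedPatient(
    {"Any fever?": "no fever", "Any chest pain?": "chest pain on exertion"}
)

short = run_case(make_doctor(), patient, max_turns=1)
full = run_case(make_doctor(), patient, max_turns=3)
print(short, full)  # prints: unknown angina
```

With a budget of one turn, the toy doctor never reaches the informative question and cannot name the condition; with three turns it can. The sketch only illustrates the "too few interactions" side of the finding. It has no mechanism for the reported drop in accuracy when there are too many interactions.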

Abstract (original)

Diagnosing and managing a patient is a complex, sequential decision-making process that requires physicians to obtain information, such as which tests to perform, and to act upon it. Recent advances in artificial intelligence (AI) and large language models (LLMs) promise to profoundly impact clinical care. However, current evaluation schemes overrely on static medical question-answering benchmarks, falling short on interactive decision-making that is required in real-life clinical work. Here, we present AgentClinic: a multimodal benchmark to evaluate LLMs in their ability to operate as agents in simulated clinical environments. In our benchmark, the doctor agent must uncover the patient’s diagnosis through dialogue and active data collection. We present two open benchmarks: a multimodal image and dialogue environment, AgentClinic-NEJM, and a dialogue-only environment, AgentClinic-MedQA. We embed cognitive and implicit biases both in patient and doctor agents to emulate realistic interactions between biased agents. We find that introducing bias leads to large reductions in diagnostic accuracy of the doctor agents, as well as reduced compliance, confidence, and follow-up consultation willingness in patient agents. Evaluating a suite of state-of-the-art LLMs, we find that several models that excel in benchmarks like MedQA are performing poorly in AgentClinic-MedQA. We find that the LLM used in the patient agent is an important factor for performance in the AgentClinic benchmark. We show that both having limited interactions as well as too many interactions reduce diagnostic accuracy in doctor agents. The code and data for this work are publicly available at https://AgentClinic.github.io.

arXiv information

Authors	Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, Michael Moor
Published	2024-05-13 17:38:53+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
