Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions

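Summary

As cancer patients increasingly turn to large language models (LLMs) as a new form of internet search for medical information, it is important to assess how well these models handle complex, personalized questions.
However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical context.
In this paper, the authors first evaluate LLMs on cancer-related questions drawn from real patients, with responses reviewed by three hematology oncology physicians.
Responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, but the models often fail to recognize or address false presuppositions in the questions, which poses a risk to safe medical decision-making.
To study this limitation systematically, they introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions.
On this benchmark, no frontier LLM, including GPT-4o, Gemini-1.Pro, and Claude-3.5-Sonnet, corrects these false presuppositions more than 30% of the time.
Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions.
These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.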

Summary (original)

Cancer patients are increasingly turning to large language models (LLMs) as a new form of internet search for medical information, making it critical to assess how well these models handle complex, personalized questions. However, current medical benchmarks focus on medical exams or consumer-searched questions and do not evaluate LLMs on real patient questions with detailed clinical contexts. In this paper, we first evaluate LLMs on cancer-related questions drawn from real patients, reviewed by three hematology oncology physicians. While responses are generally accurate, with GPT-4-Turbo scoring 4.13 out of 5, the models frequently fail to recognize or address false presuppositions in the questions, posing risks to safe medical decision-making. To study this limitation systematically, we introduce Cancer-Myth, an expert-verified adversarial dataset of 585 cancer-related questions with false presuppositions. On this benchmark, no frontier LLM, including GPT-4o, Gemini-1.Pro, and Claude-3.5-Sonnet, corrects these false presuppositions more than 30% of the time. Even advanced medical agentic methods do not prevent LLMs from ignoring false presuppositions. These findings expose a critical gap in the clinical reliability of LLMs and underscore the need for more robust safeguards in medical AI systems.

arXiv information

Authors	Wang Bill Zhu, Tianqi Chen, Ching Ying Lin, Jade Law, Mazen Jizzini, Jorge J. Nieva, Ruishan Liu, Robin Jia
Published	2025-04-15 16:37:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
