MedHallBench: A New Benchmark for Assessing Hallucination in Medical Large Language Models

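Summary

Medical large language models (MLLMs) show promise in healthcare applications, but their tendency to hallucinate, that is, to generate medically implausible or inaccurate information, poses a serious risk to patient care. This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs. Our methodology integrates expert-validated medical case scenarios with established medical databases to build a robust evaluation dataset. The framework's measurement system combines automated ACHMI (Automatic Caption Hallucination Measurement in Medical Imaging) scoring with rigorous evaluation by clinical experts, and it uses reinforcement learning methods to automate annotation. Through a reinforcement learning from human feedback (RLHF) training pipeline optimized for medical applications, MedHallBench enables thorough evaluation of MLLMs across diverse clinical contexts while maintaining stringent accuracy standards. We ran comparative experiments across a range of models and used the benchmark to establish baselines for widely adopted large language models (LLMs). The results indicate that ACHMI captures the effects of hallucinations in a more nuanced way than traditional metrics, which highlights its advantage for hallucination assessment. This work establishes a foundational framework for improving the reliability of MLLMs in healthcare settings and presents practical strategies for addressing AI hallucinations in medical applications.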

Summary (original)

Medical Large Language Models (MLLMs) have demonstrated potential in healthcare applications, yet their propensity for hallucinations — generating medically implausible or inaccurate information — presents substantial risks to patient care. This paper introduces MedHallBench, a comprehensive benchmark framework for evaluating and mitigating hallucinations in MLLMs. Our methodology integrates expert-validated medical case scenarios with established medical databases to create a robust evaluation dataset. The framework employs a sophisticated measurement system that combines automated ACHMI (Automatic Caption Hallucination Measurement in Medical Imaging) scoring with rigorous clinical expert evaluations and utilizes reinforcement learning methods to achieve automatic annotation. Through an optimized reinforcement learning from human feedback (RLHF) training pipeline specifically designed for medical applications, MedHallBench enables thorough evaluation of MLLMs across diverse clinical contexts while maintaining stringent accuracy standards. We conducted comparative experiments involving various models, utilizing the benchmark to establish a baseline for widely adopted large language models (LLMs). Our findings indicate that ACHMI provides a more nuanced understanding of the effects of hallucinations compared to traditional metrics, thereby highlighting its advantages in hallucination assessment. This research establishes a foundational framework for enhancing MLLMs’ reliability in healthcare settings and presents actionable strategies for addressing the critical challenge of AI hallucinations in medical applications.

arXiv information

Authors	Kaiwen Zuo, Yirui Jiang
Published	2025-01-03 00:16:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
