MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models

Summary

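Artificial intelligence (AI) has shown great potential in healthcare, particularly in disease diagnosis and treatment planning. Recent advances in Medical Large Vision-Language Models (Med-LVLMs) have opened new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as ways to address this problem. Fine-tuning, however, is limited by the amount of high-quality data available and by distribution shift between training and deployment data. RAG is lightweight and effective, but existing RAG-based approaches do not generalize well enough across medical domains and can cause misalignment, both between modalities and between the model and the ground truth. This paper proposes MMed-RAG, a versatile multimodal RAG system designed to improve the factuality of Med-LVLMs. The approach introduces a domain-aware retrieval mechanism, an adaptive method for selecting retrieved contexts, and a provable RAG-based preference fine-tuning strategy. These techniques make the RAG process sufficiently general and reliable, and significantly improve alignment when retrieved contexts are introduced. Experiments on five medical datasets (covering radiology, ophthalmology, and pathology) for medical VQA and report generation show that MMed-RAG achieves an average improvement of 43.8% in the factual accuracy of Med-LVLMs. The data and code are available at https://github.com/richard-peng-xia/MMed-RAG.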

Summary (original)

Artificial Intelligence (AI) has demonstrated significant potential in healthcare, particularly in disease diagnosis and treatment planning. Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. However, these models often suffer from factual hallucination, which can lead to incorrect diagnoses. Fine-tuning and retrieval-augmented generation (RAG) have emerged as methods to address these issues. However, the amount of high-quality data and distribution shifts between training data and deployment data limit the application of fine-tuning methods. Although RAG is lightweight and effective, existing RAG-based approaches are not sufficiently general to different medical domains and can potentially cause misalignment issues, both between modalities and between the model and the ground truth. In this paper, we propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs. Our approach introduces a domain-aware retrieval mechanism, an adaptive retrieved contexts selection method, and a provable RAG-based preference fine-tuning strategy. These innovations make the RAG process sufficiently general and reliable, significantly improving alignment when introducing retrieved contexts. Experimental results across five medical datasets (involving radiology, ophthalmology, pathology) on medical VQA and report generation demonstrate that MMed-RAG can achieve an average improvement of 43.8% in the factual accuracy of Med-LVLMs. Our data and code are available in https://github.com/richard-peng-xia/MMed-RAG.

arXiv information

Authors	Peng Xia, Kangyu Zhu, Haoran Li, Tianze Wang, Weijia Shi, Sheng Wang, Linjun Zhang, James Zou, Huaxiu Yao
Published	2025-03-03 03:08:28+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
