Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering

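Abstract

Retrieval-augmented generation (RAG) has emerged as a promising approach for improving the performance of large language models (LLMs) on knowledge-intensive tasks, such as those in the medical domain.
However, because the medical domain is sensitive, it requires systems that are fully accurate and trustworthy.
Existing RAG benchmarks focus mainly on the standard retrieve-then-answer setting and overlook many practical scenarios that measure important aspects of a reliable medical system.
This paper addresses that gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in RAG settings that covers these situations, including sufficiency, integration, and robustness.
To test the ability of LLMs to handle these specific scenarios, we introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which adds various supplementary elements to four medical QA datasets.
Using MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions.
Our experimental results reveal that current models have limited ability to handle noise and misinformation in retrieved documents.
We further analyze the LLMs' reasoning processes and provide valuable insights and future directions for developing RAG systems in this critical medical domain.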

Abstract (original)

Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) in knowledge-intensive tasks such as those from the medical domain. However, the sensitive nature of the medical domain necessitates a completely accurate and trustworthy system. While existing RAG benchmarks primarily focus on the standard retrieve-answer setting, they overlook many practical scenarios that measure crucial aspects of a reliable medical system. This paper addresses this gap by providing a comprehensive evaluation framework for medical question-answering (QA) systems in a RAG setting for these situations, including sufficiency, integration, and robustness. We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets for testing LLMs’ ability to handle these specific scenarios. Utilizing MedRGB, we conduct extensive evaluations of both state-of-the-art commercial LLMs and open-source models across multiple retrieval conditions. Our experimental results reveal current models’ limited ability to handle noise and misinformation in the retrieved documents. We further analyze the LLMs’ reasoning processes to provide valuable insights and future directions for developing RAG systems in this critical medical domain.

arXiv information

Authors	Nghia Trung Ngo, Chien Van Nguyen, Franck Dernoncourt, Thien Huu Nguyen
Published	2024-11-14 06:19:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
