RAG-Check: Evaluating Multimodal Retrieval Augmented Generation Performance

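Summary

Retrieval-augmented generation (RAG) improves large language models (LLMs) by using external knowledge to guide response generation, which reduces hallucinations.
However, RAG, and multi-modal RAG in particular, can introduce new sources of hallucination: (i) the retrieval process may select irrelevant pieces (e.g., documents, images) from the database as raw context, and (ii) retrieved images are either converted into text-based context by vision-language models (VLMs) or used directly by multi-modal language models (MLLMs) such as GPT-4o, either of which may hallucinate.
To address this, the authors propose a new framework that evaluates the reliability of multi-modal RAG with two performance measures: (i) the relevancy score (RS), which assesses how relevant the retrieved entries are to the query, and (ii) the correctness score (CS), which assesses the accuracy of the generated response.
The RS and CS models are trained on a ChatGPT-derived database and samples from human evaluators.
Results show that both models reach about 88% accuracy on test data.
In addition, the authors construct a human-annotated database of 5,000 samples that rates the relevancy of retrieved pieces and the correctness of response statements.
The RS model agrees with human preferences 20% more often than CLIP in retrieval, and the CS model matches human preferences about 91% of the time.
Finally, RS and CS are used to assess the selection and generation performance of various RAG systems.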

Abstract (original)

Retrieval-augmented generation (RAG) improves large language models (LLMs) by using external knowledge to guide response generation, reducing hallucinations. However, RAG, particularly multi-modal RAG, can introduce new hallucination sources: (i) the retrieval process may select irrelevant pieces (e.g., documents, images) as raw context from the database, and (ii) retrieved images are processed into text-based context via vision-language models (VLMs) or directly used by multi-modal language models (MLLMs) like GPT-4o, which may hallucinate. To address this, we propose a novel framework to evaluate the reliability of multi-modal RAG using two performance measures: (i) the relevancy score (RS), assessing the relevance of retrieved entries to the query, and (ii) the correctness score (CS), evaluating the accuracy of the generated response. We train RS and CS models using a ChatGPT-derived database and human evaluator samples. Results show that both models achieve ~88% accuracy on test data. Additionally, we construct a 5000-sample human-annotated database evaluating the relevancy of retrieved pieces and the correctness of response statements. Our RS model aligns with human preferences 20% more often than CLIP in retrieval, and our CS model matches human preferences ~91% of the time. Finally, we assess various RAG systems’ selection and generation performances using RS and CS.
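
The abstract does not say how the two scores are aggregated or how agreement with human preferences is computed. As a rough illustration only, the sketch below assumes that an RS model yields one score in [0, 1] per retrieved piece, that a CS model yields one correct/incorrect flag per response statement, and that human agreement is the fraction of comparisons where the model and the human pick the same item. The `RagSample` record and all function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RagSample:
    """One evaluated query (hypothetical record layout, not from the paper)."""
    relevancy_scores: Sequence[float]  # assumed: one score in [0, 1] per retrieved piece
    statement_correct: Sequence[bool]  # assumed: one flag per statement in the response


def mean_relevancy(sample: RagSample) -> float:
    """Average relevancy over the retrieved pieces of one query (0.0 if none)."""
    scores = sample.relevancy_scores
    return sum(scores) / len(scores) if scores else 0.0


def correctness_rate(sample: RagSample) -> float:
    """Fraction of response statements flagged as correct (0.0 if none)."""
    flags = sample.statement_correct
    return sum(flags) / len(flags) if flags else 0.0


def agreement_rate(model_choices: Sequence[int], human_choices: Sequence[int]) -> float:
    """Fraction of comparisons where the model picks the same item as the human."""
    if len(model_choices) != len(human_choices):
        raise ValueError("choice lists must have the same length")
    if not model_choices:
        return 0.0
    matches = sum(m == h for m, h in zip(model_choices, human_choices))
    return matches / len(model_choices)
```

For example, `correctness_rate(RagSample([0.9, 0.2, 0.7], [True, True, False, True]))` gives 0.75, since three of the four statements are flagged correct. Comparing two retrievers against the same human choices with `agreement_rate` is one simple way to express a statement such as "agrees with human preferences more often than CLIP", although the paper's exact protocol may differ.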

arXiv information

Authors	Matin Mortaheb, Mohammad A. Amir Khojastepour, Srimat T. Chakradhar, Sennur Ulukus
Published	2025-01-07 18:52:05+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
