MEMERAG: A Multilingual End-to-End Meta-Evaluation Benchmark for Retrieval Augmented Generation

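Summary

Automatic evaluation of retrieval augmented generation (RAG) systems depends on fine-grained dimensions such as faithfulness and relevance, as judged by expert human annotators.
Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement.
However, existing benchmarks either focus mainly on English or rely on translated data, which fails to capture cultural nuances.
A native-language approach gives a better representation of the end-user experience.
This work develops a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG).
The benchmark builds on the popular MIRACL dataset: it takes native-language questions and generates responses with diverse large language models (LLMs), which expert annotators then assess for faithfulness and relevance.
The authors describe the annotation process and show that it achieves high inter-annotator agreement.
They then analyse how the answer-generating LLMs perform across languages according to the human evaluators.
Finally, they apply the dataset to its main use case, benchmarking multilingual automatic evaluators (LLM-as-a-judge).
They show that the benchmark can reliably identify improvements offered by advanced prompting techniques and by LLMs.
The dataset is available at https://github.com/amazon-science/memerag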

Abstract (original)

Automatic evaluation of retrieval augmented generation (RAG) systems relies on fine-grained dimensions like faithfulness and relevance, as judged by expert human annotators. Meta-evaluation benchmarks support the development of automatic evaluators that correlate well with human judgement. However, existing benchmarks predominantly focus on English or use translated data, which fails to capture cultural nuances. A native approach provides a better representation of the end user experience. In this work, we develop a Multilingual End-to-end Meta-Evaluation RAG benchmark (MEMERAG). Our benchmark builds on the popular MIRACL dataset, using native-language questions and generating responses with diverse large language models (LLMs), which are then assessed by expert annotators for faithfulness and relevance. We describe our annotation process and show that it achieves high inter-annotator agreement. We then analyse the performance of the answer-generating LLMs across languages as per the human evaluators. Finally we apply the dataset to our main use-case which is to benchmark multilingual automatic evaluators (LLM-as-a-judge). We show that our benchmark can reliably identify improvements offered by advanced prompting techniques and LLMs. Our dataset is available at https://github.com/amazon-science/MEMERAG
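The abstract does not say which agreement or correlation measures the benchmark uses, so the following is only a minimal sketch of what meta-evaluation can look like in practice: comparing an automatic judge's labels with human labels, language by language. Cohen's kappa is an assumed choice of measure here, and the language codes, label names, records, and function names are all made up for illustration; none of them come from the MEMERAG dataset.

```python
from collections import defaultdict


def cohen_kappa(a, b):
    """Cohen's kappa between two equal-length sequences of labels."""
    if len(a) != len(b) or not a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    labels = set(a) | set(b)
    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


def kappa_by_language(records):
    """records: iterable of (language, human_label, judge_label) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for lang, human, judge in records:
        grouped[lang][0].append(human)
        grouped[lang][1].append(judge)
    return {lang: cohen_kappa(h, j) for lang, (h, j) in grouped.items()}


# Hypothetical toy data, not taken from the benchmark.
records = [
    ("de", "faithful", "faithful"),
    ("de", "unfaithful", "unfaithful"),
    ("de", "faithful", "unfaithful"),
    ("de", "unfaithful", "unfaithful"),
    ("es", "faithful", "faithful"),
    ("es", "unfaithful", "unfaithful"),
]
scores = kappa_by_language(records)
print(scores)
```

Reporting the score per language rather than pooled keeps a judge that is strong in one language from masking weak agreement in another, which is the kind of difference a multilingual benchmark is meant to expose.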

arXiv information

Authors	María Andrea Cruz Blandón, Jayasimha Talur, Bruno Charron, Dong Liu, Saab Mansour, Marcello Federico
Published	2025-04-29 07:28:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
