Support Evaluation for the TREC 2024 RAG Track: Comparing Human versus LLM Judges

Abstract

Retrieval-augmented generation (RAG) enables large language models (LLMs) to generate answers with citations from source documents containing ‘ground truth’, thereby reducing system hallucinations. A crucial factor in RAG evaluation is ‘support’, whether the information in the cited documents supports the answer. To this end, we conducted a large-scale comparative study of 45 participant submissions on 36 topics to the TREC 2024 RAG Track, comparing an automatic LLM judge (GPT-4o) against human judges for support assessment. We considered two conditions: (1) fully manual assessments from scratch and (2) manual assessments with post-editing of LLM predictions. Our results indicate that for 56% of the manual from-scratch assessments, human and GPT-4o predictions match perfectly (on a three-level scale), increasing to 72% in the manual with post-editing condition. Furthermore, by carefully analyzing the disagreements in an unbiased study, we found that an independent human judge correlates better with GPT-4o than a human judge, suggesting that LLM judges can be a reliable alternative for support assessment. To conclude, we provide a qualitative analysis of human and GPT-4o errors to help guide future iterations of support assessment.
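The "match perfectly (on a three-level scale)" figures above are exact-agreement rates between two sets of labels. As a minimal sketch of how such a rate could be computed, assuming hypothetical label names and toy data that are not taken from the track:

```python
from collections import Counter

# Hypothetical names for a three-level support scale (an assumption;
# the abstract only says the scale has three levels).
LEVELS = ("no_support", "partial_support", "full_support")


def exact_match_rate(human, llm):
    """Return the fraction of items on which both judges give the same label."""
    if len(human) != len(llm):
        raise ValueError("label lists must have the same length")
    if not human:
        raise ValueError("label lists must not be empty")
    matches = sum(h == m for h, m in zip(human, llm))
    return matches / len(human)


def confusion_counts(human, llm):
    """Count (human_label, llm_label) pairs to show where the judges disagree."""
    return Counter(zip(human, llm))


# Toy labels for illustration only; these are not results from the study.
human_labels = ["full_support", "partial_support", "no_support",
                "full_support", "partial_support"]
llm_labels = ["full_support", "full_support", "no_support",
              "full_support", "no_support"]

rate = exact_match_rate(human_labels, llm_labels)
print(f"exact match: {rate:.0%}")
for (h, m), n in sorted(confusion_counts(human_labels, llm_labels).items()):
    print(f"human={h:16s} llm={m:16s} count={n}")
```

Exact match is the strictest agreement measure; the pair counts show whether disagreements fall between adjacent levels or opposite ends of the scale.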

arXiv information

Authors	Nandan Thakur, Ronak Pradeep, Shivani Upadhyay, Daniel Campos, Nick Craswell, Jimmy Lin
Published	2025-04-21 16:20:43+00:00
