LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA

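Summary

Extractive reading comprehension question answering (QA) datasets are usually evaluated with Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance.
Following their success, large language models (LLMs) have been adopted for a wide range of tasks, including serving as judges (LLM-as-a-judge).
In this paper, we use LLM-as-a-judge to reassess the performance of QA models across four reading comprehension QA datasets.
To evaluate how effective LLM-as-a-judge is on these tasks, we examine different families of LLMs and various answer types.
Our results show that LLM-as-a-judge correlates highly with human judgments and can replace the traditional EM/F1 metrics.
With LLM-as-a-judge, the correlation with human judgments improves substantially, from 0.17 (EM) and 0.36 (F1-score) to 0.85.
These findings confirm that the EM and F1 metrics underestimate the true performance of QA models.
LLM-as-a-judge is not perfect on more difficult answer types (e.g., job), but it still outperforms EM/F1. In addition, we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.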

Abstract (original)

Extractive reading comprehension question answering (QA) datasets are typically evaluated using Exact Match (EM) and F1-score, but these metrics often fail to fully capture model performance. With the success of large language models (LLMs), they have been employed in various tasks, including serving as judges (LLM-as-a-judge). In this paper, we reassess the performance of QA models using LLM-as-a-judge across four reading comprehension QA datasets. We examine different families of LLMs and various answer types to evaluate the effectiveness of LLM-as-a-judge in these tasks. Our results show that LLM-as-a-judge is highly correlated with human judgments and can replace traditional EM/F1 metrics. By using LLM-as-a-judge, the correlation with human judgments improves significantly, from 0.17 (EM) and 0.36 (F1-score) to 0.85. These findings confirm that EM and F1 metrics underestimate the true performance of the QA models. While LLM-as-a-judge is not perfect for more difficult answer types (e.g., job), it still outperforms EM/F1, and we observe no bias issues, such as self-preference, when the same model is used for both the QA and judgment tasks.
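
For readers unfamiliar with the two baseline metrics named in the abstract, the following is a minimal Python sketch of SQuAD-style Exact Match and token-level F1. It is not the authors' evaluation code: the normalization steps (lowercasing, stripping punctuation and English articles) and the example strings are assumptions chosen only for illustration.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, split on whitespace.

    Assumption: this mirrors the common SQuAD-style normalization; the
    datasets in the paper may normalize answers differently.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(prediction, gold):
    """Return 1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Token-level F1 between the normalized prediction and gold answer."""
    pred_tokens = normalize(prediction)
    gold_tokens = normalize(gold)
    if not pred_tokens or not gold_tokens:
        # If either side is empty, score 1.0 only when both are empty.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: the prediction contains the gold answer plus extra context.
gold = "Barack Obama"
prediction = "President Barack Obama of the United States"
em = exact_match(prediction, gold)
f1 = f1_score(prediction, gold)
print(f"EM={em:.2f} F1={f1:.2f}")
```

In this made-up case, a prediction that a human reader would probably accept scores 0 on EM and only 0.5 on F1, which is the kind of underestimation the abstract attributes to these metrics.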

arXiv information

Authors	Xanh Ho, Jiahao Huang, Florian Boudin, Akiko Aizawa
Published	2025-04-16 11:08:46+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
