JuStRank: Benchmarking LLM Judges for System Ranking

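Summary

Given the rapid progress of generative AI, the many available models and configurations need to be compared and selected systematically.
The scale and versatility of such evaluations make LLM-based judges a compelling solution to this challenge.
Crucially, this approach first requires validating the quality of the LLM judge itself.
Previous work has focused on instance-based evaluation of LLM judges, in which a judge is assessed over a set of responses or response pairs while remaining agnostic to the systems that produced them.
We argue that this setting overlooks important factors that affect system-level ranking, such as a judge's positive or negative bias toward particular systems.
To address this gap, we conduct the first large-scale study of LLM judges as system rankers.
System scores are produced by aggregating judgment scores over multiple outputs of each system, and judge quality is assessed by comparing the resulting system ranking with a human-based ranking.
Beyond overall judge assessment, our analysis offers a fine-grained characterization of judge behavior, including decisiveness and bias.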
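
The system-level procedure described above (aggregate a judge's per-response scores into one score per system, rank the systems, then compare that ranking with a human-based one) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not say which aggregation or agreement measure is used, so the mean and Kendall's tau are assumptions here, and all system names and scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def system_scores(judgments):
    """Aggregate per-response judge scores into one score per system.

    Assumption: aggregation by arithmetic mean.
    """
    return {system: mean(scores) for system, scores in judgments.items()}


def ranking(scores):
    """Return system names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a over the systems present in both score tables.

    Assumption: Kendall's tau as the agreement measure. Tied pairs count
    as neither concordant nor discordant.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / pairs if pairs else 0.0


# Hypothetical judge scores: one list of per-response scores per system.
judge_judgments = {
    "system_a": [0.9, 0.8, 0.7],
    "system_b": [0.6, 0.5, 0.7],
    "system_c": [0.2, 0.4, 0.3],
    "system_d": [0.5, 0.6, 0.4],
}
# Hypothetical human-based system scores (higher is better).
human_scores = {"system_a": 4.0, "system_b": 2.0, "system_c": 1.0, "system_d": 3.0}

judge_scores = system_scores(judge_judgments)
judge_ranking = ranking(judge_scores)
human_ranking = ranking(human_scores)
agreement = kendall_tau(judge_scores, human_scores)
```

In this toy data the judge and the human-based ranking disagree only on the order of `system_b` and `system_d`, so `agreement` is positive but below 1. A judge that consistently favored or penalized one system would shift that system's aggregated score, which is the kind of system-level effect an instance-level evaluation would not surface.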

Summary (original)

Given the rapid progress of generative AI, there is a pressing need to systematically compare and choose between the numerous models and configurations available. The scale and versatility of such evaluations make the use of LLM-based judges a compelling solution for this challenge. Crucially, this approach requires first to validate the quality of the LLM judge itself. Previous work has focused on instance-based assessment of LLM judges, where a judge is evaluated over a set of responses, or response pairs, while being agnostic to their source systems. We argue that this setting overlooks critical factors affecting system-level ranking, such as a judge’s positive or negative bias towards certain systems. To address this gap, we conduct the first large-scale study of LLM judges as system rankers. System scores are generated by aggregating judgment scores over multiple system outputs, and the judge’s quality is assessed by comparing the resulting system ranking to a human-based ranking. Beyond overall judge assessment, our analysis provides a fine-grained characterization of judge behavior, including their decisiveness and bias.

arXiv information

Authors	Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai
Published	2025-06-10 17:54:35+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
