J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Summary

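To keep up with the accelerating pace of large language model (LLM) development, the evaluation of model outputs has shifted from time-consuming human evaluation to automatic evaluation, in which LLMs themselves assess and critique the outputs of other models.
LLM-as-judge models are a class of generative evaluators that do well in relatively simple domains such as chat quality, but struggle in reasoning-intensive domains where model responses contain more substantive and challenging content.
To address these shortcomings of existing judges, the authors explore training judges with reinforcement learning (RL).
They make three key contributions. (1) They propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which makes it possible to train a judge that is robust to the positional biases that arise in more complex evaluation settings.
(2) They introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work.
(3) They train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o by 6.7% and the next best small judge by 9%, and that matches or exceeds the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.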
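The abstract does not spell out how EIS-GRPO works, so the following is only a rough sketch under a stated assumption: that rollouts sampled from equivalent prompts (for example, the same two responses shown in order A/B and in order B/A) are pooled into a single group before the usual GRPO-style advantage normalization. The function names, the `"AB"`/`"BA"` labels, and the reward values are hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: each reward minus the group mean,
    divided by the group standard deviation (plus eps for stability)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def pooled_advantages(rewards_by_order, eps=1e-6):
    """ASSUMPTION (not confirmed by the abstract): rollouts from equivalent
    prompts, here the two response orderings, share one normalization group.

    rewards_by_order maps a hypothetical order label to that order's rewards.
    """
    pooled = [r for rewards in rewards_by_order.values() for r in rewards]
    advantages = group_relative_advantages(pooled, eps)

    # Split the pooled advantages back out per ordering.
    result, start = {}, 0
    for order, rewards in rewards_by_order.items():
        result[order] = advantages[start:start + len(rewards)]
        start += len(rewards)
    return result


# Hypothetical judge that is always right in order A/B and always wrong in B/A.
rewards = {"AB": [1.0, 1.0], "BA": [0.0, 0.0]}

per_order = {k: group_relative_advantages(v) for k, v in rewards.items()}
pooled = pooled_advantages(rewards)
```

In this toy case, normalizing each ordering separately gives every rollout an advantage of zero, because rewards within each ordering are identical, so a position-dependent failure produces no learning signal. Pooling the two orderings gives positive advantages to the correct A/B rollouts and negative advantages to the incorrect B/A rollouts, which is one plausible way a judge could be pushed toward order-invariant verdicts.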

Abstract (original)

To keep pace with the increasing pace of large language models (LLM) development, model output evaluation has transitioned away from time-consuming human evaluation to automatic evaluation, where LLMs themselves are tasked with assessing and critiquing other model outputs. LLM-as-judge models are a class of generative evaluators that excel in evaluating relatively simple domains, like chat quality, but struggle in reasoning intensive domains where model responses contain more substantive and challenging content. To remedy existing judge shortcomings, we explore training judges with reinforcement learning (RL). We make three key contributions: (1) We propose the Equivalent Initial State Group Relative Policy Optimization (EIS-GRPO) algorithm, which allows us to train our judge to be robust to positional biases that arise in more complex evaluation settings. (2) We introduce ReasoningJudgeBench, a benchmark that evaluates judges in diverse reasoning settings not covered by prior work. (3) We train Judge for Reasoning (J4R), a 7B judge trained with EIS-GRPO that outperforms GPT-4o and the next best small judge by 6.7% and 9%, matching or exceeding the performance of larger GRPO-trained judges on both JudgeBench and ReasoningJudgeBench.

arXiv information

Authors	Austin Xu,Yilun Zhou,Xuan-Phi Nguyen,Caiming Xiong,Shafiq Joty
Published	2025-05-20 14:57:18+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
