J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

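Summary

Progress in AI is bottlenecked by the quality of evaluation, and strong LLM-as-a-Judge models have proved to be a core solution.
Better judgment is enabled by stronger chain-of-thought reasoning, which motivates the search for the best recipes for training such models to think.
In this work we introduce J1, a reinforcement learning approach to training such models.
Our method converts both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias.
In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1.
J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model.
We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content.
We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.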

Summary (original)

The progress of AI is bottlenecked by the quality of evaluation, and powerful LLM-as-a-Judge models have proved to be a core solution. Improved judgment ability is enabled by stronger chain-of-thought reasoning, motivating the need to find the best recipes for training such models to think. In this work we introduce J1, a reinforcement learning approach to training such models. Our method converts both verifiable and non-verifiable prompts to judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. In particular, our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model. We provide analysis and ablations comparing Pairwise-J1 vs Pointwise-J1 models, offline vs online training recipes, reward strategies, seed prompts, and variations in thought length and content. We find that our models make better judgments by learning to outline evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
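To make the idea of a "verifiable reward" for a judgment task more concrete, here is a minimal Python sketch. It is not the paper's implementation: the abstract does not spell out the reward design, so this sketch assumes a pairwise setup in which the better response is known, and it assumes the bias being mitigated is position bias (a judge favoring whichever response appears first). The names `PairwiseExample`, `verdict_reward`, `consistency_reward`, and the two toy judges are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PairwiseExample:
    """A judgment task built from a prompt and two candidate responses."""
    prompt: str
    chosen: str    # response assumed to be the better one
    rejected: str  # response assumed to be the worse one


# Hypothetical judge interface: (prompt, response_a, response_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


def verdict_reward(judge: Judge, ex: PairwiseExample, chosen_first: bool) -> float:
    """Return 1.0 if the judge picks the better response in one ordering."""
    if chosen_first:
        return 1.0 if judge(ex.prompt, ex.chosen, ex.rejected) == "A" else 0.0
    return 1.0 if judge(ex.prompt, ex.rejected, ex.chosen) == "B" else 0.0


def consistency_reward(judge: Judge, ex: PairwiseExample) -> float:
    """Return 1.0 only if the judge is correct in BOTH orderings.

    Assumption: rewarding order-invariant correctness is one way to
    discourage position bias; the source does not confirm this exact rule.
    """
    correct_ab = verdict_reward(judge, ex, chosen_first=True)
    correct_ba = verdict_reward(judge, ex, chosen_first=False)
    return 1.0 if correct_ab == 1.0 and correct_ba == 1.0 else 0.0


def always_first(prompt: str, a: str, b: str) -> str:
    """Toy judge with pure position bias: always prefers the first response."""
    return "A"


def longer_wins(prompt: str, a: str, b: str) -> str:
    """Toy judge that is order-invariant: prefers the longer response."""
    return "A" if len(a) >= len(b) else "B"


example = PairwiseExample(prompt="What is 2 + 2?", chosen="2 + 2 = 4.", rejected="5")
biased_score = consistency_reward(always_first, example)
stable_score = consistency_reward(longer_wins, example)
```

In this toy setup the position-biased judge is right in one ordering but earns no consistency reward, while the order-invariant judge earns the full reward. A real training loop would replace the toy judges with model generations that include a thinking trace before the verdict.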

arXiv information

Authors	Chenxi Whitehouse, Tianlu Wang, Ping Yu, Xian Li, Jason Weston, Ilia Kulikov, Swarnadeep Saha
Published	2025-05-15 14:05:15+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
