The Lessons of Developing Process Reward Models in Mathematical Reasoning

Summary

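Process Reward Models (PRMs) have emerged as a promising approach to process supervision for mathematical reasoning in Large Language Models (LLMs); they aim to identify and mitigate intermediate errors in the reasoning process.
However, developing effective PRMs faces significant challenges, particularly in data annotation and evaluation methodology.
Through extensive experiments, this paper demonstrates that the Monte Carlo (MC) estimation-based data synthesis commonly used for PRMs typically yields inferior performance and generalization compared with LLM-as-a-judge and human annotation.
MC estimation relies on completion models to evaluate the correctness of the current step, which leads to inaccurate step verification.
The paper further identifies potential biases in the conventional Best-of-N (BoN) evaluation strategy for PRMs: (1) unreliable policy models generate responses with correct answers but flawed processes, creating a misalignment between the BoN evaluation criteria and the PRM objective of process verification;
(2) PRMs' tolerance of such responses leads to inflated BoN scores;
(3) existing PRMs concentrate a significant proportion of their minimum scores on the final answer step, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs.
To address these challenges, the authors develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge, and they advocate a more comprehensive evaluation framework that combines response-level and step-level metrics.
Based on these mechanisms, they significantly improve both model performance and data efficiency in BoN evaluation and in the step-wise error identification task.
Finally, they release a new state-of-the-art PRM that outperforms existing open-source alternatives and provide practical guidelines for future research on building process supervision models.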
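
The data-annotation idea in the summary can be made concrete with a small sketch. The abstract only says that consensus filtering integrates MC estimation with LLM-as-a-judge; the specific rule used here (keep a sample only when both methods point to the same first erroneous step), the zero threshold, and all names such as `rollout_is_correct` and `judge_first_error` are assumptions made for illustration, not the authors' implementation.

```python
from typing import Callable, Dict, List, Optional


def mc_step_scores(
    steps: List[str],
    rollout_is_correct: Callable[[List[str]], bool],
    num_rollouts: int = 8,
) -> List[float]:
    """Estimate, for each step prefix, the fraction of completions that reach
    the correct final answer. `rollout_is_correct` is a hypothetical callback
    that samples one completion from a completion model and checks its answer.
    """
    scores = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        hits = sum(rollout_is_correct(prefix) for _ in range(num_rollouts))
        scores.append(hits / num_rollouts)
    return scores


def first_error_from_mc(scores: List[float], threshold: float = 0.0) -> Optional[int]:
    """Return the index of the first step whose MC score is at or below the
    threshold, or None if every step passes. The threshold is an assumption."""
    for i, score in enumerate(scores):
        if score <= threshold:
            return i
    return None


def consensus_filter(
    samples: List[Dict],
    judge_first_error: Callable[[Dict], Optional[int]],
) -> List[Dict]:
    """Keep only samples where MC estimation and the (hypothetical) LLM judge
    agree on the first erroneous step, or agree that there is none."""
    kept = []
    for sample in samples:
        mc_error = first_error_from_mc(sample["mc_scores"])
        if mc_error == judge_first_error(sample):
            kept.append(sample)
    return kept
```

The sketch also shows why MC labels can be noisy: a step's score depends entirely on whether the completion model happens to recover the right answer, not on whether the step itself is valid.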

Abstract (original)

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
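
The BoN bias described in the abstract can be illustrated with a toy scorer. This is a minimal sketch under stated assumptions: aggregating step scores with `min` is one common choice rather than something the abstract specifies, and the `answer_correct` and `process_correct` fields are hypothetical labels introduced only for this example.

```python
from typing import Dict, List, Sequence


def response_score(step_scores: Sequence[float]) -> float:
    """Collapse per-step PRM scores into one response score.
    Using the minimum is an assumption made for this sketch."""
    return min(step_scores)


def best_of_n(candidates: List[Dict]) -> Dict:
    """Pick the candidate response with the highest aggregated PRM score."""
    return max(candidates, key=lambda c: response_score(c["step_scores"]))


def bon_accuracy(problems: List[List[Dict]]) -> float:
    """Standard BoN metric: only the final answer of the picked response counts."""
    picked = [best_of_n(cands) for cands in problems]
    return sum(c["answer_correct"] for c in picked) / len(picked)


def process_aware_accuracy(problems: List[List[Dict]]) -> float:
    """Stricter variant: the picked response needs a correct answer AND a
    correct process (hypothetical label, e.g. from human annotation)."""
    picked = [best_of_n(cands) for cands in problems]
    ok = sum(c["answer_correct"] and c["process_correct"] for c in picked)
    return ok / len(picked)


def min_score_on_last_step(step_scores: Sequence[float]) -> bool:
    """True when the lowest step score falls on the final answer step."""
    return list(step_scores).index(min(step_scores)) == len(step_scores) - 1


# Toy data: in the first problem the PRM tolerates a flawed process that
# happens to end in the correct answer.
problems = [
    [
        {"step_scores": [0.9, 0.8, 0.95], "answer_correct": True, "process_correct": False},
        {"step_scores": [0.9, 0.4, 0.9], "answer_correct": False, "process_correct": False},
    ],
    [
        {"step_scores": [0.9, 0.9, 0.2], "answer_correct": False, "process_correct": False},
        {"step_scores": [0.7, 0.8, 0.9], "answer_correct": True, "process_correct": True},
    ],
]
print(bon_accuracy(problems), process_aware_accuracy(problems))
```

On this toy data the answer-only metric is higher than the process-aware one, which is the kind of gap the abstract attributes to responses with correct answers but flawed processes. Tracking how often `min_score_on_last_step` is true gives a rough view of the third bias, where a PRM's lowest scores cluster on the final answer step.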

arXiv information

Authors	Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Published	2025-06-05 16:34:24+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
