The Lessons of Developing Process Reward Models in Mathematical Reasoning

Summary

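Process reward models (PRMs) have emerged as a promising approach to process supervision in the mathematical reasoning of large language models (LLMs); they aim to identify and mitigate intermediate errors in the reasoning process.
However, developing effective PRMs faces significant challenges, particularly in data annotation and evaluation methodology.
In this paper, extensive experiments show that the Monte Carlo (MC) estimation-based data synthesis commonly used for PRMs typically yields worse performance and generalization than LLM-as-a-judge and human annotation.
MC estimation relies on a completion model to judge the correctness of the current step, which makes step verification inaccurate.
We also identify potential biases in the conventional Best-of-N (BoN) evaluation strategy for PRMs:
(1) Unreliable policy models generate responses with correct answers but flawed processes, which misaligns the BoN evaluation criterion with the PRM objective of process verification.
(2) The tolerance of PRMs for such responses inflates BoN scores.
(3) In existing PRMs, a significant proportion of minimum scores is concentrated on the final answer step, revealing a shift from process-based to outcome-based assessment in BoN-optimized PRMs.
To address these challenges, we develop a consensus filtering mechanism that integrates MC estimation with LLM-as-a-judge, and we advocate a more comprehensive evaluation framework that combines response-level and step-level metrics.
With these mechanisms, we significantly improve both model performance and data efficiency in BoN evaluation and in the step-wise error identification task.
Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives, and we provide practical guidelines for future research on building process supervision models.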
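
The summary names MC estimation and consensus filtering without spelling out either procedure, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes hard MC labels (a step counts as correct if at least one completion sampled from it reaches the correct final answer) and assumes that "consensus" means the MC labels and the LLM judge agree on where the first error occurs. The function names and the dictionary keys `rollouts` and `judge_labels` are hypothetical.

```python
from typing import Dict, List, Optional


def mc_step_labels(rollout_outcomes: List[List[bool]]) -> List[bool]:
    """Hard-label MC estimation (assumed variant).

    rollout_outcomes[i][j] is True when the j-th completion sampled from
    step i reached the correct final answer. A step is labelled correct
    if at least one of its completions succeeded.
    """
    return [any(outcomes) for outcomes in rollout_outcomes]


def first_error_index(labels: List[bool]) -> Optional[int]:
    """Return the index of the first step labelled incorrect, or None."""
    for i, ok in enumerate(labels):
        if not ok:
            return i
    return None


def consensus_filter(samples: List[Dict]) -> List[Dict]:
    """Keep only samples where both annotators agree on the first error.

    Each sample is a dict with hypothetical keys:
      "rollouts":     per-step completion outcomes for MC estimation
      "judge_labels": per-step correctness verdicts from an LLM judge
    """
    kept = []
    for sample in samples:
        mc_err = first_error_index(mc_step_labels(sample["rollouts"]))
        judge_err = first_error_index(sample["judge_labels"])
        if mc_err == judge_err:  # includes the case where both find no error
            kept.append(sample)
    return kept
```

Under these assumptions, a solution whose flawed step still leads to a correct answer in some completion is labelled correct by MC estimation but may be flagged by the judge; the filter drops such disagreements instead of training on either label.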

Summary (original)

Process Reward Models (PRMs) emerge as a promising approach for process supervision in mathematical reasoning of Large Language Models (LLMs), which aim to identify and mitigate intermediate errors in the reasoning processes. However, the development of effective PRMs faces significant challenges, particularly in data annotation and evaluation methodologies. In this paper, through extensive experiments, we demonstrate that commonly used Monte Carlo (MC) estimation-based data synthesis for PRMs typically yields inferior performance and generalization compared to LLM-as-a-judge and human annotation methods. MC estimation relies on completion models to evaluate current-step correctness, leading to inaccurate step verification. Furthermore, we identify potential biases in conventional Best-of-N (BoN) evaluation strategies for PRMs: (1) The unreliable policy models generate responses with correct answers but flawed processes, leading to a misalignment between the evaluation criteria of BoN and the PRM objectives of process verification. (2) The tolerance of PRMs of such responses leads to inflated BoN scores. (3) Existing PRMs have a significant proportion of minimum scores concentrated on the final answer steps, revealing the shift from process to outcome-based assessment in BoN Optimized PRMs. To address these challenges, we develop a consensus filtering mechanism that effectively integrates MC estimation with LLM-as-a-judge and advocates a more comprehensive evaluation framework that combines response-level and step-level metrics. Based on the mechanisms, we significantly improve both model performance and data efficiency in the BoN evaluation and the step-wise error identification task. Finally, we release a new state-of-the-art PRM that outperforms existing open-source alternatives and provides practical guidelines for future research in building process supervision models.
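
To make the BoN discussion concrete, here is a small sketch of Best-of-N selection with a PRM, plus the diagnostic mentioned in point (3). The abstract does not say how step scores are turned into a response score, so the `min`, `last`, and `product` options are assumptions chosen for illustration, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def response_score(step_scores: Sequence[float], how: str = "min") -> float:
    """Aggregate per-step PRM scores into one response-level score."""
    if not step_scores:
        raise ValueError("step_scores must not be empty")
    if how == "min":
        return min(step_scores)
    if how == "last":
        return step_scores[-1]
    if how == "product":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates: List[Sequence[float]], how: str = "min") -> int:
    """Return the index of the candidate with the highest aggregated score."""
    scores = [response_score(c, how) for c in candidates]
    return max(range(len(scores)), key=scores.__getitem__)


def min_on_final_step_ratio(candidates: List[Sequence[float]]) -> float:
    """Fraction of responses whose lowest step score is on the last step.

    A high ratio suggests the PRM is behaving like an outcome-based scorer.
    """
    hits = 0
    for scores in candidates:
        argmin = min(range(len(scores)), key=scores.__getitem__)
        if argmin == len(scores) - 1:
            hits += 1
    return hits / len(candidates)
```

Note that BoN only checks whether the selected response has the correct final answer, which is why a PRM that tolerates flawed intermediate steps can still score well on it.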

arXiv information

Authors	Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin
Published	2025-01-13 13:10:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
