Process Reward Modeling with Entropy-Driven Uncertainty

Summary

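This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a new framework that comes close to state-of-the-art performance in process supervision while substantially reducing training cost.
EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism that uses the entropy of the logit distribution to identify high-uncertainty regions dynamically during token generation.
This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a key challenge in process supervision.
In experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries, accuracy closely approximates that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), with a 98% reduction in query cost compared to prior methods.
This work establishes EDU-PRM as an efficient approach to scalable process reward model training.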

Summary (original)

This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step partitioning mechanism, using logit distribution entropy to pinpoint high-uncertainty regions during token generation dynamically. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries demonstrate accuracy closely approximating the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), achieving a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach for scalable process reward model training.
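
The abstract describes entropy-guided step partitioning only at a high level, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes that a new step begins at any token whose softmax entropy exceeds a fixed threshold; the function names, the threshold value, and the toy logits are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def split_steps(
    tokens: Sequence[str],
    logits_per_token: Sequence[Sequence[float]],
    threshold: float,
) -> List[List[str]]:
    """Group tokens into steps, opening a new step at high-entropy tokens.

    Assumption (not from the paper): a token is a step boundary when the
    entropy of the distribution it was sampled from exceeds `threshold`.
    """
    steps: List[List[str]] = []
    current: List[str] = []
    for tok, logits in zip(tokens, logits_per_token):
        # Never emit an empty step, even if the first token is uncertain.
        if current and softmax_entropy(logits) > threshold:
            steps.append(current)
            current = []
        current.append(tok)
    if current:
        steps.append(current)
    return steps


# Toy example: a 3-word vocabulary, with one maximally uncertain position.
peaked = [10.0, 0.0, 0.0]  # confident prediction, entropy near 0
flat = [0.0, 0.0, 0.0]     # uniform prediction, entropy = ln(3)
tokens = ["x", "=", "2", "so", "y", "=", "4"]
logits = [peaked, peaked, peaked, flat, peaked, peaked, peaked]
steps = split_steps(tokens, logits, threshold=1.0)
```

In this toy input, only the token "so" is generated from a high-entropy distribution, so the sequence is split into two steps at that position. A real system would read the logits from the language model at each decoding step and would likely tune the threshold empirically.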

arXiv information

Authors	Lang Cao, Renhong Chen, Yingtian Zou, Chao Peng, Wu Ning, Huacong Xu, Qian Chen, Yuxian Wang, Peishuo Su, Mofan Peng, Zijie Chen, Yitong Li
Published	2025-03-28 08:33:37+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
