Better Process Supervision with Bi-directional Rewarding Signals

Summary

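Process supervision, that is, evaluating each individual step, is essential for complex large language model (LLM) reasoning and for test-time search with increased inference compute.
Existing approaches, represented by process reward models (PRMs), mainly reward signals up to the current step; they are one-directional and lack a mechanism for modeling the distance to the final target.
To address this problem, we draw inspiration from the A* algorithm, which holds that an effective supervisory signal should consider both the cost already incurred and the estimated cost of reaching the target.
Building on this key insight, we introduce BiRM, a new process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success.
We conduct extensive experiments on mathematical reasoning tasks and show that BiRM evaluates LLM reasoning steps more precisely, achieving a 3.1% improvement over PRM on Gaokao2023 under Best-of-N sampling.
In addition, in search-based strategies, BiRM provides more comprehensive guidance, outperforming ORM by 5.0% and PRM by 3.8% on MATH-500.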
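
The A* analogy can be made concrete with a small sketch. A* ranks a node by f(n) = g(n) + h(n), the cost incurred so far plus an estimate of the cost still to come. The code below applies the same two-part idea to Best-of-N selection, with rewards maximized instead of costs minimized. Everything in it is an illustrative assumption and not the paper's implementation: the `Candidate` fields, the use of the minimum step reward as the backward signal, the weight `beta`, and the additive combination are all hypothetical, since the abstract does not specify how BiRM combines its two signals.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    """One sampled solution (all fields are hypothetical, for illustration)."""
    answer: str
    # Backward signal: per-step correctness scores in [0, 1], PRM-style.
    step_rewards: List[float]
    # Forward signal: estimated probability that the solution ends correctly.
    future_value: float


def bidirectional_score(c: Candidate, beta: float = 1.0) -> float:
    """Combine the backward and forward signals, analogous to g(n) + h(n).

    Assumption: the backward signal is aggregated as the minimum step
    reward, and the two signals are added with weight `beta`.
    """
    past = min(c.step_rewards) if c.step_rewards else 0.0
    return past + beta * c.future_value


def best_of_n(candidates: List[Candidate], beta: float = 1.0) -> Candidate:
    """Return the candidate with the highest combined score."""
    if not candidates:
        raise ValueError("candidates must not be empty")
    return max(candidates, key=lambda c: bidirectional_score(c, beta))


if __name__ == "__main__":
    pool = [
        Candidate("A", step_rewards=[0.9, 0.8], future_value=0.2),
        Candidate("B", step_rewards=[0.7, 0.7], future_value=0.6),
    ]
    # beta=0.0 ignores the forward signal, mimicking a one-directional scorer.
    print(best_of_n(pool, beta=0.0).answer)
    # beta=1.0 weighs both signals equally.
    print(best_of_n(pool, beta=1.0).answer)
```

In this toy pool, setting `beta` to zero reduces the scorer to a purely backward-looking one, so the two settings can select different candidates; that difference is the effect the bidirectional signal is meant to capture.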

Summary (original)

Process supervision, i.e., evaluating each step, is critical for complex large language model (LLM) reasoning and test-time searching with increased inference compute. Existing approaches, represented by process reward models (PRMs), primarily focus on rewarding signals up to the current step, exhibiting a one-directional nature and lacking a mechanism to model the distance to the final target. To address this problem, we draw inspiration from the A* algorithm, which states that an effective supervisory signal should simultaneously consider the incurred cost and the estimated cost for reaching the target. Building on this key insight, we introduce BiRM, a novel process supervision model that not only evaluates the correctness of previous steps but also models the probability of future success. We conduct extensive experiments on mathematical reasoning tasks and demonstrate that BiRM provides more precise evaluations of LLM reasoning steps, achieving an improvement of 3.1% on Gaokao2023 over PRM under the Best-of-N sampling method. Besides, in search-based strategies, BiRM provides more comprehensive guidance and outperforms ORM by 5.0% and PRM by 3.8% respectively on MATH-500.

arXiv information

Authors	Wenxiang Chen, Wei He, Zhiheng Xi, Honglin Guo, Boyang Hong, Jiazheng Zhang, Rui Zheng, Nijun Li, Tao Gui, Yun Li, Qi Zhang, Xuanjing Huang
Published	2025-03-06 17:03:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
