PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models

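Summary

Process-level reward models (PRMs) are essential for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process.
Because language models are prone to many kinds of errors while reasoning, PRMs need nuanced capabilities for detecting various implicit error types in real-world scenarios.
However, current benchmarks focus mainly on step correctness and fail to evaluate PRM performance systematically.
To address this gap, we introduce PRMBench, a process-level benchmark designed specifically to assess the fine-grained error detection capabilities of PRMs.
PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, and evaluates models along multiple dimensions, including simplicity, soundness, and sensitivity.
Experiments on 15 models, covering both open-source PRMs and closed-source large language models prompted as critic models, reveal significant weaknesses in current PRMs.
These findings underscore the challenges inherent in process-level evaluation and point to key directions for future research.
We hope PRMBench will serve as a robust benchmark for advancing research on PRM evaluation and development.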

Summary (original)

Process-level Reward Models (PRMs) are crucial for complex reasoning and decision-making tasks, where each intermediate step plays an important role in the reasoning process. Since language models are prone to various types of errors during the reasoning process, PRMs are required to possess nuanced capabilities for detecting various implicit error types in real-world scenarios. However, current benchmarks primarily focus on step correctness, failing to evaluate PRMs’ performance systematically. To address this gap, we introduce PRMBench, a process-level benchmark specifically designed to assess the fine-grained error detection capabilities of PRMs. PRMBench comprises 6,216 carefully designed problems and 83,456 step-level labels, evaluating models across multiple dimensions, including simplicity, soundness, and sensitivity. In our experiments on 15 models, spanning both open-source PRMs and closed-source large language models prompted as critic models, we uncover significant weaknesses in current PRMs. These findings underscore the challenges inherent in process-level evaluation and highlight key directions for future research. We hope PRMBench can be a robust bench for advancing research on PRM evaluation and development.

arXiv information

Authors	Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, Yu Cheng
Published	2025-01-07 12:33:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
