VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models

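Summary

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance on reasoning tasks.
A key component of their training is the use of verifiable rewards in reinforcement learning (RL).
However, existing reward benchmarks do not evaluate reference-based reward systems, so researchers have only a limited understanding of how accurate the verifiers used in RL actually are.
This paper introduces two benchmarks, VerifyBench and VerifyBench-Hard, designed to evaluate the performance of reference-based reward systems.
The benchmarks are built through meticulous data collection and curation, followed by careful human annotation to ensure high quality.
Current models, especially smaller ones, still show considerable room for improvement on both VerifyBench and VerifyBench-Hard.
The authors also conduct a thorough analysis of the evaluation results, offering insights for understanding and developing reference-based reward systems.
The proposed benchmarks serve as effective tools for guiding improvements in verifier accuracy and in the reasoning capabilities of models trained via RL.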

Abstract (original)

Large reasoning models such as OpenAI o1 and DeepSeek-R1 have achieved remarkable performance in the domain of reasoning. A key component of their training is the incorporation of verifiable rewards within reinforcement learning (RL). However, existing reward benchmarks do not evaluate reference-based reward systems, leaving researchers with limited understanding of the accuracy of verifiers used in RL. In this paper, we introduce two benchmarks, VerifyBench and VerifyBench-Hard, designed to assess the performance of reference-based reward systems. These benchmarks are constructed through meticulous data collection and curation, followed by careful human annotation to ensure high quality. Current models still show considerable room for improvement on both VerifyBench and VerifyBench-Hard, especially smaller-scale models. Furthermore, we conduct a thorough and comprehensive analysis of evaluation results, offering insights for understanding and developing reference-based reward systems. Our proposed benchmarks serve as effective tools for guiding the development of verifier accuracy and the reasoning capabilities of models trained via RL in reasoning tasks.
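To make the idea of a reference-based reward system concrete, here is a minimal sketch in Python. It is not the paper's method or data format: the `Example` fields, the toy `rule_based_verifier`, and the three sample items are all hypothetical, chosen only to illustrate how a verifier's judgments could be scored against human annotations.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    # Hypothetical record layout; the benchmark's real schema may differ.
    question: str
    reference: str     # ground-truth answer
    completion: str    # model response to be judged
    human_label: bool  # human annotation: is the completion correct?


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods."""
    return " ".join(text.lower().split()).rstrip(".")


def rule_based_verifier(reference: str, completion: str) -> bool:
    """Toy verifier (an assumption, not the paper's): compare the text
    after the last 'answer:' marker with the reference answer."""
    lowered = completion.lower()
    marker = "answer:"
    idx = lowered.rfind(marker)
    final = lowered[idx + len(marker):] if idx != -1 else lowered
    return normalize(final) == normalize(reference)


def verifier_accuracy(
    examples: List[Example],
    verifier: Callable[[str, str], bool],
) -> float:
    """Fraction of examples where the verifier agrees with the human label."""
    if not examples:
        raise ValueError("examples must not be empty")
    agree = sum(
        verifier(e.reference, e.completion) == e.human_label for e in examples
    )
    return agree / len(examples)


# Made-up items for illustration only.
EXAMPLES = [
    Example("What is 6 * 7?", "42", "Let x = 6 * 7. Answer: 42.", True),
    Example("What is 1 divided by 2?", "1/2", "Answer: 0.5", True),
    Example("Capital of France?", "Paris", "Answer: London", False),
]

accuracy = verifier_accuracy(EXAMPLES, rule_based_verifier)
print(f"verifier accuracy: {accuracy:.3f}")
```

In this sketch the string-matching verifier rejects the second item, because `0.5` and `1/2` are equivalent but not textually identical, so it disagrees with the human label there. Measuring this kind of disagreement between a verifier and human annotators is the purpose the abstract describes for the benchmarks.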

arXiv information

Authors	Yuchen Yan, Jin Jiang, Zhenbang Ren, Yijun Li, Xudong Cai, Yang Liu, Xin Xu, Mengdi Zhang, Jian Shao, Yongliang Shen, Jun Xiao, Yueting Zhuang
Published	2025-05-21 17:54:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
