Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards

Summary

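Large language models (LLMs) show great promise in complex reasoning, and reinforcement learning with verifiable rewards (RLVR) is a key strategy for enhancing it.
However, a common problem is "superficial self-reflection," in which a model cannot robustly verify its own outputs.
We introduce RISE (Reinforcing Reasoning with Self-Verification), a new online RL framework designed to address this.
RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and its self-verification abilities within a single, integrated RL process.
The core mechanism uses verifiable rewards from an outcome verifier to provide on-the-fly feedback for both the solution-generation and the self-verification tasks.
In each iteration, the model generates solutions and then critiques its own on-policy generated solutions, and both trajectories contribute to the policy update.
Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model's problem-solving accuracy while fostering strong self-verification skills.
Our analysis highlights the advantages of online verification and the benefits of increased verification compute.
In addition, RISE models exhibit more frequent and more accurate self-verification behavior during reasoning.
These advantages position RISE as a flexible and effective path toward developing more robust and self-aware reasoners.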

Summary (original)

Large Language Models (LLMs) show great promise in complex reasoning, with Reinforcement Learning with Verifiable Rewards (RLVR) being a key enhancement strategy. However, a prevalent issue is “superficial self-reflection”, where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this. RISE explicitly and simultaneously trains an LLM to improve both its problem-solving and self-verification abilities within a single, integrated RL process. The core mechanism involves leveraging verifiable rewards from an outcome verifier to provide on-the-fly feedback for both solution generation and self-verification tasks. In each iteration, the model generates solutions, then critiques its own on-policy generated solutions, with both trajectories contributing to the policy update. Extensive experiments on diverse mathematical reasoning benchmarks show that RISE consistently improves the model’s problem-solving accuracy while concurrently fostering strong self-verification skills. Our analyses highlight the advantages of online verification and the benefits of increased verification compute. Additionally, RISE models exhibit more frequent and accurate self-verification behaviors during reasoning. These advantages reinforce RISE as a flexible and effective path towards developing more robust and self-aware reasoners.
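The per-iteration loop described in the abstract (generate a solution, critique that same on-policy solution, score both with an outcome verifier) can be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the names `Trajectory`, `outcome_verifier`, and `collect_trajectories`, the exact-match verifier, the verification prompt wording, and the rule that a verification earns reward when its predicted 0/1 score agrees with the outcome verifier are all assumptions made for the example. The policy-gradient update itself is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    prompt: str
    response: str
    reward: float
    task: str  # "generation" or "verification"


def outcome_verifier(answer: str, reference: str) -> float:
    """Assumed rule-based check: 1.0 on exact match after stripping, else 0.0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def collect_trajectories(
    problem: str,
    reference: str,
    policy: Callable[[str], str],  # hypothetical stand-in for the LLM being trained
    num_samples: int = 4,
) -> List[Trajectory]:
    """Collect generation and self-verification trajectories for one problem."""
    batch: List[Trajectory] = []
    for _ in range(num_samples):
        # 1) The policy proposes a solution; the outcome verifier scores it.
        solution = policy(problem)
        gen_reward = outcome_verifier(solution, reference)
        batch.append(Trajectory(problem, solution, gen_reward, "generation"))

        # 2) The same policy critiques its own on-policy solution.
        ver_prompt = (
            f"Problem: {problem}\n"
            f"Proposed answer: {solution}\n"
            "Is the answer correct? Reply 1 or 0."
        )
        predicted = policy(ver_prompt)

        # 3) Assumed verification reward: agreement with the outcome verifier.
        expected = str(int(gen_reward))
        ver_reward = 1.0 if predicted.strip() == expected else 0.0
        batch.append(Trajectory(ver_prompt, predicted, ver_reward, "verification"))
    # Both kinds of trajectory would feed the same policy update.
    return batch
```

In this sketch a verification trajectory is rewarded independently of whether the solution was right: correctly flagging a wrong answer earns the same reward as correctly accepting a right one.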

arXiv information

Authors	Xiaoyuan Liu,Tian Liang,Zhiwei He,Jiahao Xu,Wenxuan Wang,Pinjia He,Zhaopeng Tu,Haitao Mi,Dong Yu
Published	2025-05-19 17:59:31+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
