VerIF: Verification Engineering for Reinforcement Learning in Instruction Following

Summary

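Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), and verification engineering plays a central role in it.
However, best practices for RL in instruction following remain underexplored.
This work examines the verification challenge in RL for instruction following and proposes VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B).
To support this approach, the authors construct VerInstruct, a high-quality instruction-following dataset containing approximately 22,000 instances with associated verification signals.
They apply RL training with VerIF to two models and achieve significant improvements on several representative instruction-following benchmarks.
The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints.
The authors further observe that the models' general capabilities remain unaffected, which suggests that RL with VerIF can be integrated into existing RL recipes to improve overall model performance.
The datasets, code, and models have been released at https://github.com/THU-KEG/VerIF to facilitate future research.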

Abstract (original)

Reinforcement learning with verifiable rewards (RLVR) has become a key technique for enhancing large language models (LLMs), with verification engineering playing a central role. However, best practices for RL in instruction following remain underexplored. In this work, we explore the verification challenge in RL for instruction following and propose VerIF, a verification method that combines rule-based code verification with LLM-based verification from a large reasoning model (e.g., QwQ-32B). To support this approach, we construct a high-quality instruction-following dataset, VerInstruct, containing approximately 22,000 instances with associated verification signals. We apply RL training with VerIF to two models, achieving significant improvements across several representative instruction-following benchmarks. The trained models reach state-of-the-art performance among models of comparable size and generalize well to unseen constraints. We further observe that their general capabilities remain unaffected, suggesting that RL with VerIF can be integrated into existing RL recipes to enhance overall model performance. We have released our datasets, codes, and models to facilitate future research at https://github.com/THU-KEG/VerIF.
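The combination of rule-based and LLM-based verification described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Instance` layout, the `LLMJudge` callable, the helper checks, and the choice to score a response by the fraction of satisfied constraints are all assumptions made for illustration, and the judge is a stub standing in for a call to a reasoning model.

```python
from dataclasses import dataclass
from typing import Callable, List

# A hard constraint is checked by code: response -> True/False.
RuleCheck = Callable[[str], bool]
# A soft constraint is judged by a model: (instruction, response, constraint) -> True/False.
# Hypothetical interface; in practice it would wrap a call to a reasoning LLM.
LLMJudge = Callable[[str, str, str], bool]


@dataclass
class Instance:
    instruction: str
    rule_checks: List[RuleCheck]   # constraints verifiable by code
    soft_constraints: List[str]    # constraints stated in natural language


def hybrid_reward(instance: Instance, response: str, judge: LLMJudge) -> float:
    """Fraction of constraints satisfied (an assumed aggregation rule)."""
    results = [check(response) for check in instance.rule_checks]
    results += [
        judge(instance.instruction, response, constraint)
        for constraint in instance.soft_constraints
    ]
    if not results:
        return 0.0
    return sum(results) / len(results)


def max_words(limit: int) -> RuleCheck:
    """Rule: the response has at most `limit` whitespace-separated words."""
    return lambda text: len(text.split()) <= limit


def contains(keyword: str) -> RuleCheck:
    """Rule: the response mentions `keyword`, ignoring case."""
    return lambda text: keyword.lower() in text.lower()


def stub_judge(instruction: str, response: str, constraint: str) -> bool:
    """Stand-in for an LLM verifier; always accepts."""
    return True


inst = Instance(
    instruction="Describe RLVR in at most 10 words, mention 'reward', use a formal tone.",
    rule_checks=[max_words(10), contains("reward")],
    soft_constraints=["The tone is formal."],
)
reward = hybrid_reward(
    inst, "RLVR trains models using a verifiable reward signal.", stub_judge
)
```

In this sketch, constraints that code can check exactly (length, keywords) never reach the judge, so the model-based verifier is reserved for constraints such as tone or style that have no simple programmatic test.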

arXiv information

Authors	Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Published	2025-06-11 17:10:36+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
