VAPO: Efficient and Reliable Reinforcement Learning for Advanced Reasoning Tasks

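Abstract

We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm.
Built on the Qwen 32B pre-trained model and benchmarked on the AIME 2024 dataset, VAPO attains a state-of-the-art score of $\mathbf{60.4}$.
In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points.
VAPO's training process stands out for its stability and efficiency.
It reaches state-of-the-art performance within just 5,000 steps.
Moreover, no training crashes occur across multiple independent runs, underscoring its reliability.
This research studies long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework.
We identify three key challenges that plague value-based methods: value model bias, heterogeneous sequence lengths, and sparse reward signals.
Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, improving performance on long-CoT reasoning tasks.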

Abstract (original)

We present VAPO (Value-based Augmented Proximal Policy Optimization), a novel framework tailored for reasoning models within the value-based paradigm. Benchmarked on the AIME 2024 dataset, VAPO, built on the Qwen 32B pre-trained model, attains a state-of-the-art score of $\mathbf{60.4}$. In direct comparison under identical experimental settings, VAPO outperforms the previously reported results of DeepSeek-R1-Zero-Qwen-32B and DAPO by more than 10 points. The training process of VAPO stands out for its stability and efficiency. It reaches state-of-the-art performance within a mere 5,000 steps. Moreover, across multiple independent runs, no training crashes occur, underscoring its reliability. This research delves into long chain-of-thought (long-CoT) reasoning using a value-based reinforcement learning framework. We pinpoint three key challenges that plague value-based methods: value model bias, the presence of heterogeneous sequence lengths, and the sparsity of reward signals. Through systematic design, VAPO offers an integrated solution that effectively alleviates these challenges, enabling enhanced performance in long-CoT reasoning tasks.

arXiv information

Authors	Yu Yue, Yufeng Yuan, Qiying Yu, Xiaochen Zuo, Ruofei Zhu, Wenyuan Xu, Jiaze Chen, Chengyi Wang, TianTian Fan, Zhengyin Du, Xiangpeng Wei, Xiangyu Yu, Gaohong Liu, Juncai Liu, Lingjun Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Ru Zhang, Xin Liu, Mingxuan Wang, Yonghui Wu, Lin Yan
Published	2025-04-08 03:06:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
