TreeRPO: Tree Relative Policy Optimization

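Abstract

Large language models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods.
However, a key limitation of existing approaches is that rewards defined at the full-trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process.
To address this, we introduce TreeRPO, a novel method that uses tree sampling to estimate the mathematical expectation of the reward at various reasoning steps.
Unlike prior methods that rely on a separate step reward model, TreeRPO estimates these rewards directly through the sampling process.
Building on the group-relative reward training mechanism of GRPO, TreeRPO computes rewards based on the step-level groups generated during tree sampling.
This allows TreeRPO to produce fine-grained reward signals, significantly improving the learning process and overall performance of LLMs.
Experimental results show that TreeRPO substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, from 19.0% to 35.5%.
Furthermore, TreeRPO outperforms GRPO by 2.9% while reducing the average response length by 18.1%, demonstrating its effectiveness and efficiency.
Our code will be available at https://github.com/yangzhch6/TreeRPO.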

Abstract (original)

Large Language Models (LLMs) have shown remarkable reasoning capabilities through Reinforcement Learning with Verifiable Rewards (RLVR) methods. However, a key limitation of existing approaches is that rewards defined at the full trajectory level provide insufficient guidance for optimizing the intermediate steps of a reasoning process. To address this, we introduce TreeRPO, a novel method that estimates the mathematical expectations of rewards at various reasoning steps using tree sampling. Unlike prior methods that rely on a separate step reward model, TreeRPO directly estimates these rewards through this sampling process. Building on the group-relative reward training mechanism of GRPO, TreeRPO innovatively computes rewards based on step-level groups generated during tree sampling. This advancement allows TreeRPO to produce fine-grained and dense reward signals, significantly enhancing the learning process and overall performance of LLMs. Experimental results demonstrate that our TreeRPO algorithm substantially improves the average Pass@1 accuracy of Qwen-2.5-Math on test benchmarks, increasing it from 19.0% to 35.5%. Furthermore, TreeRPO significantly outperforms GRPO by 2.9% in performance while simultaneously reducing the average response length by 18.1%, showcasing its effectiveness and efficiency. Our code will be available at https://github.com/yangzhch6/TreeRPO.
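
The abstract describes two ideas: estimating the expected reward of an intermediate reasoning step from sampled continuations in a tree, and scoring each step relative to the other steps in its group. The Python sketch below illustrates both under stated assumptions and is not the authors' implementation. The `Node` class and the functions `estimate_value` and `sibling_advantages` are hypothetical names; the reward is assumed to be a verifiable score attached to leaf nodes only; and the mean/standard-deviation normalization is a common GRPO-style choice assumed here, since the abstract does not specify the exact formula TreeRPO uses.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev
from typing import List, Optional


@dataclass
class Node:
    """One reasoning step in a sampled tree (hypothetical structure)."""
    step: str
    reward: Optional[float] = None  # verifiable reward, set on leaves only
    children: List["Node"] = field(default_factory=list)


def estimate_value(node: Node) -> float:
    """Monte Carlo estimate of the expected reward of a step.

    A leaf returns its own verifiable reward; an internal node returns
    the average of its children's estimates.
    """
    if not node.children:
        if node.reward is None:
            raise ValueError("leaf node needs a verifiable reward")
        return node.reward
    return mean(estimate_value(child) for child in node.children)


def sibling_advantages(parent: Node, eps: float = 1e-6) -> List[float]:
    """Group-relative score of each child step within its sibling group.

    Assumption: subtract the group mean and divide by the group standard
    deviation, as in a GRPO-style normalization.
    """
    values = [estimate_value(child) for child in parent.children]
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


# Toy tree: two candidate first steps, each continued twice.
root = Node("question", children=[
    Node("step A", children=[Node("A1", reward=1.0), Node("A2", reward=1.0)]),
    Node("step B", children=[Node("B1", reward=1.0), Node("B2", reward=0.0)]),
])

print(estimate_value(root))      # estimated success rate from the root
print(sibling_advantages(root))  # step A scores above step B
```

In this toy tree, step A leads to a correct answer in both continuations and step B in only one, so step A receives a positive group-relative score and step B a negative one, even though both steps have at least one successful continuation. A trajectory-level reward alone would not separate them this way.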

arXiv information

Authors	Zhicheng Yang, Zhijiang Guo, Yinya Huang, Xiaodan Liang, Yiwei Wang, Jing Tang
Published	2025-06-05 15:56:38+00:00

Source and services used

arxiv.jp, Google
