TreeRL: LLM Reinforcement Learning with On-Policy Tree Search

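Summary

Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks.
Compared with conventional independent chain sampling under outcome supervision, tree search explores the reasoning space more thoroughly and provides dense, on-policy process rewards during RL training, yet it remains under-explored in on-policy LLM RL.
We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search into RL training.
Our approach includes intermediate supervision and eliminates the need to train a separate reward model.
Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking.
We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by branching strategically from high-uncertainty intermediate steps rather than branching at random.
Experiments on challenging math and code reasoning benchmarks show that TreeRL outperforms traditional ChainRL, highlighting the potential of tree search for LLMs.
TreeRL is open-sourced at https://github.com/THUDM/TreeRL.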
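The branching idea in the summary (fork from high-uncertainty intermediate steps instead of random ones) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the summary does not say how uncertainty is measured, so the sketch uses Shannon entropy of the next-token distribution, and the names `token_entropy` and `pick_branch_points` are hypothetical and not taken from the TreeRL repository.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def pick_branch_points(
    step_probs: Sequence[Sequence[float]], num_branches: int
) -> List[int]:
    """Return the indices of the most uncertain steps, in sequence order.

    Assumption: uncertainty is the entropy of the distribution at each step.
    """
    entropies = [token_entropy(p) for p in step_probs]
    # Rank step indices from highest to lowest entropy.
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    # Keep the top candidates and restore their order along the chain.
    return sorted(ranked[:num_branches])


# Toy next-token distributions for a four-step chain (made-up numbers).
chain = [
    [0.97, 0.01, 0.01, 0.01],  # confident step
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain step
    [0.90, 0.05, 0.03, 0.02],  # fairly confident step
    [0.40, 0.30, 0.20, 0.10],  # uncertain step
]
branch_points = pick_branch_points(chain, num_branches=2)
print(branch_points)
```

In a real search loop, new continuations would be sampled from the prefixes ending at the selected steps, so the token budget is spent where the policy is least sure of itself.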

Summary (original)

Reinforcement learning (RL) with tree search has demonstrated superior performance in traditional reasoning tasks. Compared to conventional independent chain sampling strategies with outcome supervision, tree search enables better exploration of the reasoning space and provides dense, on-policy process rewards during RL training but remains under-explored in On-Policy LLM RL. We propose TreeRL, a reinforcement learning framework that directly incorporates on-policy tree search for RL training. Our approach includes intermediate supervision and eliminates the need for a separate reward model training. Existing approaches typically train a separate process reward model, which can suffer from distribution mismatch and reward hacking. We also introduce a cost-effective tree search approach that achieves higher search efficiency under the same generation token budget by strategically branching from high-uncertainty intermediate steps rather than using random branching. Experiments on challenging math and code reasoning benchmarks demonstrate that TreeRL achieves superior performance compared to traditional ChainRL, highlighting the potential of tree search for LLM. TreeRL is open-sourced at https://github.com/THUDM/TreeRL.

arXiv information

Authors	Zhenyu Hou, Ziniu Hu, Yujiang Li, Rui Lu, Jie Tang, Yuxiao Dong
Published	2025-06-13 15:52:37+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
