DAPO: An Open-Source LLM Reinforcement Learning System at Scale

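Summary

Inference scaling gives LLMs unprecedented reasoning ability, and reinforcement learning is the core technique for eliciting complex reasoning.
However, key technical details of state-of-the-art reasoning LLMs are concealed (for example, in the OpenAI o1 blog and the DeepSeek R1 technical report), so the community still struggles to reproduce their RL training results.
We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using the Qwen2.5-32B base model.
Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL succeed.
In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset.
These components of our open-source system improve reproducibility and support future research in large-scale LLM RL.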

Abstract (original)

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the $\textbf{D}$ecoupled Clip and $\textbf{D}$ynamic s$\textbf{A}$mpling $\textbf{P}$olicy $\textbf{O}$ptimization ($\textbf{DAPO}$) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.
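The abstract names "decoupled clip" and "dynamic sampling" without defining them. The sketch below shows one plausible reading and is not taken from the released training code: a PPO-style clipped surrogate whose lower and upper clip ranges are set independently, and a filter that drops prompts whose sampled responses all earned the same reward (such groups carry no relative learning signal). The function names, the default `eps_low` and `eps_high` values, and the filtering rule are all assumptions made for illustration.

```python
import math


def decoupled_clip_term(logp_new, logp_old, advantage,
                        eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower/upper clip ranges.

    eps_low and eps_high are illustrative defaults (assumed), not values
    stated in the abstract.
    """
    ratio = math.exp(logp_new - logp_old)
    # Clip the importance ratio into [1 - eps_low, 1 + eps_high].
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic (minimum) of the unclipped and clipped objectives.
    return min(ratio * advantage, clipped * advantage)


def keep_group(rewards):
    """Assumed dynamic-sampling filter: keep a prompt only if its sampled
    responses do not all share the same reward."""
    return len(set(rewards)) > 1


if __name__ == "__main__":
    # Ratio 1.5 with a positive advantage is capped at 1 + eps_high.
    print(decoupled_clip_term(math.log(1.5), 0.0, 1.0))
    # A group where every response was correct is dropped.
    print(keep_group([1.0, 1.0, 1.0]))
```

Decoupling the two bounds lets the upper range be widened without loosening the lower one; whether and how the authors tune these values is not stated in the abstract.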

arXiv information

Authors	Qiying Yu,Zheng Zhang,Ruofei Zhu,Yufeng Yuan,Xiaochen Zuo,Yu Yue,Tiantian Fan,Gaohong Liu,Lingjun Liu,Xin Liu,Haibin Lin,Zhiqi Lin,Bole Ma,Guangming Sheng,Yuxuan Tong,Chi Zhang,Mofan Zhang,Wang Zhang,Hang Zhu,Jinhua Zhu,Jiaze Chen,Jiangjie Chen,Chengyi Wang,Hongli Yu,Weinan Dai,Yuxuan Song,Xiangpeng Wei,Hao Zhou,Jingjing Liu,Wei-Ying Ma,Ya-Qin Zhang,Lin Yan,Mu Qiao,Yonghui Wu,Mingxuan Wang
Published	2025-03-18 17:49:06+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
