Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs

Summary

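We present Ring-lite, a Mixture-of-Experts (MoE) large language model optimized with reinforcement learning (RL) for efficient and robust reasoning.
Built on the publicly available Ling-lite model, which has 16.8 billion parameters of which 2.75 billion are activated, it matches state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters that comparable models require.
To achieve this, we introduce a joint training pipeline that integrates distillation with RL, and we reveal previously undocumented challenges in MoE RL training.
First, we identify optimization instability during RL training and propose Constrained Contextual Computation Policy Optimization (C3PO), a new approach that improves training stability and computational throughput through algorithm-system co-design.
Second, we show empirically that selecting distillation checkpoints for RL training by entropy loss, rather than by validation metrics, yields better performance-efficiency trade-offs in the subsequent RL training.
Finally, we develop a two-stage training paradigm that harmonizes multi-domain data integration and addresses the domain conflicts that arise when training on mixed datasets.
We will release the model, dataset, and code.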

Summary (original)

We present Ring-lite, a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL) to achieve efficient and robust reasoning capabilities. Built upon the publicly available Ling-lite model, a 16.8 billion parameter model with 2.75 billion activated parameters, our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks (e.g., AIME, LiveCodeBench, GPQA-Diamond) while activating only one-third of the parameters required by comparable models. To accomplish this, we introduce a joint training pipeline integrating distillation with RL, revealing undocumented challenges in MoE RL training. First, we identify optimization instability during RL training, and we propose Constrained Contextual Computation Policy Optimization (C3PO), a novel approach that enhances training stability and improves computational throughput via algorithm-system co-design methodology. Second, we empirically demonstrate that selecting distillation checkpoints based on entropy loss for RL training, rather than validation metrics, yields superior performance-efficiency trade-offs in subsequent RL training. Finally, we develop a two-stage training paradigm to harmonize multi-domain data integration, addressing domain conflicts that arise in training with mixed datasets. We will release the model, dataset, and code.

arXiv information

Authors	Ling Team,Bin Hu,Cai Chen,Deng Zhao,Ding Liu,Dingnan Jin,Feng Zhu,Hao Dai,Hongzhi Luan,Jia Guo,Jiaming Liu,Jiewei Wu,Jun Mei,Jun Zhou,Junbo Zhao,Junwu Xiong,Kaihong Zhang,Kuan Xu,Lei Liang,Liang Jiang,Liangcheng Fu,Longfei Zheng,Qiang Gao,Qing Cui,Quan Wan,Shaomian Zheng,Shuaicheng Li,Tongkai Yang,Wang Ren,Xiaodong Yan,Xiaopei Wan,Xiaoyun Feng,Xin Zhao,Xinxing Yang,Xinyu Kong,Xuemin Yang,Yang Li,Yingting Wu,Yongkang Liu,Zhankai Xu,Zhenduo Zhang,Zhenglei Zhou,Zhenyu Huang,Zhiqiang Zhang,Zihao Wang,Zujie Wen
Published	2025-06-18 02:53:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
