An Empirical Study on Eliciting and Improving R1-like Reasoning Models

Summary

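This report is the third technical report on developing slow-thinking models as part of the STILL project.
As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models.
We systematically experiment with and document the effects of the various factors that influence RL training, running experiments on both base models and fine-tuned models.
Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, increasing both response length and test accuracy.
Furthermore, we show that even a model such as DeepSeek-R1-Distill-Qwen-1.5B, which already performs at a high level, can be further refined through RL training, reaching 39.33% accuracy on AIME 2024. Beyond RL training, we also explore tool manipulation and find that it significantly boosts the reasoning performance of large reasoning models.
This approach achieves 86.67% accuracy with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities.
We release our resources at the STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.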

Abstract (original)

In this report, we present the third technical report on the development of slow-thinking models as part of the STILL project. As the technical pathway becomes clearer, scaling RL training has become a central technique for implementing such reasoning models. We systematically experiment with and document the effects of various factors influencing RL training, conducting experiments on both base models and fine-tuned models. Specifically, we demonstrate that our RL training approach consistently improves the Qwen2.5-32B base models, enhancing both response length and test accuracy. Furthermore, we show that even when a model like DeepSeek-R1-Distill-Qwen-1.5B has already achieved a high performance level, it can be further refined through RL training, reaching an accuracy of 39.33% on AIME 2024. Beyond RL training, we also explore the use of tool manipulation, finding that it significantly boosts the reasoning performance of large reasoning models. This approach achieves a remarkable accuracy of 86.67% with greedy search on AIME 2024, underscoring its effectiveness in enhancing model capabilities. We release our resources at the STILL project website: https://github.com/RUCAIBox/Slow_Thinking_with_LLMs.

arXiv information

Authors	Zhipeng Chen,Yingqian Min,Beichen Zhang,Jie Chen,Jinhao Jiang,Daixuan Cheng,Wayne Xin Zhao,Zheng Liu,Xu Miao,Yang Lu,Lei Fang,Zhongyuan Wang,Ji-Rong Wen
Published	2025-03-06 15:34:27+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
