DeepDistill: Enhancing LLM Reasoning Capabilities via Large-Scale Difficulty-Graded Data Training

Summary

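Large language models (LLMs) have recently achieved remarkable performance on a range of complex reasoning benchmarks, but the academic community still lacks a detailed understanding of base model training processes and data quality.
To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty and about 40 million distilled responses generated by multiple models over several passes.
Using pass rate and the coefficient of variation (CV), we select the training data most valuable for strengthening reasoning capability.
Notably, we observe a shift in training patterns: reasoning-focused training that starts from a base model requires a higher learning rate to be effective.
With this carefully selected data, we substantially improve the base model's reasoning capability, reaching a 79.2% pass rate on the AIME2024 mathematical reasoning benchmark.
This result surpasses most current distilled models and comes close to state-of-the-art performance.
We describe our data processing, difficulty assessment, and training methodology in detail, and we publicly release all datasets and methods to promote rapid progress in open-source long-reasoning LLMs.
The dataset is available at https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M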

Abstract (original)

Although large language models (LLMs) have recently achieved remarkable performance on various complex reasoning benchmarks, the academic community still lacks an in-depth understanding of base model training processes and data quality. To address this, we construct a large-scale, difficulty-graded reasoning dataset containing approximately 3.34 million unique queries of varying difficulty levels and about 40 million distilled responses generated by multiple models over several passes. Leveraging pass rate and Coefficient of Variation (CV), we precisely select the most valuable training data to enhance reasoning capability. Notably, we observe a training pattern shift, indicating that reasoning-focused training based on base models requires higher learning rates for effective training. Using this carefully selected data, we significantly improve the reasoning capabilities of the base model, achieving a pass rate of 79.2% on the AIME2024 mathematical reasoning benchmark. This result surpasses most current distilled models and closely approaches state-of-the-art performance. We provide detailed descriptions of our data processing, difficulty assessment, and training methodology, and have publicly released all datasets and methods to promote rapid progress in open-source long-reasoning LLMs. The dataset is available at: https://huggingface.co/datasets/a-m-team/AM-DeepSeek-Distilled-40M
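
The abstract names pass rate and the coefficient of variation (CV) as the two signals used to grade difficulty, but it does not give the formulas or thresholds. The following is a minimal sketch of how such signals could be computed for a single query, not the authors' implementation. It assumes that pass rate is the fraction of sampled responses judged correct and that CV is the population standard deviation of per-model pass rates divided by their mean. The model names and outcome values are hypothetical.

```python
from statistics import mean, pstdev


def pass_rate(results):
    """Fraction of sampled responses judged correct (results: list of bools)."""
    return sum(results) / len(results)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean; 0.0 when the mean is 0."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


# Hypothetical verification outcomes for one query:
# model name -> correctness of each sampled response (assumption, not paper data).
outcomes = {
    "model_a": [True, False, False, False],
    "model_b": [True, True, False, False],
    "model_c": [True, True, True, False],
}

# Per-model pass rate, then the spread of those rates across models.
rates = {name: pass_rate(r) for name, r in outcomes.items()}
cv = coefficient_of_variation(list(rates.values()))
```

Under these assumptions, a low pass rate suggests a hard query, and a high CV suggests that models disagree strongly about it. How the paper combines the two signals to select data is not stated in the abstract.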

arXiv information

Authors	Xiaoyu Tian, Sitong Zhao, Haotian Wang, Shuaiting Chen, Yiping Peng, Yunjie Ji, Han Zhao, Xiangang Li
Published	2025-04-24 13:57:53+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
