Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning

Summary

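Recent advances in model distillation show that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models.
However, standard practice employs rejection sampling, which discards incorrect reasoning examples even though they are valuable and often underutilized data.
This paper addresses a key question: how can both positive and negative distilled reasoning traces be leveraged effectively to maximize LLM reasoning performance in an offline setting?
To this end, we propose Reinforcement Distillation (REDI), a two-stage framework.
Stage 1 learns from positive traces via supervised fine-tuning (SFT).
Stage 2 further refines the model using both positive and negative traces through the proposed REDI objective.
This novel objective is a simple, reference-free loss function that outperforms established methods such as DPO and SimPO in this distillation setting.
Our empirical evaluation shows that REDI outperforms both baseline rejection-sampling SFT and SFT combined with DPO/SimPO on mathematical reasoning tasks.
Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, scores 83.1% on MATH-500 (pass@1).
Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary examples) across various mathematical reasoning benchmarks, establishing a new state of the art for 1.5B models post-trained offline with openly available data.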
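
The summary describes the Stage 2 objective only as a simple, reference-free loss over positive and negative traces; it does not give the formula. The sketch below shows one plausible shape for such a loss, purely as an illustration: it assumes length-normalized log-likelihoods and a hypothetical weight `alpha` that down-weights the negative term. The function names, the normalization, and `alpha` are assumptions made here, not details taken from the text above.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-likelihood of one reasoning trace."""
    if not token_logprobs:
        raise ValueError("trace must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def reference_free_loss(
    pos_logprobs: Sequence[float],
    neg_logprobs: Sequence[float],
    alpha: float = 0.8,  # hypothetical weight on the negative term
) -> float:
    """Illustrative loss for one (positive, negative) pair of traces.

    Minimizing it raises the student's likelihood of the positive trace
    and lowers its likelihood of the negative trace. No frozen reference
    model appears anywhere, which is what "reference-free" means here.
    """
    return -mean_logprob(pos_logprobs) + alpha * mean_logprob(neg_logprobs)


if __name__ == "__main__":
    # Per-token log-probabilities under the student model (made-up values).
    positive_trace = [-1.0, -1.0]
    negative_trace = [-2.0, -2.0]
    print(reference_free_loss(positive_trace, negative_trace, alpha=0.5))
```

With `alpha = 0` the expression reduces to the ordinary SFT negative log-likelihood on the positive trace, so in this sketch the negative term acts as an optional extra signal on top of Stage 1 training.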

Summary (original)

Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practices employ rejection sampling, discarding incorrect reasoning examples, which are valuable yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT or SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary data) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.

arXiv information

Authors	Shuyao Xu, Cheng Peng, Jiangxuan Long, Weidi Xu, Wei Chu, Yuan Qi
Published	2025-05-30 17:47:17+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
