Through the Valley: Path to Effective Long CoT Training for Small Language Models

Abstract

Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. While effective for large models, we identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs; <=3B parameters) trained on limited long CoT data experience significant performance deterioration. Through extensive experiments on the Qwen2.5, LLaMA3 and Gemma3 families, we demonstrate that this degradation is widespread across SLMs. In some settings, models trained on only 8k long CoT examples lose up to 75% of the performance they had before fine-tuning. Strikingly, we further observe that for some particularly small models, even training on 220k long CoT examples fails to recover or surpass their performance prior to fine-tuning. Our analysis attributes this effect to error accumulation: while longer responses increase the capacity for multi-step reasoning, they also amplify the risk of compounding mistakes. Furthermore, we find that Long CoT Degradation may negatively impact downstream reinforcement learning (RL), although this can be alleviated by sufficiently scaled supervised fine-tuning (SFT). Our findings challenge common assumptions about the benefits of long CoT training for SLMs and offer practical guidance for building more effective small-scale reasoning models.
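The error-accumulation argument in the abstract can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes, purely for illustration, that every reasoning step is correct independently with the same probability, so a chain succeeds only if all of its steps do. The function name `chain_success_probability` and its parameters are hypothetical.

```python
def chain_success_probability(step_accuracy: float, num_steps: int) -> float:
    """Probability that an entire reasoning chain is correct.

    Toy assumption (not taken from the paper): each step is correct
    independently with probability `step_accuracy`, and one wrong step
    invalidates the whole chain.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return step_accuracy ** num_steps


if __name__ == "__main__":
    # Longer chains compound per-step mistakes: success shrinks geometrically.
    for steps in (10, 50, 100, 200):
        prob = chain_success_probability(0.99, steps)
        print(f"{steps:>3} steps at 99% per-step accuracy: {prob:.3f}")
```

Under this simplified model, even a small per-step error rate makes long chains fragile, which matches the intuition that longer responses amplify the risk of compounding mistakes. Real reasoning traces can self-correct, so the independence assumption should be read as a rough upper bound on how quickly errors compound.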

arXiv information

Authors	Renjie Luo, Jiaxi Li, Chen Huang, Wei Lu
Published	2025-06-09 12:56:41+00:00
arXiv page	arxiv_id (pdf)

Source and services used

arxiv.jp, Google
