UFT: Unifying Supervised and Reinforcement Fine-Tuning

Abstract

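Post-training has proven important for enhancing the reasoning capabilities of large language models (LLMs).
The main post-training methods fall into two categories: supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT).
SFT is efficient and well suited to small language models, but it can overfit and limit the reasoning abilities of larger models.
In contrast, RFT generally generalizes better but depends heavily on the strength of the base model.
To address the limitations of both, we propose Unified Fine-Tuning (UFT), a new post-training paradigm that combines SFT and RFT into a single, integrated process.
UFT lets the model explore solutions effectively while incorporating informative supervision signals, bridging the gap between memorizing and thinking that underlies existing methods.
Notably, UFT generally outperforms both SFT and RFT regardless of model size.
Furthermore, we prove theoretically that UFT breaks RFT's inherent exponential sample-complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.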

Abstract (original)

Post-training has demonstrated its importance in enhancing the reasoning capabilities of large language models (LLMs). The primary post-training methods can be categorized into supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT). SFT is efficient and well-suited for small language models, but it may lead to overfitting and limit the reasoning abilities of larger models. In contrast, RFT generally yields better generalization but depends heavily on the strength of the base model. To address the limitations of SFT and RFT, we propose Unified Fine-Tuning (UFT), a novel post-training paradigm that unifies SFT and RFT into a single, integrated process. UFT enables the model to effectively explore solutions while incorporating informative supervision signals, bridging the gap between memorizing and thinking underlying existing methods. Notably, UFT outperforms both SFT and RFT in general, regardless of model sizes. Furthermore, we theoretically prove that UFT breaks RFT’s inherent exponential sample complexity bottleneck, showing for the first time that unified training can exponentially accelerate convergence on long-horizon reasoning tasks.

arXiv information

Authors	Mingyang Liu, Gabriele Farina, Asuman Ozdaglar
Published	2025-05-22 17:53:57+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
