Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate

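Summary

Supervised fine-tuning (SFT) is commonly used to train language models to imitate annotated responses to given instructions.
In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy in which models learn to critique noisy responses rather than simply imitate correct ones.
Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding.
To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form ([query; noisy response], critique).
CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks across different base models such as Qwen2.5, Qwen2.5-Math, and DeepSeek-Math.
We further extend to the MetaMath and NuminaMath datasets and observe similar gains over SFT.
Notably, our model Qwen2.5-Math-CFT requires only 1 hour of training on 8xH100 over the 50K examples.
It can match or outperform strong competitors such as Qwen2.5-Math-Instruct, which uses over 2M samples, on most benchmarks.
Moreover, it can match the performance of SimpleRL, a DeepSeek-R1 replication trained with 140x more compute.
Ablation studies show that CFT is robust to the source of the noisy responses and to the choice of teacher critique model.
Through these findings, we argue that CFT offers a more effective alternative for advancing the reasoning of language models.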
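
The difference between the two training formats can be illustrated with a small sketch. The abstract only states that a CFT sample has the form ([query; noisy response], critique); the field names, the prompt template, and the toy record below are assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One annotated sample; all field names are hypothetical."""
    query: str            # the instruction or math problem
    noisy_response: str   # a candidate answer that may contain errors
    critique: str         # teacher-written critique of that answer
    reference: str = ""   # annotated answer, used only by the SFT baseline


def build_sft_example(rec: Record) -> dict:
    """SFT: the input is the query, the target is the annotated response."""
    return {"input": rec.query, "target": rec.reference}


def build_cft_example(rec: Record) -> dict:
    """CFT: the input is [query; noisy response], the target is the critique."""
    # The "Question:/Response:" template is an assumption for illustration.
    prompt = f"Question:\n{rec.query}\n\nResponse:\n{rec.noisy_response}"
    return {"input": prompt, "target": rec.critique}


# Toy record, invented for this example.
rec = Record(
    query="What is 2 + 3?",
    noisy_response="2 + 3 = 6",
    critique="The addition is wrong: 2 + 3 = 5, so the response is incorrect.",
    reference="2 + 3 = 5",
)

print(build_sft_example(rec))
print(build_cft_example(rec))
```

In both cases the loss would typically be computed on the target only; what changes is that the CFT target is a judgment about a possibly flawed answer rather than the answer itself.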

Summary (original)

Supervised Fine-Tuning (SFT) is commonly used to train language models to imitate annotated responses for given instructions. In this paper, we challenge this paradigm and propose Critique Fine-Tuning (CFT), a strategy where models learn to critique noisy responses rather than simply imitate correct ones. Inspired by human learning processes that emphasize critical thinking, CFT encourages deeper analysis and nuanced understanding, traits often overlooked by standard SFT. To validate the effectiveness of CFT, we construct a 50K-sample dataset from WebInstruct, using GPT-4o as the teacher to generate critiques in the form of ([query; noisy response], critique). CFT on this dataset yields a consistent 4-10% improvement over SFT on six math benchmarks with different base models like Qwen2.5, Qwen2.5-Math and DeepSeek-Math. We further expand to MetaMath and NuminaMath datasets and observe similar gains over SFT. Notably, our model Qwen2.5-Math-CFT only requires 1 hour of training on 8xH100 over the 50K examples. It can match or outperform strong competitors like Qwen2.5-Math-Instruct, which uses over 2M samples, on most benchmarks. Moreover, it can match the performance of SimpleRL, which is a DeepSeek-R1 replication trained with 140x more compute. Ablation studies show that CFT is robust to the source of noisy response and teacher critique model. Through these findings, we argue that CFT offers a more effective alternative to advance the reasoning of language models.

arXiv information

Authors	Yubo Wang, Xiang Yue, Wenhu Chen
Published	2025-01-30 17:58:54+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
