Incentivizing Reasoning from Weak Supervision

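Summary

Large language models (LLMs) perform impressively on reasoning-intensive tasks, but improving their reasoning ability typically depends on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) on high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive.
In this paper, we study a new problem: incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations or reinforcement learning.
We investigate whether the reasoning capabilities of LLMs can be effectively incentivized through supervision from significantly weaker models.
We also analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models.
Our findings show that supervision from significantly weaker reasoners can substantially improve the student's reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost.
Experiments across diverse benchmarks and model architectures show that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks.
Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong inference-time reasoning capabilities in LLMs.
The code is publicly available at https://github.com/yuanyige/W2SR.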

Abstract (original)

Large language models (LLMs) have demonstrated impressive performance on reasoning-intensive tasks, but enhancing their reasoning abilities typically relies on either reinforcement learning (RL) with verifiable signals or supervised fine-tuning (SFT) with high-quality long chain-of-thought (CoT) demonstrations, both of which are expensive. In this paper, we study a novel problem of incentivizing the reasoning capacity of LLMs without expensive high-quality demonstrations and reinforcement learning. We investigate whether the reasoning capabilities of LLMs can be effectively incentivized via supervision from significantly weaker models. We further analyze when and why such weak supervision succeeds in eliciting reasoning abilities in stronger models. Our findings show that supervision from significantly weaker reasoners can substantially improve student reasoning performance, recovering close to 94% of the gains of expensive RL at a fraction of the cost. Experiments across diverse benchmarks and model architectures demonstrate that weak reasoners can effectively incentivize reasoning in stronger student models, consistently improving performance across a wide range of reasoning tasks. Our results suggest that this simple weak-to-strong paradigm is a promising and generalizable alternative to costly methods for incentivizing strong reasoning capabilities at inference-time in LLMs. The code is publicly available at https://github.com/yuanyige/W2SR.

arXiv information

Authors	Yige Yuan, Teng Xiao, Shuchang Tao, Xue Wang, Jinyang Gao, Bolin Ding, Bingbing Xu
Published	2025-05-26 14:51:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
