s1: Simple test-time scaling

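Summary

Test-time scaling is a promising new approach to language modeling that spends extra compute at test time to improve performance. OpenAI's o1 model recently demonstrated this capability but did not disclose its methodology, which prompted many replication efforts. We look for the simplest approach that achieves both test-time scaling and strong reasoning performance. First, we curate s1K, a small dataset of 1,000 questions paired with reasoning traces, selected on three criteria that we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing, which controls test-time compute either by forcibly terminating the model's thinking process or by lengthening it, appending 'Wait' multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning of the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions (MATH and AIME24) by up to 27%. Scaling s1-32B with budget forcing also lets it extrapolate beyond its performance without test-time intervention, from 50% to 57% on AIME24. Our model, data, and code are open source at https://github.com/simplescaling/s1.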

Abstract (original)

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI’s o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model’s thinking process or lengthening it by appending ‘Wait’ multiple times to the model’s generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
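The budget forcing step described above can be illustrated with a minimal sketch. It assumes a token-level interface and is not the code from the s1 repository: `budget_force`, `next_token`, `toy_model`, and the `END_OF_THINKING` delimiter string are hypothetical names chosen for illustration. The toy model stands in for a real language model so that the control flow can be run on its own.

```python
from typing import Callable, List

# Hypothetical end-of-thinking delimiter; a real model would use its own token.
END_OF_THINKING = "<end_of_thinking>"


def budget_force(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_thinking_tokens: int,
    num_waits: int,
) -> List[str]:
    """Return the thinking tokens, always closed by END_OF_THINKING.

    - Lengthening: while waits remain, an attempt by the model to stop
      thinking is suppressed and "Wait" is appended instead.
    - Forced termination: once max_thinking_tokens is reached, generation
      stops and the delimiter is appended regardless of the model.
    """
    thinking: List[str] = []
    waits_left = num_waits
    while len(thinking) < max_thinking_tokens:
        token = next_token(prompt + thinking)
        if token == END_OF_THINKING:
            if waits_left == 0:
                break  # the model may stop on its own now
            waits_left -= 1
            token = "Wait"  # suppress the stop and nudge the model to continue
        thinking.append(token)
    return thinking + [END_OF_THINKING]


def toy_model(context: List[str]) -> str:
    """Stand-in model: emits two 'step' tokens after the prompt marker 'Q'
    or after each 'Wait', then tries to end its thinking."""
    since_restart = 0
    for token in reversed(context):
        if token in ("Wait", "Q"):
            break
        since_restart += 1
    return "step" if since_restart < 2 else END_OF_THINKING


if __name__ == "__main__":
    print(budget_force(toy_model, ["Q"], max_thinking_tokens=10, num_waits=1))
    print(budget_force(toy_model, ["Q"], max_thinking_tokens=1, num_waits=0))
```

In this sketch the two knobs map onto the two halves of the description: `num_waits` extends thinking past the point where the model would stop, and `max_thinking_tokens` caps it. With a real model, `next_token` would be replaced by a decoding call, which is outside what the abstract specifies.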

arXiv information

Authors	Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, Tatsunori Hashimoto
Published	2025-02-03 16:31:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
