START: Self-taught Reasoner with Tools

Summary

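Large reasoning models (LRMs) such as OpenAI-o1 and DeepSeek-R1 have shown remarkable ability on complex reasoning tasks by using long chain-of-thought (CoT).
However, because they rely solely on their internal reasoning process, these models often suffer from hallucinations and inefficiency.
This paper introduces START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly strengthens reasoning by leveraging external tools.
Through code execution, START can perform complex computations, check its own work, explore diverse methods, and debug itself, addressing these limitations of LRMs.
START's core innovation is a self-learning framework built on two key techniques. 1) Hint-infer: inserting artificially designed hints (e.g., "Wait, maybe using Python here is a good idea.") during an LRM's inference effectively stimulates its ability to use external tools, without requiring any demonstration data.
Hint-infer also serves as a simple and effective sequential test-time scaling method.
2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT. It scores, filters, and modifies the tool-invoking reasoning trajectories that the LRM generates via Hint-infer, and then fine-tunes the LRM on them.
Using this framework, the authors fine-tuned the QwQ-32B model to obtain START.
START achieves accuracy rates of 63.6% on PhD-level science QA (GPQA), 95.0%, 66.7%, and 47.1% on the competition-level math benchmarks AMC23, AIME24, and AIME25, and 47.3% on the competition-level code benchmark LiveCodeBench.
It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.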
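
The Hint-infer idea described above can be sketched as a small loop. This is only an illustration under stated assumptions, not the authors' implementation: the `generate` callable, the `<python>` and `<output>` tags, the `toy_model` stand-in, and the rule of inserting the hint once after the first draft are all hypothetical. Only the hint text itself is quoted from the abstract.

```python
import contextlib
import io
import re
from typing import Callable

# Example hint quoted in the abstract.
HINT = "Wait, maybe using Python here is a good idea."
# Hypothetical tag format for code emitted by the model (an assumption).
CODE_RE = re.compile(r"<python>(.*?)</python>", re.DOTALL)


def run_python(code: str) -> str:
    """Run code and return captured stdout. No sandboxing: illustration only."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"
    return buf.getvalue().strip()


def hint_infer(prompt: str, generate: Callable[[str], str], max_hints: int = 1) -> str:
    """Let the model draft, append a hint, and feed any tool output back in."""
    text = prompt + generate(prompt)
    for _ in range(max_hints):
        text += "\n" + HINT + "\n"
        continuation = generate(text)
        text += continuation
        match = CODE_RE.search(continuation)
        if match:
            # Execute the first code block and let the model read the result.
            text += "\n<output>" + run_python(match.group(1)) + "</output>\n"
            text += generate(text)
    return text


def toy_model(context: str) -> str:
    """Hypothetical stand-in for an LRM, scripted for this one example."""
    tail = context.rstrip()
    if tail.endswith("</output>"):
        result = re.findall(r"<output>(.*?)</output>", context, re.DOTALL)[-1]
        return f"The tool reports {result}."
    if tail.endswith(HINT):
        return "<python>print(sum(range(1, 101)))</python>"
    return "I think the sum of 1..100 is about 5000."


trace = hint_infer("Q: What is 1 + 2 + ... + 100?\n", toy_model)
print(trace)
```

Raising `max_hints` makes the loop insert the hint again after each round, which is one plausible reading of how hint insertion could act as sequential test-time scaling.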

Abstract (original)

Large reasoning models (LRMs) like OpenAI-o1 and DeepSeek-R1 have demonstrated remarkable capabilities in complex reasoning tasks through the utilization of long Chain-of-thought (CoT). However, these models often suffer from hallucinations and inefficiencies due to their reliance solely on internal reasoning processes. In this paper, we introduce START (Self-Taught Reasoner with Tools), a novel tool-integrated long CoT reasoning LLM that significantly enhances reasoning capabilities by leveraging external tools. Through code execution, START is capable of performing complex computations, self-checking, exploring diverse methods, and self-debugging, thereby addressing the limitations of LRMs. The core innovation of START lies in its self-learning framework, which comprises two key techniques: 1) Hint-infer: We demonstrate that inserting artificially designed hints (e.g., “Wait, maybe using Python here is a good idea.”) during the inference process of a LRM effectively stimulates its ability to utilize external tools without the need for any demonstration data. Hint-infer can also serve as a simple and effective sequential test-time scaling method; 2) Hint Rejection Sampling Fine-Tuning (Hint-RFT): Hint-RFT combines Hint-infer and RFT by scoring, filtering, and modifying the reasoning trajectories with tool invocation generated by a LRM via Hint-infer, followed by fine-tuning the LRM. Through this framework, we have fine-tuned the QwQ-32B model to achieve START. On PhD-level science QA (GPQA), competition-level math benchmarks (AMC23, AIME24, AIME25), and the competition-level code benchmark (LiveCodeBench), START achieves accuracy rates of 63.6%, 95.0%, 66.7%, 47.1%, and 47.3%, respectively. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B and the proprietary model o1-Preview.
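
The data-selection step of Hint-RFT (score, filter, modify, then fine-tune) might look roughly like the sketch below. The abstract does not give the actual scoring rule, filter criteria, or modifications, so everything here is an assumption made for illustration: the `Trajectory` type, exact-match scoring, the `<python>` tag used to detect tool use, and the blank-line cleanup in `modify`. The fine-tuning step itself is not shown.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trajectory:
    """Hypothetical container for one sampled reasoning trajectory."""
    question: str
    text: str
    answer: str


def score(traj: Trajectory, reference: str) -> float:
    # Assumed scoring rule: exact match against a reference answer.
    return 1.0 if traj.answer.strip() == reference.strip() else 0.0


def uses_tool(traj: Trajectory) -> bool:
    # Assumed marker for a tool invocation (hypothetical tag).
    return "<python>" in traj.text


def modify(traj: Trajectory) -> Trajectory:
    # Placeholder edit: drop blank lines. The real modifications are not
    # described in the abstract.
    cleaned = "\n".join(line for line in traj.text.splitlines() if line.strip())
    return Trajectory(traj.question, cleaned, traj.answer)


def build_rft_dataset(
    samples: List[Trajectory],
    references: Dict[str, str],
    threshold: float = 1.0,
) -> List[Trajectory]:
    """Keep tool-using trajectories whose score passes the threshold."""
    kept = []
    for traj in samples:
        if uses_tool(traj) and score(traj, references[traj.question]) >= threshold:
            kept.append(modify(traj))
    return kept


references = {"q1": "5050"}
samples = [
    Trajectory("q1", "draft\n\n<python>print(5050)</python>\n", "5050"),
    Trajectory("q1", "<python>print(5000)</python>", "5000"),  # wrong answer
    Trajectory("q1", "no tool was used here", "5050"),  # no tool call
]
dataset = build_rft_dataset(samples, references)
print(len(dataset))
```

The surviving trajectories would then serve as supervised fine-tuning data for the same model that produced them.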

arXiv information

Authors	Chengpeng Li, Mingfeng Xue, Zhenru Zhang, Jiaxi Yang, Beichen Zhang, Xiang Wang, Bowen Yu, Binyuan Hui, Junyang Lin, Dayiheng Liu
Published	2025-03-07 18:13:22+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
