AdaptThink: Reasoning Models Can Learn When to Think

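Summary

Recently, large reasoning models have achieved impressive performance on a variety of tasks by employing human-like deep thinking.
However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck.
In this work, the authors first show that NoThinking, which prompts the reasoning model to skip thinking and generate the final solution directly, is the better choice for relatively simple tasks in terms of both performance and efficiency.
Motivated by this, they propose AdaptThink, a novel RL algorithm that teaches reasoning models to choose the optimal thinking mode adaptively, based on problem difficulty.
Specifically, AdaptThink has two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining overall performance;
(2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, which enables cold start and lets the model explore and exploit both thinking modes throughout training.
The experiments indicate that AdaptThink significantly reduces inference cost while further improving performance.
Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for balancing reasoning quality and efficiency.
The code and models are available at https://github.com/THU-KEG/AdaptThink.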
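
The constrained objective in component (1) can be illustrated with a small sketch. This is not the authors' implementation: it assumes one common way to relax such a constraint, namely a per-sample reward that adds a small bonus for skipping thinking to the accuracy margin over a reference model. The function name `adaptive_reward`, the bonus weight `delta`, and the argument `ref_accuracy` are hypothetical names introduced here for illustration.

```python
def adaptive_reward(skipped_thinking: bool,
                    correct: bool,
                    ref_accuracy: float,
                    delta: float = 0.05) -> float:
    """Hypothetical penalty-style reward for one sampled response.

    skipped_thinking: True if the response used the NoThinking mode.
    correct:          True if the final answer was correct.
    ref_accuracy:     assumed accuracy of a reference model on this problem.
    delta:            assumed bonus weight that favors NoThinking.
    """
    # Small bonus for skipping thinking (the "encourage NoThinking" part).
    bonus = delta if skipped_thinking else 0.0
    # Accuracy margin over the reference (the "maintain performance" part).
    margin = (1.0 if correct else 0.0) - ref_accuracy
    return bonus + margin


# Easy problem: both modes answer correctly, so skipping thinking scores higher.
easy_skip = adaptive_reward(skipped_thinking=True, correct=True, ref_accuracy=0.8)
easy_think = adaptive_reward(skipped_thinking=False, correct=True, ref_accuracy=0.8)

# Hard problem: skipping thinking gives a wrong answer, so thinking scores higher.
hard_skip = adaptive_reward(skipped_thinking=True, correct=False, ref_accuracy=0.8)
hard_think = adaptive_reward(skipped_thinking=False, correct=True, ref_accuracy=0.8)
```

Under these assumptions the bonus only decides between modes when accuracy is equal. Once skipping thinking costs correctness, the accuracy margin dominates as long as `delta` is small.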

Abstract (original)

Recently, large reasoning models have achieved impressive performance on various tasks by employing human-like deep thinking. However, the lengthy thinking process substantially increases inference overhead, making efficiency a critical bottleneck. In this work, we first demonstrate that NoThinking, which prompts the reasoning model to skip thinking and directly generate the final solution, is a better choice for relatively simple tasks in terms of both performance and efficiency. Motivated by this, we propose AdaptThink, a novel RL algorithm to teach reasoning models to choose the optimal thinking mode adaptively based on problem difficulty. Specifically, AdaptThink features two core components: (1) a constrained optimization objective that encourages the model to choose NoThinking while maintaining the overall performance; (2) an importance sampling strategy that balances Thinking and NoThinking samples during on-policy training, thereby enabling cold start and allowing the model to explore and exploit both thinking modes throughout the training process. Our experiments indicate that AdaptThink significantly reduces the inference costs while further enhancing performance. Notably, on three math datasets, AdaptThink reduces the average response length of DeepSeek-R1-Distill-Qwen-1.5B by 53% and improves its accuracy by 2.4%, highlighting the promise of adaptive thinking-mode selection for optimizing the balance between reasoning quality and efficiency. Our codes and models are available at https://github.com/THU-KEG/AdaptThink.
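
The balancing idea in component (2) can be sketched as follows. The abstract does not specify the sampling distribution, so this sketch assumes a uniform proposal over the two modes, with a standard importance weight to correct for the mismatch with the current policy. The function `sample_mode` and the mode labels `"think"` and `"no_think"` are hypothetical.

```python
import random
from typing import Tuple


def sample_mode(p_no_think_policy: float, rng: random.Random) -> Tuple[str, float]:
    """Draw a thinking mode from an assumed 50/50 proposal.

    p_no_think_policy: probability the current policy itself assigns to NoThinking.
    Returns the sampled mode and its importance weight policy_prob / proposal_prob.
    """
    proposal_prob = 0.5  # assumed uniform proposal over the two modes
    mode = "no_think" if rng.random() < proposal_prob else "think"
    policy_prob = p_no_think_policy if mode == "no_think" else 1.0 - p_no_think_policy
    return mode, policy_prob / proposal_prob


# Cold start: the policy almost never skips thinking on its own (1% here),
# yet the proposal still yields NoThinking samples about half of the time.
rng = random.Random(0)
draws = [sample_mode(0.01, rng) for _ in range(1000)]
n_no_think = sum(1 for mode, _ in draws if mode == "no_think")
```

Sampling directly from a policy that nearly always thinks would produce almost no NoThinking responses to learn from. The forced balance supplies them from the first step, and the importance weights keep the resulting estimates consistent with the policy being trained.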

arXiv information

Authors	Jiajie Zhang, Nianyi Lin, Lei Hou, Ling Feng, Juanzi Li
Published	2025-05-19 17:50:52+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
