Thought-Augmented Policy Optimization: Bridging External Guidance and Internal Capabilities

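Summary

Reinforcement learning (RL) has emerged as an effective method for training reasoning models.
However, existing RL approaches typically bias the model's output distribution toward reward-maximizing paths without introducing any external knowledge.
This limits their capacity for exploration and leaves them with a narrower reasoning capability boundary than the base models.
To address this limitation, the authors propose TAPO (Thought-Augmented Policy Optimization), a new framework that augments RL by incorporating external high-level guidance ("thought patterns").
By adaptively integrating structured thoughts during training, TAPO balances model-internal exploration against the exploitation of external guidance.
Extensive experiments show that the approach significantly outperforms GRPO: by 99% on AIME, 41% on AMC, and 17% on Minerva Math.
Notably, these high-level thought patterns are abstracted from only 500 prior samples, yet they generalize effectively across a variety of tasks and models.
This highlights TAPO's potential for broader application across multiple tasks and domains.
Further analysis shows that introducing external guidance yields strong reasoning models whose reasoning behavior is more explainable and whose outputs are more readable.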

Summary (original)

Reinforcement learning (RL) has emerged as an effective method for training reasoning models. However, existing RL approaches typically bias the model’s output distribution toward reward-maximizing paths without introducing external knowledge. This limits their exploration capacity and results in a narrower reasoning capability boundary compared to base models. To address this limitation, we propose TAPO (Thought-Augmented Policy Optimization), a novel framework that augments RL by incorporating external high-level guidance (‘thought patterns’). By adaptively integrating structured thoughts during training, TAPO effectively balances model-internal exploration and external guidance exploitation. Extensive experiments show that our approach significantly outperforms GRPO by 99% on AIME, 41% on AMC, and 17% on Minerva Math. Notably, these high-level thought patterns, abstracted from only 500 prior samples, generalize effectively across various tasks and models. This highlights TAPO’s potential for broader applications across multiple tasks and domains. Our further analysis reveals that introducing external guidance produces powerful reasoning models with superior explainability of inference behavior and enhanced output readability.
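For readers unfamiliar with the GRPO baseline named in the abstract, the following is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each sampled answer's reward is normalized against the other answers drawn for the same prompt. It illustrates the baseline only and is not TAPO itself; the abstract does not specify how thought patterns enter the objective. The function name, the `eps` constant, the example rewards, and the use of the population standard deviation are assumptions made for illustration, and implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per sampled answer to the same prompt.
    `eps` (an assumed constant) avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one prompt (1.0 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive a positive advantage and incorrect ones a negative advantage. When every answer in the group earns the same reward, all advantages are zero and the group contributes no learning signal, which is one way to see why exploration limited to the model's own output distribution can stall.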

arXiv information

Authors	Jinyang Wu, Chonghua Liao, Mingkuan Feng, Shuai Zhang, Zhengqi Wen, Pengpeng Shao, Huazhe Xu, Jianhua Tao
Published	2025-05-21 16:06:10+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
