Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

Abstract

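Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem in which the model must generate intermediate reasoning steps.
Prior approaches such as iterative reward-ranked fine-tuning (RAFT) rely on this formulation, but they typically apply a uniform inference budget across prompts.
This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation caused by static sampling strategies.
We propose GVM-RAFT, a prompt-specific dynamic sample allocation strategy designed to minimize stochastic gradient variance under a computational budget constraint.
The method allocates computational resources dynamically by monitoring prompt acceptance rates and stochastic gradient norms, so that the resulting gradient variance is minimized.
Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions.
Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT.
The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy.
Our code is available at https://github.com/RLHFlow/GVM.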

Abstract (original)

Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
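The abstract does not state the allocation rule itself, so the following is only a minimal sketch of the general idea of variance-aware sample allocation, not the authors' method. It assumes a hypothetical per-prompt variance proxy v_i = g_i^2 / p_i, where g_i is a gradient-norm estimate and p_i an acceptance-rate estimate. Under that assumption, minimizing sum_i v_i / n_i subject to sum_i n_i = N gives n_i proportional to sqrt(v_i) = g_i / sqrt(p_i). The function name `allocate_samples` and all of its parameters are illustrative.

```python
import math


def allocate_samples(grad_norms, accept_rates, budget, min_samples=1, eps=1e-6):
    """Split `budget` samples across prompts, favoring high-variance prompts.

    Assumption (not from the paper): the variance proxy for prompt i is
    v_i = g_i**2 / p_i, so the variance-minimizing share is proportional
    to sqrt(v_i) = g_i / sqrt(p_i).
    """
    if len(grad_norms) != len(accept_rates):
        raise ValueError("grad_norms and accept_rates must have equal length")
    k = len(grad_norms)
    if budget < k * min_samples:
        raise ValueError("budget too small for the requested min_samples")

    # Clamp acceptance rates away from zero to avoid division by zero.
    weights = [g / math.sqrt(max(p, eps)) for g, p in zip(grad_norms, accept_rates)]
    total = sum(weights)
    spare = budget - k * min_samples

    # Real-valued shares of the spare budget; uniform if all weights are zero.
    if total == 0:
        shares = [spare / k] * k
    else:
        shares = [spare * w / total for w in weights]

    # Round down, then hand leftover samples to the largest fractional parts.
    alloc = [min_samples + math.floor(s) for s in shares]
    leftover = budget - sum(alloc)
    order = sorted(range(k), key=lambda i: shares[i] - math.floor(shares[i]), reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # A prompt with a low acceptance rate receives more samples than an easy one.
    print(allocate_samples([1.0, 1.0], [0.25, 1.0], budget=12))
```

With equal gradient norms, the prompt with acceptance rate 0.25 receives twice the weight of the prompt with acceptance rate 1.0, whereas a uniform scheme would give both the same number of samples regardless of difficulty.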

arXiv information

Authors	Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
Published	2025-05-05 06:26:00+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
