Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Summary

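Reinforcement learning (RL) is a key component of post-training for large language models (LLMs).
However, the on-policy algorithms currently used for post-training are inherently incompatible with experience replay buffers, which distributed off-policy actors can fill scalably to strengthen exploration as compute grows.
The paper proposes obtaining this benefit of replay buffers efficiently through Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system.
Unlike existing approaches, TBA spends a larger share of compute on search, continuously generating off-policy data for a central replay buffer.
A training node concurrently samples data from this buffer, by reward or recency, and updates the policy with Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets.
TBA offers three key advantages: (1) decoupled training and search, which speeds up training wall-clock time by 4x or more;
(2) improved diversity through large-scale off-policy sampling; and
(3) scalable search in sparse-reward settings.
On mathematical reasoning, preference tuning, and automated red-teaming (diverse and representative post-training tasks), TBA improves both speed and performance over strong baselines.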
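
To make the buffer-sampling idea concrete, here is a minimal sketch of a central replay buffer that a trainer could sample from by reward or by recency. The summary does not specify the paper's actual data structures or sampling rule, so everything here is an assumption for illustration: the names `Trajectory` and `ReplayBuffer`, the fixed capacity, and the choice of weights proportional to exp(reward) in the reward mode.

```python
import math
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str      # generated response (hypothetical field)
    reward: float  # scalar reward assigned to the response
    step: int      # actor step at which it was generated


class ReplayBuffer:
    """Fixed-capacity buffer; the oldest trajectories are evicted first."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, traj):
        self.items.append(traj)

    def sample(self, k, mode="reward"):
        items = list(self.items)
        if not items:
            return []
        if mode == "recency":
            # The k most recently added trajectories, oldest first.
            return items[-k:]
        # Reward mode (assumed rule): sample with replacement, with
        # weights proportional to exp(reward). Subtracting the maximum
        # reward keeps exp() from overflowing.
        max_r = max(t.reward for t in items)
        weights = [math.exp(t.reward - max_r) for t in items]
        return self.rng.choices(items, weights=weights, k=k)


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Stand-in for off-policy actors pushing generations into the buffer.
    buffer.add(Trajectory(text=f"sample-{step}", reward=float(step), step=step))

recent = buffer.sample(2, mode="recency")
by_reward = buffer.sample(4, mode="reward")
```

In a real asynchronous system the actors and the trainer would run in separate processes and the buffer would need to be shared safely between them; this single-process sketch only illustrates the two sampling modes.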

Summary (original)

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, existing on-policy algorithms used for post-training are inherently incompatible with the use of experience replay buffers, which can be populated scalably by distributed off-policy actors to enhance exploration as compute increases. We propose efficiently obtaining this benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA uses a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer based on reward or recency to update the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, speeding up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
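
The abstract names the Trajectory Balance objective but does not write it out. As a rough illustration, the sketch below uses the standard TB form from the GFlowNet literature, specialized to autoregressive generation, where each sequence has a single construction path and the backward-policy term drops out: the loss is the squared gap between log Z + log π(y|x) and log R(y). The function name `tb_loss` and its arguments are hypothetical, and the exact variant used in the paper may differ.

```python
import math


def tb_loss(log_z, token_logprobs, log_reward):
    """Squared trajectory-balance residual for one sampled sequence.

    log_z          -- current estimate of the log partition function
    token_logprobs -- per-token log-probabilities under the policy
    log_reward     -- log of the (positive) reward for the sequence
    """
    log_pf = sum(token_logprobs)  # log-probability of the whole sequence
    return (log_z + log_pf - log_reward) ** 2


# A policy that already samples in proportion to reward (with Z = 1)
# has zero loss: the sequence probability 0.25 equals its reward 0.25.
balanced = tb_loss(0.0, [math.log(0.5), math.log(0.5)], math.log(0.25))

# A wrong log Z estimate produces a positive loss.
unbalanced = tb_loss(1.0, [0.0], 0.0)
```

Because the residual depends only on the sampled sequence, its log-probability under the current policy, and its reward, the loss can be evaluated on data that a different policy generated, which is what makes the objective a natural fit for a replay buffer filled by off-policy actors.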

arXiv information

Authors	Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
Published	2025-03-24 17:51:39+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google

