WARP: On the Benefits of Weight Averaged Rewarded Policies

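Summary

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by using a reward model trained on human preferences to encourage generations that receive high rewards.
To keep the model from forgetting pre-trained knowledge, RLHF usually includes KL regularization.
This keeps the policy close to its supervised fine-tuned initialization, at the cost of hindering reward optimization.
To address the trade-off between KL and reward, the paper introduces a new alignment strategy called Weight Averaged Rewarded Policies (WARP).
WARP merges policies in weight space at three distinct stages.
First, it uses the exponential moving average of the policy as a dynamic anchor for the KL regularization.
Second, it applies spherical interpolation to merge independently fine-tuned policies into a new, enhanced policy.
Third, it linearly interpolates between this merged model and the initialization to recover features from pre-training.
The procedure is then applied iteratively: the final model of each iteration serves as an advanced initialization for the next, which progressively refines the KL-reward Pareto front and achieves higher rewards at a fixed KL.
Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.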
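
The three weight-space operations named above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: weights are treated as flat 1-D arrays, the function names and the parameters `mu`, `lam`, and `eta` are hypothetical, and the spherical interpolation is assumed to act on each policy's difference from the shared initialization.

```python
import numpy as np

def ema_update(anchor, policy, mu=0.01):
    """Stage 1 (sketch): move the KL anchor a fraction mu toward the policy."""
    return (1.0 - mu) * anchor + mu * policy

def slerp(init, theta_a, theta_b, lam=0.5, eps=1e-8):
    """Stage 2 (sketch): spherically interpolate two fine-tuned policies.

    Assumption: interpolation is applied to the differences from the
    shared initialization, then added back onto that initialization.
    """
    delta_a = theta_a - init
    delta_b = theta_b - init
    norms = np.linalg.norm(delta_a) * np.linalg.norm(delta_b)
    cos_omega = np.dot(delta_a, delta_b) / (norms + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    sin_omega = np.sin(omega)
    if sin_omega < eps:
        # Nearly parallel (or opposite) directions: fall back to linear mixing.
        return init + (1.0 - lam) * delta_a + lam * delta_b
    coef_a = np.sin((1.0 - lam) * omega) / sin_omega
    coef_b = np.sin(lam * omega) / sin_omega
    return init + coef_a * delta_a + coef_b * delta_b

def interpolate_towards_init(init, merged, eta=0.5):
    """Stage 3 (sketch): linear interpolation between init and merged model."""
    return (1.0 - eta) * init + eta * merged

# Toy usage with 2-D "weights"; real policies have billions of parameters.
init = np.zeros(2)
merged = slerp(init, np.array([1.0, 0.0]), np.array([0.0, 1.0]), lam=0.5)
next_init = interpolate_towards_init(init, merged, eta=0.5)
```

In this sketch, `next_init` plays the role of the advanced initialization for the next iteration. Smaller values of `eta` stay closer to the initialization (lower KL), while larger values keep more of the merged, reward-optimized weights.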

Abstract (original)

Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration’s final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
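
For readers who prefer notation, the KL-regularized objective described in the abstract is commonly written as below. This is the standard RLHF formulation with the anchor swapped for the exponential moving average of the policy, offered as a sketch rather than a formula quoted from the paper. The symbols $\pi_\theta$ (policy), $r$ (reward model), $\beta$ (regularization strength), $\mathcal{D}$ (prompt distribution), and $\theta_{\text{ema}}$ (anchor weights) are assumed names.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[
  r(x, y) \;-\; \beta \,
  \log \frac{\pi_\theta(y \mid x)}{\pi_{\theta_{\text{ema}}}(y \mid x)}
\right]
```

A larger $\beta$ keeps the policy closer to the anchor and limits reward gains; a smaller $\beta$ does the opposite. This is the KL-reward trade-off the abstract refers to.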

arXiv information

Authors	Alexandre Ramé, Johan Ferret, Nino Vieillard, Robert Dadashi, Léonard Hussenot, Pierre-Louis Cedoz, Pier Giuseppe Sessa, Sertan Girgin, Arthur Douillard, Olivier Bachem
Published	2024-06-24 16:24:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
