Vanishing Gradients in Reinforcement Finetuning of Language Models

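Summary

Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which maximizes a (possibly learned) reward function using policy gradient algorithms.
This work highlights a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when the standard deviation of its reward under the model is small, even if the expected reward is far from optimal.
Through experiments on an RFT benchmark and in controlled environments, together with a theoretical analysis, we then show that vanishing gradients caused by small reward standard deviation are prevalent and detrimental, making reward maximization extremely slow.
Finally, we explore ways to overcome vanishing gradients in RFT.
We find that the common practice of an initial supervised finetuning (SFT) phase is the most promising candidate, which sheds light on its importance in an RFT pipeline.
Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute or data labeling effort.
Overall, our results emphasize that paying attention to inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for running RFT successfully.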
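
As a toy illustration of the vanishing-gradient claim (not the paper's theorem or experimental setup), assume the policy is a plain softmax over three candidate outputs, parameterized directly by its logits. In that simplified setting the gradient of the expected reward with respect to logit i is p_i (r_i - E[r]), so its Euclidean norm is at most the reward standard deviation. The function names, logits, and rewards below are illustrative assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reward_stats(logits, rewards):
    """Mean and standard deviation of the reward under the softmax policy."""
    p = softmax(logits)
    mean = sum(pi * ri for pi, ri in zip(p, rewards))
    var = sum(pi * (ri - mean) ** 2 for pi, ri in zip(p, rewards))
    return mean, math.sqrt(var)


def expected_reward_gradient(logits, rewards):
    """Gradient of the expected reward w.r.t. the logits: p_i * (r_i - E[r])."""
    p = softmax(logits)
    mean = sum(pi * ri for pi, ri in zip(p, rewards))
    return [pi * (ri - mean) for pi, ri in zip(p, rewards)]


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


# Hypothetical rewards: only the third output is good (maximum reward 1.0).
rewards = [0.0, 0.0, 1.0]

# A policy concentrated on a zero-reward output vs. a uniform policy.
peaked_logits = [10.0, 0.0, 0.0]
uniform_logits = [0.0, 0.0, 0.0]

peaked_mean, peaked_std = reward_stats(peaked_logits, rewards)
peaked_grad_norm = l2_norm(expected_reward_gradient(peaked_logits, rewards))

uniform_mean, uniform_std = reward_stats(uniform_logits, rewards)
uniform_grad_norm = l2_norm(expected_reward_gradient(uniform_logits, rewards))

print("peaked :", peaked_mean, peaked_std, peaked_grad_norm)
print("uniform:", uniform_mean, uniform_std, uniform_grad_norm)
```

In the peaked case the expected reward is close to 0 while the best achievable reward is 1, yet both the reward standard deviation and the gradient norm are tiny. The uniform policy has a larger standard deviation and a correspondingly larger gradient.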

Abstract (original)

Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which entails maximizing a (possibly learned) reward function using policy gradient algorithms. This work highlights a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model is small, even if the expected reward is far from optimal. Through experiments on an RFT benchmark and controlled environments, as well as a theoretical analysis, we then demonstrate that vanishing gradients due to small reward standard deviation are prevalent and detrimental, leading to extremely slow reward maximization. Lastly, we explore ways to overcome vanishing gradients in RFT. We find the common practice of an initial supervised finetuning (SFT) phase to be the most promising candidate, which sheds light on its importance in an RFT pipeline. Moreover, we show that a relatively small number of SFT optimization steps on as few as 1% of the input samples can suffice, indicating that the initial SFT phase need not be expensive in terms of compute and data labeling efforts. Overall, our results emphasize that being mindful for inputs whose expected gradient vanishes, as measured by the reward standard deviation, is crucial for successful execution of RFT.
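
The abstract suggests using the reward standard deviation to spot inputs whose expected gradient vanishes. One way this could be sketched is a Monte Carlo estimate per input: sample several outputs from the model, score them, and flag inputs whose reward standard deviation falls below a threshold. Everything named here is an assumption for illustration: `sample_output` and `reward_fn` stand in for a real model and reward function, and the threshold and sample count are arbitrary, not values from the paper.

```python
import itertools
import statistics


def estimate_reward_std(sample_output, reward_fn, x, num_samples=32):
    """Monte Carlo estimate of the reward standard deviation for input x."""
    rewards = [reward_fn(x, sample_output(x)) for _ in range(num_samples)]
    return statistics.pstdev(rewards)


def flag_low_std_inputs(inputs, sample_output, reward_fn,
                        threshold=0.05, num_samples=32):
    """Return the inputs whose estimated reward std is below the threshold."""
    return [
        x for x in inputs
        if estimate_reward_std(sample_output, reward_fn, x, num_samples) < threshold
    ]


# Toy stand-ins (hypothetical): the "stuck" input always yields the same bad
# output, while the "mixed" input alternates between a good and a bad output.
_alternating = itertools.cycle(["good", "bad"])


def toy_sample_output(x):
    if x == "stuck":
        return "bad"
    return next(_alternating)


def toy_reward_fn(x, y):
    return 1.0 if y == "good" else 0.0


flagged = flag_low_std_inputs(["stuck", "mixed"], toy_sample_output, toy_reward_fn)
print(flagged)
```

Under these assumptions, flagged inputs would be candidates for the kind of initial SFT phase the abstract describes, since RFT alone is expected to make slow progress on them.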

arXiv information

Authors	Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin
Published	2023-10-31 17:59:05+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
