Aligning Language Models with Offline Reinforcement Learning from Human Feedback

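Summary

Learning from human preferences is crucial for language models (LMs) to serve human needs and societal values effectively.
Previous research has made notable progress by using human feedback to train models to follow instructions.
However, these approaches rely primarily on online reinforcement learning (RL) methods such as Proximal Policy Optimization (PPO), which have proven unstable and difficult to tune for language models.
Moreover, PPO requires a complex distributed system implementation, which hinders the efficiency of large-scale distributed training.
In this study, we propose an offline reinforcement learning from human feedback (RLHF) framework that aligns LMs using pre-generated samples, without interacting with an RL environment.
Specifically, we explore maximum likelihood estimation (MLE) with filtering, reward-weighted regression (RWR), and Decision Transformer (DT) as ways to align language models with human preferences.
Because our methods use a loss function similar to that of supervised fine-tuning, they train more stably than PPO while needing only a simple machine learning system (MLSys) and far fewer (around 12.3%) computing resources.
Experimental results show that DT alignment outperforms the other offline RLHF methods and is better than PPO.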
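
The abstract names three offline objectives but does not define them. The sketch below illustrates how each is commonly formulated, not the authors' implementation. The `Sample` fields, the `threshold` and `beta` parameters, and the `<reward: ...>` text prefix are all hypothetical names chosen for illustration, and `log_prob` stands in for the summed token log-probabilities that a real model would compute.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    response: str
    reward: float    # score from a reward model (hypothetical values)
    log_prob: float  # summed token log-probs under the policy (stand-in value)


def mle_filtered_loss(samples, threshold):
    """MLE with filtering: average negative log-likelihood over the
    samples whose reward is at least `threshold`; the rest are dropped."""
    kept = [s for s in samples if s.reward >= threshold]
    if not kept:
        return 0.0
    return -sum(s.log_prob for s in kept) / len(kept)


def rwr_loss(samples, beta=1.0):
    """Reward-weighted regression: negative log-likelihood where each
    sample is weighted by exp(reward / beta), with weights normalized."""
    max_r = max(s.reward for s in samples)  # subtract max for numerical stability
    weights = [math.exp((s.reward - max_r) / beta) for s in samples]
    total = sum(weights)
    return -sum(w * s.log_prob for w, s in zip(weights, samples)) / total


def dt_training_text(sample):
    """Decision-Transformer-style conditioning: put the reward in front of
    the prompt so the model learns responses conditioned on a target reward.
    The prefix format here is an assumption, not taken from the paper."""
    return f"<reward: {sample.reward:.1f}> {sample.prompt} {sample.response}"


samples = [
    Sample("q", "good", reward=1.0, log_prob=-2.0),
    Sample("q", "bad", reward=0.0, log_prob=-4.0),
]
print(mle_filtered_loss(samples, threshold=0.5))
print(rwr_loss(samples, beta=1.0))
print(dt_training_text(samples[0]))
```

All three reduce to a (weighted or conditioned) log-likelihood over fixed samples, which is why the abstract describes the loss as similar to supervised fine-tuning.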

Summary (original)

Learning from human preferences is crucial for language models (LMs) to effectively cater to human needs and societal values. Previous research has made notable progress by leveraging human feedback to follow instructions. However, these approaches rely primarily on online reinforcement learning (RL) techniques like Proximal Policy Optimization (PPO), which have been proven unstable and challenging to tune for language models. Moreover, PPO requires complex distributed system implementation, hindering the efficiency of large-scale distributed training. In this study, we propose an offline reinforcement learning from human feedback (RLHF) framework to align LMs using pre-generated samples without interacting with RL environments. Specifically, we explore maximum likelihood estimation (MLE) with filtering, reward-weighted regression (RWR), and Decision Transformer (DT) to align language models to human preferences. By employing a loss function similar to supervised fine-tuning, our methods ensure more stable model training than PPO with a simple machine learning system (MLSys) and much fewer (around 12.3%) computing resources. Experimental results demonstrate the DT alignment outperforms other Offline RLHF methods and is better than PPO.

arXiv information

Authors	Jian Hu, Li Tao, June Yang, Chandler Zhou
Published	2023-08-23 10:41:07+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
