RRHF: Rank Responses to Align Language Models with Human Feedback without tears

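Summary

Title: RRHF: Rank Responses to Align Language Models with Human Feedback without tears

Summary:
– Reinforcement Learning from Human Feedback (RLHF) aligns large language models with human preferences and improves the quality of interactions between humans and these models.
– InstructGPT implements RLHF in several stages: Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and needs at least four models in its standard implementation, which makes it hard to train.
– In response, the authors propose RRHF (Rank Responses to Align Language Models with Human Feedback), a new learning paradigm that scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss.
– RRHF efficiently aligns language model output probabilities with human preferences, is as robust as fine-tuning, and needs only 1 to 2 models during tuning. It can be seen as an extension of SFT and reward models, and it is simpler than PPO in terms of coding, model count, and hyperparameters.
– The entire alignment process can be completed within a single RRHF training session.
– RRHF is evaluated using LLaMA and Alpaca on the Helpful and Harmless data and shows performance comparable to PPO.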
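
The summary mentions a ranking loss over scored responses but does not spell it out. The sketch below is one plausible reading, assuming each response is scored by its length-normalized log-probability under the model, that the ranking term is a pairwise hinge with no margin, and that an SFT-style term is added on the highest-reward response. These details are assumptions that go beyond the abstract, and the names `length_normalized_logprob` and `rrhf_style_loss` are hypothetical, not taken from any released code.

```python
def length_normalized_logprob(token_logprobs):
    """Score a response as the mean of its per-token log-probabilities (assumed form)."""
    return sum(token_logprobs) / len(token_logprobs)


def rrhf_style_loss(token_logprobs_per_response, rewards):
    """Return (total, rank_loss, sft_loss) for one prompt with several candidate responses.

    token_logprobs_per_response: list of lists of per-token log-probs under the model.
    rewards: one scalar per response, e.g. from a reward model or human ranking.
    """
    scores = [length_normalized_logprob(lp) for lp in token_logprobs_per_response]
    n = len(scores)

    # Ranking term: penalize every pair where the lower-reward response
    # gets a higher model score than the higher-reward one.
    rank_loss = 0.0
    for i in range(n):
        for j in range(n):
            if rewards[i] < rewards[j]:
                rank_loss += max(0.0, scores[i] - scores[j])

    # SFT-style term (assumption): negative log-likelihood of the best-rewarded response.
    best = max(range(n), key=lambda k: rewards[k])
    sft_loss = -sum(token_logprobs_per_response[best])

    return rank_loss + sft_loss, rank_loss, sft_loss


# Toy example: the model prefers response A, but the reward prefers response B.
logprobs = [[-0.5, -0.5], [-1.0, -1.0, -1.0]]  # A, B
rewards = [0.2, 0.9]
total, rank, sft = rrhf_style_loss(logprobs, rewards)
print(total, rank, sft)  # 3.5 0.5 3.0
```

In this toy case the ranking term is nonzero because the model scores A above B while the reward says the opposite. If the rewards are swapped, the ranking term drops to zero and only the SFT-style term remains. In this reading, the model being trained supplies the scores itself and an optional second model supplies the rewards, which would match the 1 to 2 models the summary mentions.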

Abstract (original)

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and these models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and requires a minimum of four models in its standard implementation, which makes it hard to train. In contrast, we propose a novel learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through ranking loss. RRHF can efficiently align language model output probabilities with human preferences as robust as fine-tuning and it only needs 1 to 2 models during tuning. In addition, RRHF can be considered an extension of SFT and reward models while being simpler than PPO in terms of coding, model counts, and hyperparameters. The entire alignment process can be accomplished within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca on Helpful and Harmless data, demonstrating performance comparable to PPO.

arXiv information

Authors	Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang
Published	2023-04-11 15:53:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
