RRHF: Rank Responses to Align Language Models with Human Feedback without tears

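Summary

Reinforcement Learning from Human Feedback (RLHF) makes it easier to align large language models with human preferences and substantially improves the quality of interactions between humans and these models.
InstructGPT implements RLHF in several stages: supervised fine-tuning (SFT), reward model training, and Proximal Policy Optimization (PPO).
However, PPO is sensitive to hyperparameters, and its standard implementation requires at least four models, which makes it hard to train.
In contrast, the authors propose a new learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through a ranking loss.
RRHF can align a language model's output probabilities with human preferences efficiently and as robustly as fine-tuning, and it needs only 1 to 2 models during tuning.
RRHF can also be viewed as an extension of SFT and reward models, while being simpler than PPO in terms of coding, model count, and hyperparameters.
The entire alignment process can be completed within a single RRHF training session.
The authors evaluate RRHF with LLaMA and Alpaca on Helpful and Harmless data and report performance comparable to PPO.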

Abstract (original)

Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models with human preferences, significantly enhancing the quality of interactions between humans and these models. InstructGPT implements RLHF through several stages, including Supervised Fine-Tuning (SFT), reward model training, and Proximal Policy Optimization (PPO). PPO, however, is sensitive to hyperparameters and requires a minimum of four models in its standard implementation, which makes it hard to train. In contrast, we propose a novel learning paradigm called RRHF, which scores responses generated by different sampling policies and learns to align them with human preferences through ranking loss. RRHF can efficiently align language model output probabilities with human preferences as robust as fine-tuning and it only needs 1 to 2 models during tuning. In addition, RRHF can be considered an extension of SFT and reward models while being simpler than PPO in terms of coding, model counts, and hyperparameters. The entire alignment process can be accomplished within a single RRHF training session. We evaluate RRHF using LLaMA and Alpaca on Helpful and Harmless data, demonstrating performance comparable to PPO.
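
The ranking step described in the abstract can be illustrated with a small sketch. The abstract does not spell out the loss, so the details here are assumptions: each response is scored by its length-normalized token log-probability under the model, a hinge-style pairwise term penalizes any lower-reward response that scores above a higher-reward one, and a fine-tuning term is added on the best-reward response. The function names (`sequence_score`, `ranking_loss`, `sft_loss`, `rrhf_style_loss`) and the toy numbers are hypothetical, and plain Python floats stand in for real model outputs.

```python
from typing import List, Sequence


def sequence_score(token_logprobs: Sequence[float]) -> float:
    """Length-normalized log-probability of one response (assumed scoring rule)."""
    if not token_logprobs:
        raise ValueError("a response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def ranking_loss(scores: Sequence[float], rewards: Sequence[float]) -> float:
    """Hinge-style pairwise loss over all response pairs (assumed form).

    A pair contributes only when the lower-reward response i has a higher
    model score than the higher-reward response j.
    """
    if len(scores) != len(rewards):
        raise ValueError("scores and rewards must have the same length")
    loss = 0.0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if rewards[i] < rewards[j]:
                loss += max(0.0, scores[i] - scores[j])
    return loss


def sft_loss(token_logprobs: Sequence[Sequence[float]],
             rewards: Sequence[float]) -> float:
    """Negative log-likelihood of the highest-reward response (assumed SFT term)."""
    best = max(range(len(rewards)), key=lambda k: rewards[k])
    return -sum(token_logprobs[best])


def rrhf_style_loss(token_logprobs: Sequence[Sequence[float]],
                    rewards: Sequence[float]) -> float:
    """Sum of the ranking term and the SFT term for one prompt."""
    scores: List[float] = [sequence_score(lp) for lp in token_logprobs]
    return ranking_loss(scores, rewards) + sft_loss(token_logprobs, rewards)


if __name__ == "__main__":
    # Toy per-token log-probabilities for three candidate responses to one
    # prompt, with made-up reward scores for each response.
    logprobs = [[-0.5, -0.5], [-1.0, -3.0], [-0.25, -0.25, -0.25, -0.25]]
    rewards = [0.2, 0.9, 0.5]
    print(rrhf_style_loss(logprobs, rewards))
```

In this sketch the only trainable model is the one that produces the token log-probabilities; the rewards are treated as fixed inputs, which is one way to read the abstract's statement that only 1 to 2 models are needed during tuning.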

arXiv information

Authors	Zheng Yuan, Hongyi Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, Fei Huang
Published	2023-05-22 17:27:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
