Preference Ranking Optimization for Human Alignment

Summary

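Large language models (LLMs) often produce misleading content, which underscores the need to align them with human values in order to build safe AI systems.
Reinforcement learning from human feedback (RLHF) has been used to achieve this alignment.
However, it has two main drawbacks. (1) Compared with supervised fine-tuning (SFT), RLHF is complex, unstable, and sensitive to hyperparameters.
(2) Despite extensive trial and error, its multiple samples are reduced to pairwise contrasts, so it lacks contrast from a macro perspective.
This paper proposes Preference Ranking Optimization (PRO), an efficient SFT algorithm for directly fine-tuning LLMs for human alignment.
PRO extends the pairwise contrast to handle preference rankings of any length.
By iteratively contrasting candidates, PRO teaches the LLM to prioritize the best response while progressively ranking the remaining ones.
In this way, PRO recasts human alignment as aligning the probability ranking of n responses generated by the LLM with the human preference ranking over those responses.
Experiments show that PRO outperforms baseline algorithms and achieves results comparable to ChatGPT and human responses under automatic, reward-based, GPT-4, and human evaluations.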
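
The iterative contrast described above can be illustrated with a small sketch. The abstract does not state the exact objective, so the code below assumes a list-wise softmax form: at step k, the k-th ranked response is contrasted against every response ranked below it. The function name `ranking_loss` and the use of one plain score per response (for example, a length-normalized log-probability under the model) are assumptions for illustration, not the authors' implementation.

```python
import math


def ranking_loss(scores):
    """Hypothetical list-wise ranking loss (illustrative only).

    `scores` holds one model score per candidate response, ordered from
    the most preferred to the least preferred according to humans.
    """
    loss = 0.0
    # Step k contrasts the k-th response against itself and all
    # lower-ranked responses, then drops it and moves to the next one.
    for k in range(len(scores) - 1):
        remaining = scores[k:]
        # Numerically stable log-sum-exp over the remaining candidates.
        peak = max(remaining)
        log_denominator = peak + math.log(
            sum(math.exp(s - peak) for s in remaining)
        )
        # Negative log of the softmax probability of the k-th response.
        loss -= scores[k] - log_denominator
    return loss


if __name__ == "__main__":
    aligned = ranking_loss([-0.5, -1.0, -2.0])     # scores follow the human order
    misaligned = ranking_loss([-2.0, -1.0, -0.5])  # scores reverse the human order
    print(f"aligned={aligned:.4f} misaligned={misaligned:.4f}")
```

With only two responses the loop runs once and the expression reduces to a single pairwise contrast, which is the sense in which a ranking of any length generalizes the pairwise case. Scores that agree with the human order give a lower loss than scores that reverse it.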

Abstract (original)

Large language models (LLMs) often contain misleading content, emphasizing the need to align them with human values to ensure secure AI systems. Reinforcement learning from human feedback (RLHF) has been employed to achieve this alignment. However, it encompasses two main drawbacks: (1) RLHF exhibits complexity, instability, and sensitivity to hyperparameters in contrast to SFT. (2) Despite massive trial-and-error, multiple sampling is reduced to pair-wise contrast, thus lacking contrasts from a macro perspective. In this paper, we propose Preference Ranking Optimization (PRO) as an efficient SFT algorithm to directly fine-tune LLMs for human alignment. PRO extends the pair-wise contrast to accommodate preference rankings of any length. By iteratively contrasting candidates, PRO instructs the LLM to prioritize the best response while progressively ranking the rest responses. In this manner, PRO effectively transforms human alignment into aligning the probability ranking of n responses generated by LLM with the preference ranking of humans towards these responses. Experiments have shown that PRO outperforms baseline algorithms, achieving comparable results to ChatGPT and human responses through automatic-based, reward-based, GPT-4, and human evaluations.

arXiv information

Authors	Feifan Song, Bowen Yu, Minghao Li, Haiyang Yu, Fei Huang, Yongbin Li, Houfeng Wang
Published	2024-02-27 18:42:42+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
