Safe RLHF: Safe Reinforcement Learning from Human Feedback

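Abstract

As large language models (LLMs) advance, balancing the performance and safety of AI systems has become more important than ever.
However, the inherent tension between the helpfulness and harmlessness objectives poses a significant challenge during LLM training.
To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a new algorithm for human value alignment.
Safe RLHF explicitly decouples human preferences about helpfulness from those about harmlessness, which avoids confusing crowdworkers with the tension between the two and makes it possible to train separate reward and cost models.
We formalize the safety concern of LLMs as an optimization task: maximize the reward function while satisfying specified cost constraints.
By solving this constrained problem with the Lagrangian method, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning.
Over three rounds of fine-tuning with Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while improving model performance, compared with existing value-aligned algorithms.
Experimentally, we fine-tuned Alpaca-7B with Safe RLHF to align it with the collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.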
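
The constrained formulation and its Lagrangian relaxation can be written out roughly as follows. The abstract gives no notation, so every symbol here is an assumption made for illustration: $\theta$ for the policy parameters, $J_R$ and $J_C$ for the expected reward and expected cost, $d$ for the cost threshold, and $\lambda$ for the Lagrange multiplier.

```latex
% Constrained problem: maximize reward subject to a cost limit.
\max_{\theta} \; J_R(\theta)
\quad \text{subject to} \quad J_C(\theta) \le d

% Lagrangian relaxation with multiplier \lambda \ge 0.
\min_{\lambda \ge 0} \; \max_{\theta} \;
\bigl[\, J_R(\theta) - \lambda \,\bigl( J_C(\theta) - d \bigr) \bigr]
```

Under this reading, $\lambda$ grows while the cost constraint is violated and shrinks toward zero while it is satisfied, which is one way to interpret the "dynamic adjustment" between the two objectives.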

Abstract (original)

With the development of large language models (LLMs), striking a balance between the performance and safety of AI systems has never been more critical. However, the inherent tension between the objectives of helpfulness and harmlessness presents a significant challenge during LLM training. To address this issue, we propose Safe Reinforcement Learning from Human Feedback (Safe RLHF), a novel algorithm for human value alignment. Safe RLHF explicitly decouples human preferences regarding helpfulness and harmlessness, effectively avoiding the crowdworkers’ confusion about the tension and allowing us to train separate reward and cost models. We formalize the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints. Leveraging the Lagrangian method to solve this constrained problem, Safe RLHF dynamically adjusts the balance between the two objectives during fine-tuning. Through a three-round fine-tuning using Safe RLHF, we demonstrate a superior ability to mitigate harmful responses while enhancing model performance compared to existing value-aligned algorithms. Experimentally, we fine-tuned the Alpaca-7B using Safe RLHF and aligned it with collected human preferences, significantly improving its helpfulness and harmlessness according to human evaluations.
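
To make the idea of dynamically balancing two objectives with a Lagrange multiplier concrete, here is a minimal sketch on a toy scalar problem. It is not the authors' training code: the functions `reward` and `cost`, the threshold `cost_limit`, and the step sizes are all hypothetical stand-ins, and a single number `theta` takes the place of the model parameters.

```python
def reward(theta: float) -> float:
    """Toy 'helpfulness' objective (assumed), maximized at theta = 2."""
    return -(theta - 2.0) ** 2


def cost(theta: float) -> float:
    """Toy 'harmfulness' cost (assumed), increasing in theta."""
    return theta


def solve_toy_constrained(steps: int = 2000,
                          lr_theta: float = 0.05,
                          lr_lambda: float = 0.05,
                          cost_limit: float = 1.0) -> tuple[float, float]:
    """Maximize reward(theta) subject to cost(theta) <= cost_limit.

    Uses gradient ascent on theta for the Lagrangian
        L(theta, lam) = reward(theta) - lam * (cost(theta) - cost_limit)
    and projected gradient ascent on the multiplier lam >= 0.
    """
    theta, lam = 0.0, 0.0
    for _ in range(steps):
        # dL/dtheta = reward'(theta) - lam * cost'(theta)
        grad_theta = -2.0 * (theta - 2.0) - lam * 1.0
        theta += lr_theta * grad_theta
        # The multiplier rises while the constraint is violated and
        # decays (clipped at zero) while it is satisfied.
        lam = max(0.0, lam + lr_lambda * (cost(theta) - cost_limit))
    return theta, lam


if __name__ == "__main__":
    theta, lam = solve_toy_constrained()
    print(f"theta = {theta:.4f}, lambda = {lam:.4f}")
```

In this toy setting the unconstrained optimum (theta = 2) violates the cost limit, so the multiplier increases until the solution settles on the constraint boundary. The penalty weight is therefore learned during optimization rather than fixed in advance.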

arXiv information

Authors	Josef Dai,Xuehai Pan,Ruiyang Sun,Jiaming Ji,Xinbo Xu,Mickel Liu,Yizhou Wang,Yaodong Yang
Published	2023-10-19 14:22:03+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
