Learning Reward and Policy Jointly from Demonstration and Preference Improves Alignment

Summary

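Aligning models with human preferences and values is a key requirement for building modern foundation models and embodied AI.
However, popular approaches such as reinforcement learning from human feedback (RLHF) split the task into sequential stages, such as supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning (RL), each of which performs one specific learning task.
Such a sequential approach causes serious problems, including significant under-utilization of data and a distribution mismatch between the learned reward model and the generated policy, which ultimately lead to poor alignment performance.
We develop a single-stage approach, Alignment with Integrated Human Feedback (AIHF), which integrates both human preferences and demonstrations to train the reward model and the policy.
The proposed approach admits a suite of efficient algorithms that readily reduce to, and build on, popular alignment algorithms such as RLHF and Direct Preference Optimization (DPO), and it requires only minor changes to existing alignment pipelines.
Extensive experiments on LLM alignment problems and MuJoCo robotic control problems demonstrate the efficiency of the proposed solutions.
The proposed solutions outperform existing alignment algorithms such as RLHF and DPO by large margins, especially when the amount of high-quality preference data is relatively limited.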

Abstract (original)

Aligning human preference and value is an important requirement for building contemporary foundation models and embodied AI. However, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into successive stages, such as supervised fine-tuning (SFT), reward modeling (RM), and reinforcement learning (RL), each performing one specific learning task. Such a sequential approach results in serious issues such as significant under-utilization of data and distribution mismatch between the learned reward model and generated policy, which eventually lead to poor alignment performance. We develop a single-stage approach named Alignment with Integrated Human Feedback (AIHF), capable of integrating both human preference and demonstration to train reward models and the policy. The proposed approach admits a suite of efficient algorithms, which can easily reduce to, and leverage, popular alignment algorithms such as RLHF and Direct Preference Optimization (DPO), and only requires minor changes to the existing alignment pipelines. We demonstrate the efficiency of the proposed solutions with extensive experiments involving alignment problems in LLMs and robotic control problems in MuJoCo. We observe that the proposed solutions outperform the existing alignment algorithms such as RLHF and DPO by large margins, especially when the amount of high-quality preference data is relatively limited.

arXiv information

Authors	Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
Published	2024-11-29 23:41:57+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
