Group Robust Preference Optimization in Reward-free RLHF

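Summary

Adapting large language models (LLMs) to specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data.
These data often come from diverse groups of labelers (for example, different demographics, ethnicities, or company teams), yet traditional RLHF approaches are "one-size-fits-all": they indiscriminately assume and optimize a single preference model.
As a result, they are not robust to the unique characteristics and needs of the various groups.
To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method that robustly aligns LLMs to the preferences of individual groups.
Our approach builds on reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy that maximizes worst-case group performance.
To achieve this, GRPO adaptively and sequentially weights the importance of the different groups, prioritizing groups with worse cumulative loss.
We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class.
By fine-tuning LLMs with GRPO on diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.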
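
The adaptive group weighting described above can be illustrated with a small sketch. This is not the authors' algorithm as published; it is a generic exponentiated-gradient (multiplicative weights) update, assumed here because it is a standard way to upweight groups with worse cumulative loss. The function names, the step size, and the per-group loss values are all hypothetical; in practice each group loss would come from a reward-free preference loss evaluated on that group's data.

```python
import math

def update_group_weights(weights, group_losses, step_size=0.5):
    """One multiplicative-weights step (illustrative assumption, not the paper's exact rule).

    Each weight is scaled by exp(step_size * loss) and the result is renormalized,
    so groups with higher loss receive more weight in the next round.
    """
    scaled = [w * math.exp(step_size * loss) for w, loss in zip(weights, group_losses)]
    total = sum(scaled)
    return [s / total for s in scaled]

def robust_loss(weights, group_losses):
    """Weighted sum of group losses that the policy update would minimize."""
    return sum(w * loss for w, loss in zip(weights, group_losses))

# Hypothetical per-group losses over three training rounds (three groups).
losses_per_round = [
    [0.4, 0.9, 0.5],
    [0.4, 0.8, 0.5],
    [0.3, 0.8, 0.5],
]

weights = [1 / 3, 1 / 3, 1 / 3]  # start from uniform group weights
for losses in losses_per_round:
    weights = update_group_weights(weights, losses)

print(weights)
print(robust_loss(weights, losses_per_round[-1]))
```

Because the update is multiplicative, after several rounds each weight is proportional to the exponential of that group's cumulative loss, which is one way to realize "prioritizing groups with worse cumulative loss". Placing all weight on the worst group would recover the worst-case group loss.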

Summary (original)

Adapting large language models (LLMs) for specific tasks usually involves fine-tuning through reinforcement learning with human feedback (RLHF) on preference data. While these data often come from diverse labelers’ groups (e.g., different demographics, ethnicities, company teams, etc.), traditional RLHF approaches adopt a ‘one-size-fits-all’ approach, i.e., they indiscriminately assume and optimize a single preference model, thus not being robust to unique characteristics and needs of the various groups. To address this limitation, we propose a novel Group Robust Preference Optimization (GRPO) method to align LLMs to individual groups’ preferences robustly. Our approach builds upon reward-free direct preference optimization methods, but unlike previous approaches, it seeks a robust policy which maximizes the worst-case group performance. To achieve this, GRPO adaptively and sequentially weights the importance of different groups, prioritizing groups with worse cumulative loss. We theoretically study the feasibility of GRPO and analyze its convergence for the log-linear policy class. By fine-tuning LLMs with GRPO using diverse group-based global opinion data, we significantly improved performance for the worst-performing groups, reduced loss imbalances across groups, and improved probability accuracies compared to non-robust baselines.

arXiv information

Authors	Shyam Sundhar Ramesh, Yifan Hu, Iason Chaimalas, Viraj Mehta, Pier Giuseppe Sessa, Haitham Bou Ammar, Ilija Bogunovic
Published	2024-05-30 17:50:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
