Improving Generalization of Alignment with Human Preferences through Group Invariant Learning

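Summary

The success of AI assistants based on language models (LLMs) depends heavily on Reinforcement Learning from Human Feedback (RLHF), which enables them to generate responses that are better aligned with human preferences.
As general-purpose AI assistants, they are increasingly expected to perform consistently across a wide range of domains.
However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to obtain high rewards and overlooks challenging samples.
This focus on quick reward gains undermines both training stability and the model's ability to generalize to new, unseen data.
In this work, we propose a novel approach that learns a consistent policy via RL across different data groups or domains.
Because group annotations are difficult to obtain, our method automatically classifies data into different groups so as to deliberately maximize the performance variance.
We then optimize the policy to perform well on the challenging groups.
Finally, using the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data.
Experimental results show that our approach significantly improves training stability and model generalization.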

Summary (original)

The success of AI assistants based on language models (LLMs) hinges crucially on Reinforcement Learning from Human Feedback (RLHF), which enables the generation of responses more aligned with human preferences. As universal AI assistants, there’s a growing expectation for them to perform consistently across various domains. However, previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples. This focus on quick reward gains undermines both the stability in training and the model’s ability to generalize to new, unseen data. In this work, we propose a novel approach that can learn a consistent policy via RL across various data groups or domains. Given the challenges associated with acquiring group annotations, our method automatically classifies data into different groups, deliberately maximizing performance variance. Then, we optimize the policy to perform well on challenging groups. Lastly, leveraging the established groups, our approach adaptively adjusts the exploration space, allocating more learning capacity to more challenging data and preventing the model from over-optimizing on simpler data. Experimental results indicate that our approach significantly enhances training stability and model generalization.
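The three steps named in the abstract (infer groups, favor the challenging group, loosen the constraint on challenging data) can be illustrated with a toy sketch. This is not the authors' implementation. The threshold split on scalar rewards, the `hard_weight` mix, and the per-sample KL coefficients are assumptions chosen only to make the idea concrete, and all function names are hypothetical.

```python
from statistics import fmean


def split_by_between_group_variance(rewards):
    """Split sample indices into (hard, easy) groups.

    Assumed stand-in for automatic group inference: try every threshold on
    the sorted rewards and keep the split with the largest between-group
    variance of the mean reward.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two samples to form two groups")
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    n = len(order)
    best_score, best_k = -1.0, 1
    for k in range(1, n):  # k = size of the low-reward (hard) group
        low = fmean([rewards[i] for i in order[:k]])
        high = fmean([rewards[i] for i in order[k:]])
        score = (k / n) * ((n - k) / n) * (high - low) ** 2
        if score > best_score:
            best_score, best_k = score, k
    return order[:best_k], order[best_k:]


def group_weighted_objective(rewards, hard, easy, hard_weight=0.8):
    """Mean reward that emphasizes the hard group.

    hard_weight is a hypothetical knob; 0.5 treats both groups equally.
    """
    hard_mean = fmean([rewards[i] for i in hard])
    easy_mean = fmean([rewards[i] for i in easy])
    return hard_weight * hard_mean + (1.0 - hard_weight) * easy_mean


def per_sample_kl_coefficients(n, hard, base=0.1, hard_scale=0.5):
    """Assumed reading of 'adjusting the exploration space': a smaller KL
    penalty (more freedom to explore) for hard samples, the base penalty
    for easy ones."""
    hard_set = set(hard)
    return [base * hard_scale if i in hard_set else base for i in range(n)]


if __name__ == "__main__":
    rewards = [0.1, 0.2, 0.9, 1.0, 0.95]
    hard, easy = split_by_between_group_variance(rewards)
    print("hard:", sorted(hard), "easy:", sorted(easy))
    print("objective:", group_weighted_objective(rewards, hard, easy))
    print("kl:", per_sample_kl_coefficients(len(rewards), hard))
```

In an actual RLHF loop the rewards would come from a reward model and the coefficients would enter the policy loss. Here they are plain lists so the grouping logic can be inspected in isolation.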

arXiv information

Authors	Rui Zheng, Wei Shen, Yuan Hua, Wenbin Lai, Shihan Dou, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Haoran Huang, Tao Gui, Qi Zhang, Xuanjing Huang
Published	2023-10-19 03:14:23+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
