Mitigating Reward Hacking via Information-Theoretic Reward Modeling

Summary

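Although reinforcement learning from human feedback (RLHF) has succeeded in aligning language models with human values, reward hacking, also called reward overoptimization, remains a serious challenge. It stems mainly from limitations in reward modeling, namely the limited generalizability of the reward model and inconsistencies in the preference dataset. This work approaches the problem from an information-theoretic perspective and proposes InfoRM, a generalizable and robust framework for reward modeling that introduces a variational information bottleneck objective to filter out irrelevant information and develops a mechanism for modulating model complexity. The work further identifies a correlation between overoptimization and outliers in the latent space, which establishes InfoRM as a promising tool for detecting reward overoptimization. Motivated by this finding, it proposes the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the latent space, as an indicator of reward overoptimization that can support the development of online mitigation strategies. Extensive experiments across a wide range of settings and model scales (70M, 440M, 1.4B, and 7B) support the effectiveness of InfoRM. Further analysis shows that InfoRM's overoptimization detection mechanism is effective, potentially marking a notable advance for RLHF. The code will be released upon acceptance.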
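
As a rough illustration of the variational information bottleneck idea described in the summary, the sketch below combines a pairwise preference loss with a KL penalty on a Gaussian latent. It is not the authors' implementation. The Bradley-Terry preference loss, the diagonal-Gaussian encoder, the standard normal prior, the `beta` weight, and all function and field names are assumptions made for illustration only.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) for one latent vector."""
    return 0.5 * sum(
        math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var)
    )


def preference_nll(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_chosen - r_rejected)."""
    margin = reward_chosen - reward_rejected
    # softplus(-margin), arranged so exp() never receives a large positive value
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def vib_reward_loss(chosen, rejected, beta=0.1):
    """Preference loss plus a beta-weighted compression (KL) term.

    `chosen` and `rejected` are dicts with keys 'reward', 'mu', 'log_var'.
    This layout is hypothetical and chosen only to keep the sketch small.
    """
    kl = kl_to_standard_normal(chosen["mu"], chosen["log_var"]) + \
        kl_to_standard_normal(rejected["mu"], rejected["log_var"])
    return preference_nll(chosen["reward"], rejected["reward"]) + beta * kl


if __name__ == "__main__":
    chosen = {"reward": 1.2, "mu": [0.3, -0.1], "log_var": [0.0, -0.5]}
    rejected = {"reward": 0.4, "mu": [0.0, 0.2], "log_var": [0.1, 0.0]}
    print(vib_reward_loss(chosen, rejected, beta=0.1))
```

In this sketch a larger `beta` penalizes latents that carry more information than the prior allows, which is one plausible reading of the complexity modulation the summary mentions. The paper should be consulted for the actual objective.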

Abstract (original)

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge, which primarily stems from limitations in reward modeling, i.e., generalizability of the reward model and inconsistency in the preference dataset. In this work, we tackle this problem from an information-theoretic perspective, and propose a generalizable and robust framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information and developing a mechanism for model complexity modulation. Notably, we further identify a correlation between overoptimization and outliers in the latent space, establishing InfoRM as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Integrated Cluster Deviation Score (ICDS), which quantifies deviations in the latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments on a wide range of settings and model scales (70M, 440M, 1.4B, and 7B) support the effectiveness of InfoRM. Further analyses reveal that InfoRM’s overoptimization detection mechanism is effective, potentially signifying a notable advancement in the field of RLHF. Code will be released upon acceptance.

arXiv information

Authors	Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
Published	2024-02-15 09:21:26+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
