InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling

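Summary

Reinforcement learning from human feedback (RLHF) has succeeded in aligning language models with human values, yet reward hacking, also called reward overoptimization, remains a critical challenge. The problem stems mainly from reward misgeneralization, in which the reward model (RM) computes rewards from spurious features that are unrelated to human preferences. This work approaches the problem from an information-theoretic perspective and proposes InfoRM, a reward modeling framework that introduces a variational information bottleneck (IB) objective to filter out irrelevant information. The authors further identify a correlation between overoptimization and outliers in InfoRM's IB latent space, which makes the latent space a promising tool for detecting reward overoptimization. Motivated by this finding, they propose the Cluster Separation Index (CSI), which quantifies deviations in the IB latent space, as an indicator of reward overoptimization intended to support the development of online mitigation strategies. Extensive experiments across a wide range of settings and RM scales (70M, 440M, 1.4B, and 7B) demonstrate the effectiveness of InfoRM. Further analysis shows that InfoRM's overoptimization detection mechanism is both effective and robust across a broad range of datasets. The code will be released upon acceptance.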

Summary (original)

Despite the success of reinforcement learning from human feedback (RLHF) in aligning language models with human values, reward hacking, also termed reward overoptimization, remains a critical challenge. This issue primarily arises from reward misgeneralization, where reward models (RMs) compute reward using spurious features that are irrelevant to human preferences. In this work, we tackle this problem from an information-theoretic perspective and propose a framework for reward modeling, namely InfoRM, by introducing a variational information bottleneck objective to filter out irrelevant information. Notably, we further identify a correlation between overoptimization and outliers in the IB latent space of InfoRM, establishing it as a promising tool for detecting reward overoptimization. Inspired by this finding, we propose the Cluster Separation Index (CSI), which quantifies deviations in the IB latent space, as an indicator of reward overoptimization to facilitate the development of online mitigation strategies. Extensive experiments on a wide range of settings and RM scales (70M, 440M, 1.4B, and 7B) demonstrate the effectiveness of InfoRM. Further analyses reveal that InfoRM’s overoptimization detection mechanism is not only effective but also robust across a broad range of datasets, signifying a notable advancement in the field of RLHF. The code will be released upon acceptance.
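
The abstract names a variational information bottleneck objective but does not give its form. As a rough illustration only, a generic loss of that kind for a pairwise reward model might look like the sketch below. The Gaussian encoder output, the standard normal prior, the linear reward head, the Bradley-Terry preference term, the `beta` weight, and every function name are assumptions made for this example; none of them are taken from the paper.

```python
import math
import random


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


def preference_nll(reward_chosen, reward_rejected):
    """Bradley-Terry negative log-likelihood: -log sigmoid(r_c - r_r)."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def sample_latent(mu, log_var, rng=None):
    """Reparameterized sample z = mu + sigma * eps; rng=None returns the mean."""
    if rng is None:
        return list(mu)
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def vib_reward_loss(chosen, rejected, head_weights, beta=0.1, rng=None):
    """Preference loss plus a beta-weighted compression (KL) penalty.

    chosen / rejected: (mu, log_var) pairs from a hypothetical encoder.
    head_weights: weights of a hypothetical linear reward head on the latent.
    """
    def reward(z):
        return sum(w * v for w, v in zip(head_weights, z))

    (mu_c, lv_c), (mu_r, lv_r) = chosen, rejected
    z_c = sample_latent(mu_c, lv_c, rng)
    z_r = sample_latent(mu_r, lv_r, rng)
    nll = preference_nll(reward(z_c), reward(z_r))
    kl = kl_to_standard_normal(mu_c, lv_c) + kl_to_standard_normal(mu_r, lv_r)
    return nll + beta * kl
```

In this sketch, `beta` trades preference fit against how much information the latent retains: `beta=0` reduces to a plain pairwise preference loss, while larger values push the latent toward the prior.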

arXiv information

Authors	Yuchun Miao, Sen Zhang, Liang Ding, Rong Bao, Lefei Zhang, Dacheng Tao
Published	2024-11-01 06:30:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
