Generalizing Reward Modeling for Out-of-Distribution Preference Learning

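Summary

Preference learning (PL) with large language models (LLMs) aims to align the LLMs' generations with human preferences.
Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results for in-distribution PL.
However, because human feedback is difficult to obtain, training a separate reward model for every encountered distribution is challenging.
Out-of-distribution (OOD) PL is therefore practically useful for improving the generalization ability of LLMs with limited preference feedback.
This work addresses OOD PL by optimizing a general reward model through a meta-learning approach.
During meta-training, a bilevel optimization algorithm is used to learn a reward model that can guide policy learning toward human preferences across various distributions.
When a test distribution is encountered, the meta-test procedure performs regularized policy optimization using the learned reward model for PL.
Under reasonable assumptions, the convergence rate of the bilevel optimization algorithm is demonstrated theoretically.
In addition, experiments on two text generation tasks across 20 held-out domains outperform a variety of strong baselines across various evaluation metrics.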

Summary (original)

Preference learning (PL) with large language models (LLMs) aims to align the LLMs’ generations with human preferences. Previous work on reinforcement learning from human feedback (RLHF) has demonstrated promising results in in-distribution PL. However, due to the difficulty of obtaining human feedback, discretely training reward models for every encountered distribution is challenging. Thus, out-of-distribution (OOD) PL is practically useful for enhancing the generalization ability of LLMs with limited preference feedback. This work addresses OOD PL by optimizing a general reward model through a meta-learning approach. During meta-training, a bilevel optimization algorithm is utilized to learn a reward model capable of guiding policy learning to align with human preferences across various distributions. When encountering a test distribution, the meta-test procedure conducts regularized policy optimization using the learned reward model for PL. We theoretically demonstrate the convergence rate of the bilevel optimization algorithm under reasonable assumptions. Additionally, we conduct experiments on two text generation tasks across 20 held-out domains and outperform a variety of strong baselines across various evaluation metrics.
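The abstract does not say which regularizer the meta-test procedure uses. As a minimal sketch, assume a KL penalty toward a reference policy, which is a common choice in RLHF but is an assumption here and not the paper's stated method. Over a finite set of candidate responses, the objective E_pi[r] - beta * KL(pi || pi_ref) has the closed-form maximizer pi(y) ∝ pi_ref(y) * exp(r(y) / beta). The function names, reward scores, and reference probabilities below are hypothetical.

```python
import math


def kl_regularized_policy(ref_probs, rewards, beta):
    """Maximizer of E_pi[r] - beta * KL(pi || pi_ref) over a finite candidate set.

    Assumes every entry of ref_probs is strictly positive and beta > 0.
    Closed form: pi(y) is proportional to pi_ref(y) * exp(r(y) / beta).
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    # Work in log space and subtract the max for numerical stability.
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def objective(probs, ref_probs, rewards, beta):
    """Expected reward minus beta times KL(probs || ref_probs)."""
    expected_reward = sum(p * r for p, r in zip(probs, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, ref_probs) if p > 0)
    return expected_reward - beta * kl


# Hypothetical reference policy over three candidate responses.
ref_probs = [0.5, 0.3, 0.2]
# Hypothetical scores from a learned reward model (assumed, not from the paper).
rewards = [0.1, 1.0, 0.4]

policy = kl_regularized_policy(ref_probs, rewards, beta=0.5)
```

A smaller `beta` lets the reward scores dominate, while a larger `beta` keeps the policy close to the reference distribution. In practice an LLM policy is updated by gradient steps over sampled text, not by enumerating candidates, so this only illustrates the trade-off that regularization introduces.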

arXiv information

Author	Chen Jia
Publication date	2024-02-22 18:20:33+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
