Provable Reinforcement Learning from Human Feedback with an Unknown Link Function

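Abstract

The link function, which characterizes how human preferences are generated from the value function of an RL problem, is a crucial component in the design of RLHF algorithms. Almost all RLHF algorithms, including state-of-the-art ones in empirical studies such as DPO and PPO, assume that the link function is known to the agent (for example, a logistic function under the Bradley-Terry model). Given the complex nature of human preferences, this assumption is arguably unrealistic. To avoid link function mis-specification, the paper studies general RLHF problems with an unknown link function. The authors propose a policy optimization algorithm called ZSPO, built on a new zeroth-order policy optimization method. The key idea is to use human preferences to construct a parameter update direction that is positively correlated with the true policy gradient direction. ZSPO does this by estimating the sign of the value function difference rather than estimating the gradient from that difference, so it never needs to know the link function. Under mild conditions, ZSPO converges to a stationary policy at a polynomial rate that depends on the number of policy iterations and the number of trajectories per iteration. Numerical results also show the superiority of ZSPO under link function mismatch.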
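The sign-based update described above can be illustrated on a toy problem. The following is a minimal sketch under stated assumptions, not the authors' ZSPO implementation: `toy_value`, `hidden_link`, the majority vote over repeated queries, and all step sizes are hypothetical choices made for illustration. The simulated annotator uses a non-logistic link that the optimizer never sees; the optimizer only learns whether a randomly perturbed parameter vector is preferred, and steps along or against the perturbation accordingly.

```python
import math
import random


def toy_value(theta):
    """Hypothetical stand-in for a policy's value; it peaks at theta = (1, -2)."""
    return -((theta[0] - 1.0) ** 2 + (theta[1] + 2.0) ** 2)


def hidden_link(x):
    """Link used only by the simulated annotator (assumption: increasing, 0.5 at 0).
    It is deliberately not logistic, and the optimizer never calls it."""
    return 0.5 + math.atan(3.0 * x) / math.pi


def prefers_perturbed(theta, perturbed, rng):
    """Simulated human: prefers the perturbed parameters with probability link(value gap)."""
    gap = toy_value(perturbed) - toy_value(theta)
    return rng.random() < hidden_link(gap)


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sign_based_zeroth_order(theta, steps=300, queries=15, mu=0.2, lr=0.05, seed=0):
    """Zeroth-order ascent that uses only the estimated sign of the value difference."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(steps):
        v = random_unit_vector(len(theta), rng)
        perturbed = [t + mu * x for t, x in zip(theta, v)]
        # Majority vote over noisy preference queries estimates sign(V(perturbed) - V(theta)).
        votes = sum(prefers_perturbed(theta, perturbed, rng) for _ in range(queries))
        sign = 1.0 if 2 * votes > queries else -1.0
        # Move along v if the perturbation was preferred, otherwise against it.
        theta = [t + lr * sign * x for t, x in zip(theta, v)]
    return theta


start = [0.0, 0.0]
final = sign_based_zeroth_order(start)
print(toy_value(start), toy_value(final))
```

Because only the sign of the vote enters the update, replacing `hidden_link` with any other increasing function that equals 0.5 at zero leaves the optimizer code unchanged, which is the property the abstract attributes to estimating the sign instead of the gradient.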

Abstract (original)

Link functions, which characterize how human preferences are generated from the value function of an RL problem, are a crucial component in designing RLHF algorithms. Almost all RLHF algorithms, including state-of-the-art ones in empirical studies such as DPO and PPO, assume the link function is known to the agent (e.g., a logistic function according to the Bradley-Terry model), which is arguably unrealistic considering the complex nature of human preferences. To avoid link function mis-specification, this paper studies general RLHF problems with unknown link functions. We propose a novel policy optimization algorithm called ZSPO based on a new zeroth-order policy optimization method, where the key is to use human preference to construct a parameter update direction that is positively correlated with the true policy gradient direction. ZSPO achieves it by estimating the sign of the value function difference instead of estimating the gradient from the value function difference, so it does not require knowing the link function. Under mild conditions, ZSPO converges to a stationary policy with a polynomial convergence rate depending on the number of policy iterations and trajectories per iteration. Numerical results also show the superiority of ZSPO under link function mismatch.

arXiv information

Authors	Qining Zhang, Lei Ying
Published	2025-06-03 16:42:39+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, DeepL
