Bootstrapping Language Models with DPO Implicit Rewards

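Summary

Aligning large language models (LLMs) with human preferences is an active area of research.
Direct preference optimization (DPO), a recent groundbreaking work, greatly simplified earlier pipelines for reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage of RLHF.
After training, DPO provides an implicit reward model.
In this work, we make the novel observation that this implicit reward model can itself be used in a bootstrapping fashion to further align the LLM.
Our approach uses rewards from the current LLM to construct a preference dataset, which is then used in subsequent DPO rounds.
We incorporate two refinements to further improve the approach: 1) length-regularized reward shaping, which makes the preference dataset length-unbiased;
2) experience replay, which enhances the quality of the preference dataset.
Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment.
Without relying on external feedback, it increases the length-controlled win rate on AlpacaEval 2 by more than 8% for all the different base models that we tried.
Our code is available at https://github.com/sail-sg/dice.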
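
For reference, the implicit reward mentioned in the summary can be written out explicitly. The first line is the standard DPO implicit reward, which holds up to a term that depends only on the prompt x and therefore cancels when two responses to the same prompt are compared. The second line is only an assumed form of the length-regularized shaping: the abstract does not give the formula, and the linear penalty and the symbol \alpha are illustrative choices, not taken from this page.

```latex
% DPO implicit reward (up to a prompt-only term that cancels in comparisons)
\[
  r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
\]

% Assumed length-regularized variant; |y| is the response length in tokens
\[
  r_{\mathrm{LR}}(x, y) = r_\theta(x, y) - \alpha \, |y|
\]
```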

Summary (original)

Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate two refinements to further improve our approach: 1) length-regularized reward shaping to make the preference dataset length-unbiased; 2) experience replay to enhance the quality of the preference dataset. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice.
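
The bootstrapping step described in the abstract (score sampled responses with the implicit reward, then turn them into preference pairs for the next DPO round) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Candidate`, `build_pair`, and `mix_with_replay` are hypothetical, the linear length penalty `alpha * num_tokens` is an assumed form of the length regularization, the replay scheme (adding a fixed ratio of offline pairs) is likewise assumed, and the log-probabilities are invented toy values rather than model outputs.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """One sampled response with its summed token log-probabilities."""
    text: str
    logp_policy: float  # log pi_theta(y|x) under the current DPO-tuned model
    logp_ref: float     # log pi_ref(y|x) under the reference model
    num_tokens: int     # response length |y|


def implicit_reward(c: Candidate, beta: float) -> float:
    """DPO implicit reward, up to a prompt-only constant."""
    return beta * (c.logp_policy - c.logp_ref)


def shaped_reward(c: Candidate, beta: float, alpha: float) -> float:
    """Implicit reward minus an ASSUMED linear length penalty alpha * |y|."""
    return implicit_reward(c, beta) - alpha * c.num_tokens


def build_pair(prompt, candidates, beta=0.1, alpha=0.01):
    """Label the highest-scoring response 'chosen' and the lowest 'rejected'."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    ranked = sorted(candidates, key=lambda c: shaped_reward(c, beta, alpha))
    return {"prompt": prompt, "chosen": ranked[-1].text, "rejected": ranked[0].text}


def mix_with_replay(generated, offline, replay_ratio, rng):
    """Append replayed offline pairs, replay_ratio per generated pair (assumed scheme)."""
    n_replay = min(len(offline), round(replay_ratio * len(generated)))
    return list(generated) + rng.sample(list(offline), n_replay)


# Toy numbers, invented for illustration only.
candidates = [
    Candidate("short answer", logp_policy=-10.0, logp_ref=-12.0, num_tokens=5),
    Candidate("long answer", logp_policy=-40.0, logp_ref=-43.0, num_tokens=40),
]
pair_raw = build_pair("some prompt", candidates, beta=0.1, alpha=0.0)
pair_shaped = build_pair("some prompt", candidates, beta=0.1, alpha=0.01)
```

With these toy values, the unshaped reward (`alpha=0.0`) prefers the longer response, while the shaped reward prefers the shorter one, which is the kind of length bias the shaping is meant to counteract. In a real pipeline the log-probabilities would come from scoring sampled responses with the policy and reference models.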

arXiv information

Authors	Changyu Chen,Zichen Liu,Chao Du,Tianyu Pang,Qian Liu,Arunesh Sinha,Pradeep Varakantham,Min Lin
Published	2025-03-07 15:26:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
