Entropy Controllable Direct Preference Optimization

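Summary

In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach for aligning generation with human preferences.
Direct Preference Optimization (DPO) trains the policy with a simple binary cross-entropy loss and needs no reward model.
The DPO objective is regularized by the reverse KL divergence, which encourages mode-seeking fitting to the reference policy.
Nonetheless, the authors show that minimizing the reverse KL divergence can fail to capture a mode of the reference distribution, which may hurt the policy's performance.
Based on this observation, they propose H-DPO, a simple modification to DPO that allows control over the entropy of the resulting policy, sharpening the distribution and thereby making mode-seeking fitting more effective.
In their experiments, H-DPO outperformed DPO across various tasks, including superior results in pass@$k$ evaluations on mathematical tasks.
Moreover, H-DPO is simple to implement, requiring only minor modifications to DPO's loss calculation, which makes it highly practical and promising for wide-ranging applications in LLM training.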
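
The summary describes the method as a small change to the DPO loss calculation. A minimal sketch of what that could look like follows, written for a single preference pair with plain floats. The baseline is the standard DPO binary cross-entropy loss on sequence log-probabilities. The `alpha` coefficient that scales the policy term is an illustrative assumption about where an entropy-control knob might enter; the abstract does not give the formula, and the function and parameter names are hypothetical. With `alpha=1.0` the function reduces to standard DPO.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
    alpha: float = 1.0,  # assumption: hypothetical entropy-control coefficient
) -> float:
    """Binary cross-entropy preference loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each response under the
    trained policy and under the frozen reference policy.
    """
    # Log-probability margin of chosen over rejected, per model.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Standard DPO uses beta * (policy_margin - ref_margin).
    # Assumed modification: scale only the policy term by alpha.
    logits = beta * (alpha * policy_margin - ref_margin)
    return -log_sigmoid(logits)


if __name__ == "__main__":
    # Policy identical to the reference: standard DPO loss is log(2).
    print(preference_loss(-1.0, -2.0, -1.0, -2.0))
    # Same inputs with the hypothetical coefficient set below 1.
    print(preference_loss(-1.0, -2.0, -1.0, -2.0, alpha=0.9))
```

In a real training loop the same arithmetic would be applied to batched tensors of log-probabilities, so the only change relative to an existing DPO implementation is the one line that computes `logits`.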

Summary (original)

In the post-training of large language models (LLMs), Reinforcement Learning from Human Feedback (RLHF) is an effective approach to achieve generation aligned with human preferences. Direct Preference Optimization (DPO) allows for policy training with a simple binary cross-entropy loss without a reward model. The objective of DPO is regularized by reverse KL divergence that encourages mode-seeking fitting to the reference policy. Nonetheless, we indicate that minimizing reverse KL divergence could fail to capture a mode of the reference distribution, which may hurt the policy’s performance. Based on this observation, we propose a simple modification to DPO, H-DPO, which allows for control over the entropy of the resulting policy, enhancing the distribution’s sharpness and thereby enabling mode-seeking fitting more effectively. In our experiments, we show that H-DPO outperformed DPO across various tasks, demonstrating superior results in pass@$k$ evaluations for mathematical tasks. Moreover, H-DPO is simple to implement, requiring only minor modifications to the loss calculation of DPO, which makes it highly practical and promising for wide-ranging applications in the training of LLMs.
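
For readers unfamiliar with the pass@$k$ metric mentioned in the abstract, here is a minimal sketch of the commonly used unbiased estimator: given `n` sampled solutions to a problem, of which `c` are correct, it estimates the probability that at least one of `k` samples drawn without replacement is correct. The abstract does not state which estimator the authors used, so treating this one as their evaluation procedure is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset has a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 correct, evaluated at k = 1 and k = 5.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

The benchmark score is the mean of this per-problem value over all problems.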

arXiv information

Authors	Motoki Omura, Yasuhiro Fujita, Toshiki Kataoka
Published	2024-11-12 07:09:44+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
