Pre-DPO: Improving Data Utilization in Direct Preference Optimization Using a Guiding Reference Model

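Summary

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by optimizing human preferences directly, without an explicit reward model.
We find that during DPO training the reference model acts as a data weight adjuster.
However, the common practice in DPO of initializing the policy model and the reference model identically can lead to inefficient data utilization and impose a performance ceiling.
Meanwhile, because Simple Preference Optimization (SimPO) has no reference model, its training is less robust and requires stricter conditions to prevent catastrophic forgetting.
In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that improves preference optimization performance by leveraging a guiding reference model.
This reference model provides foresight into the optimal policy state achievable with the training preference data, and serves as a guiding mechanism that adaptively assigns higher weights to samples better suited to the model and lower weights to those less suited.
Extensive experiments on the AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks show that Pre-DPO consistently improves the performance of both DPO and SimPO without relying on external models or additional data.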

Summary (original)

Direct Preference Optimization (DPO) simplifies reinforcement learning from human feedback (RLHF) for large language models (LLMs) by directly optimizing human preferences without an explicit reward model. We find that during DPO training, the reference model plays the role of a data weight adjuster. However, the common practice of initializing the policy and reference models identically in DPO can lead to inefficient data utilization and impose a performance ceiling. Meanwhile, the lack of a reference model in Simple Preference Optimization (SimPO) reduces training robustness and necessitates stricter conditions to prevent catastrophic forgetting. In this work, we propose Pre-DPO, a simple yet effective DPO-based training paradigm that enhances preference optimization performance by leveraging a guiding reference model. This reference model provides foresight into the optimal policy state achievable through the training preference data, serving as a guiding mechanism that adaptively assigns higher weights to samples more suitable for the model and lower weights to those less suitable. Extensive experiments on AlpacaEval 2.0 and Arena-Hard v0.1 benchmarks demonstrate that Pre-DPO consistently improves the performance of both DPO and SimPO, without relying on external models or additional data.
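The "data weight adjuster" role of the reference model can be illustrated with the standard DPO loss, whose per-sample gradient is scaled by a sigmoid of the gap between the policy's and the reference's preference margins. The following is a minimal sketch in plain Python, not the authors' implementation: the function name `dpo_loss_and_weight`, the example log-probabilities, and `beta=0.1` are all assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dpo_loss_and_weight(policy_chosen_logp: float,
                        policy_rejected_logp: float,
                        ref_chosen_logp: float,
                        ref_rejected_logp: float,
                        beta: float = 0.1) -> tuple[float, float]:
    """Return the standard DPO loss for one preference pair and the scalar
    that multiplies its gradient (hypothetical helper for illustration)."""
    # Log-probability margin (chosen minus rejected) under each model.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Implicit reward gap between chosen and rejected responses.
    u = beta * (policy_margin - ref_margin)
    loss = -math.log(sigmoid(u))
    # d(loss)/d(policy_margin) = -beta * sigmoid(-u), so sigmoid(-u)
    # acts as a per-sample weight that depends on the reference model.
    weight = sigmoid(-u)
    return loss, weight


# Same policy log-probabilities, three different (made-up) reference models.
policy = (-10.0, -12.0)
references = {
    "identical to policy": (-10.0, -12.0),
    "prefers chosen more strongly": (-9.0, -14.0),
    "prefers rejected": (-12.0, -11.0),
}
for name, ref in references.items():
    loss, weight = dpo_loss_and_weight(*policy, *ref)
    print(f"{name}: loss={loss:.4f}, weight={weight:.4f}")
```

When the reference equals the policy, as at the start of standard DPO training, the margin gap is zero and every sample receives the same weight of 0.5. A reference with a different margin moves the weight above or below 0.5, which is the sense in which the choice of reference model reweights the training data.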

arXiv information

Authors	Junshu Pan, Wei Shen, Shulin Huang, Qiji Zhou, Yue Zhang
Published	2025-04-25 07:47:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
