Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

Summary

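Learning from preference feedback has become an essential step for improving the generation quality and performance of modern language models (LMs).
Although it is widely used, the way preference-based learning is applied varies greatly in the data, learning algorithms, and evaluations involved, which makes it hard to disentangle the effect of each aspect.
This work identifies four core aspects of preference-based learning (preference data, learning algorithm, reward model, and policy training prompts), systematically investigates how each component affects downstream model performance, and proposes a strong recipe for learning from preference feedback.
The findings show that every aspect matters for performance: better preference data yields the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training.
Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains.
High-quality preference data improves instruction following and truthfulness by up to 8%.
Scaling up the reward model yields significant gains of up to 5% on mathematical evaluation, yet, surprisingly, only marginal improvements in other categories.
The authors publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) the models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).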

Abstract (original)

Learning from preference feedback has emerged as an essential step for improving the generation quality and performance of modern language models (LMs). Despite its widespread use, the way preference-based learning is applied varies wildly, with differing data, learning algorithms, and evaluations used, making disentangling the impact of each aspect difficult. In this work, we identify four core aspects of preference-based learning: preference data, learning algorithm, reward model, and policy training prompts, systematically investigate the impact of these components on downstream model performance, and suggest a recipe for strong learning for preference feedback. Our findings indicate that all aspects are important for performance, with better preference data leading to the largest improvements, followed by the choice of learning algorithm, the use of improved reward models, and finally the use of additional unlabeled prompts for policy training. Notably, PPO outperforms DPO by up to 2.5% in math and 1.2% in general domains. High-quality preference data leads to improvements of up to 8% in instruction following and truthfulness. Despite significant gains of up to 5% in mathematical evaluation when scaling up reward models, we surprisingly observe marginal improvements in other categories. We publicly release the code used for training (https://github.com/hamishivi/EasyLM) and evaluating (https://github.com/allenai/open-instruct) our models, along with the models and datasets themselves (https://huggingface.co/collections/allenai/tulu-v25-suite-66676520fd578080e126f618).

arXiv information

Authors	Hamish Ivison, Yizhong Wang, Jiacheng Liu, Zeqiu Wu, Valentina Pyatkin, Nathan Lambert, Noah A. Smith, Yejin Choi, Hannaneh Hajishirzi
Published	2024-06-13 16:17:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
