GFRIEND: Generative Few-shot Reward Inference through EfficieNt DPO

Abstract

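The ability to train high-performing reward models from few-shot data is important for improving the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF).
We propose a data augmentation and expansion framework that lets generative reward models trained on small datasets match the performance of those trained on large-scale datasets.
Conventional methods for training generative reward models, such as Direct Preference Optimization (DPO), are constrained by inefficient sample pairing and limited data diversity.
This work introduces preference refinement, which uses Chain-of-Thought (CoT) sampling to uncover diverse, high-quality preference relationships.
It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels, and uses Multi-level Direct Preference Optimization (M-DPO) so that the model can capture finer-grained preference differences between samples.
Experimental results show that the proposed method significantly improves data efficiency and model performance, allowing reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets.
This study highlights the potential of data-efficient strategies for advancing reward model optimization and offers a robust solution for low-resource RLHF applications.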

Abstract (original)

The ability to train high-performing reward models with few-shot data is critical for enhancing the efficiency and scalability of Reinforcement Learning from Human Feedback (RLHF). We propose a data augmentation and expansion framework that enables generative reward models trained on small datasets to achieve comparable performance to those trained on large-scale datasets. Traditional methods to train a generative reward model, such as Direct Preference Optimization (DPO), are constrained by inefficiencies in sample pairing and limited data diversity. This work introduces preference refinement, which employs Chain-of-Thought (CoT) sampling to uncover diverse and high-quality preference relationships. It also incorporates a perplexity-based scoring mechanism to assign nuanced preference levels and utilizes Multi-level Direct Preference Optimization (M-DPO) to enable the model to capture finer-grained preference differences between samples. Experimental results demonstrate that the proposed method significantly enhances data efficiency and model performance, enabling reward models trained in a few-shot setting to achieve results on par with those trained on large-scale datasets. This study underscores the potential of data-efficient strategies in advancing reward model optimization, offering a robust solution for low-resource RLHF applications.
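
The abstract names two building blocks, perplexity-based scoring and DPO, without giving formulas. The sketch below shows only the standard definitions of both: sequence perplexity from per-token log-probabilities, and the usual per-pair DPO loss. The `margin` argument is a hypothetical illustration of how a graded preference level might enter the loss; it is an assumption made here and is not taken from the paper's M-DPO formulation. All function and parameter names are likewise illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity of a sequence from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp,
             beta=0.1, margin=0.0):
    """Standard DPO loss for a single (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy and
    the frozen reference model. `margin` is a HYPOTHETICAL extra term
    (not from the paper): a larger margin asks for a larger gap between
    chosen and rejected, which is one way a preference level could be used.
    """
    chosen_reward = policy_chosen_lp - ref_chosen_lp
    rejected_reward = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_reward - rejected_reward) - margin
    # -log(sigmoid(logits)) equals softplus(-logits)
    return _softplus(-logits)
```

With `margin=0.0` this reduces to the ordinary DPO objective, so a policy identical to the reference yields a loss of log 2 for every pair.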

arXiv information

Authors	Yiyang Zhao,Huiyu Bai,Xuejiao Zhao
Published	2025-06-10 16:37:13+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
