Aligning Few-Step Diffusion Models with Dense Reward Difference Learning

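Summary

Aligning diffusion models with downstream objectives is essential for practical applications.
However, standard alignment methods often generalize poorly across steps when applied directly to few-step diffusion models, so performance is inconsistent across scenarios that use different numbers of denoising steps.
To address this, the paper introduces Stepwise Diffusion Policy Optimization (SDPO), a new alignment method tailored to few-step diffusion models.
Unlike prior approaches, which rely on a single sparse reward taken only from the final step of each denoising trajectory and optimize at the trajectory level, SDPO incorporates dense reward feedback at every intermediate step.
By learning the differences in dense rewards between paired samples, SDPO enables stepwise optimization of few-step diffusion models, ensuring consistent alignment across all denoising steps.
To keep training stable and efficient, SDPO introduces an online reinforcement learning framework with several new strategies designed to exploit the stepwise granularity of dense rewards.
Experimental results show that SDPO consistently outperforms prior methods in reward-based alignment across a range of step configurations, underscoring its robust step generalization.
Code is available at https://github.com/ZiyiZhang27/sdpo.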
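
The summary describes SDPO as learning differences in dense rewards between paired samples at every denoising step. The following is only a minimal sketch of what a per-step reward-difference objective could look like, not the loss defined in the paper or implemented in the linked repository. The function name, the squared-error form, the `beta` scale, and the use of per-step log-probability ratios against a reference model are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def stepwise_reward_difference_loss(
    logratio_a: Sequence[float],
    logratio_b: Sequence[float],
    rewards_a: Sequence[float],
    rewards_b: Sequence[float],
    beta: float = 1.0,
) -> Tuple[float, List[float]]:
    """Hypothetical per-step loss for one pair of denoising trajectories.

    logratio_*: per-step log-probability ratio of the trained model
                against a reference model (assumed input).
    rewards_*:  dense reward assigned to each denoising step (assumed input).
    Returns the mean loss and the list of per-step losses.
    """
    lengths = {len(logratio_a), len(logratio_b), len(rewards_a), len(rewards_b)}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all inputs must be non-empty and share one length")

    per_step: List[float] = []
    for la, lb, ra, rb in zip(logratio_a, logratio_b, rewards_a, rewards_b):
        reward_diff = ra - rb        # how much better sample A is at this step
        logratio_diff = la - lb      # how much more the model favors A at this step
        # Assumed objective: the scaled preference gap should match the reward gap.
        per_step.append((reward_diff - beta * logratio_diff) ** 2)

    return sum(per_step) / len(per_step), per_step


# Two paired trajectories with two denoising steps each (made-up numbers).
loss, per_step = stepwise_reward_difference_loss(
    logratio_a=[0.1, 0.2],
    logratio_b=[0.0, 0.0],
    rewards_a=[1.0, 2.0],
    rewards_b=[0.5, 1.0],
)
```

A sparse, trajectory-level variant would compute a single such term from the final-step reward only. Keeping one term per step is what gives each denoising step its own training signal in this sketch.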

Abstract (original)

Aligning diffusion models with downstream objectives is essential for their practical applications. However, standard alignment methods often struggle with step generalization when directly applied to few-step diffusion models, leading to inconsistent performance across different denoising step scenarios. To address this, we introduce Stepwise Diffusion Policy Optimization (SDPO), a novel alignment method tailored for few-step diffusion models. Unlike prior approaches that rely on a single sparse reward from only the final step of each denoising trajectory for trajectory-level optimization, SDPO incorporates dense reward feedback at every intermediate step. By learning the differences in dense rewards between paired samples, SDPO facilitates stepwise optimization of few-step diffusion models, ensuring consistent alignment across all denoising steps. To promote stable and efficient training, SDPO introduces an online reinforcement learning framework featuring several novel strategies designed to effectively exploit the stepwise granularity of dense rewards. Experimental results demonstrate that SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations, underscoring its robust step generalization capabilities. Code is available at https://github.com/ZiyiZhang27/sdpo.

arXiv information

Authors	Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Bo Du, Dacheng Tao
Published	2024-11-18 16:57:41+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
