Self-Play Fine-Tuning of Diffusion Models for Text-to-Image Generation

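Summary

Fine-tuning diffusion models remains an underexplored area of generative artificial intelligence (GenAI), especially compared with the remarkable progress in fine-tuning large language models (LLMs).
State-of-the-art diffusion models such as Stable Diffusion (SD) and SDXL rely on supervised fine-tuning, but their performance inevitably plateaus after they have seen a certain amount of data.
Recently, reinforcement learning (RL) has been used to fine-tune diffusion models on human preference data, but it requires at least two images (a "winner" and a "loser") for each text prompt.
This paper introduces self-play fine-tuning for diffusion models (SPIN-Diffusion), a technique in which the diffusion model competes with its own earlier versions, driving an iterative self-improvement process.
The approach offers an alternative to conventional supervised fine-tuning and RL strategies and significantly improves both model performance and alignment.
Experiments on the Pick-a-Pic dataset show that SPIN-Diffusion outperforms the existing supervised fine-tuning method in human preference alignment and visual appeal from its first iteration.
By the second iteration, it surpasses RLHF-based methods on all metrics while using less data.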
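
The abstract does not state the training objective, so the following is only a toy sketch of the self-play idea under an assumed form: a logistic loss that rewards the current model for fitting the real image better than an image generated by its own earlier version, measured relative to that earlier version. The function name, its arguments (per-sample denoising errors), and the `beta` scale are all hypothetical and are not taken from the paper.

```python
import math


def self_play_loss(err_cur_real, err_prev_real, err_cur_gen, err_prev_gen, beta=1.0):
    """Toy self-play loss (assumed form, not the paper's exact objective).

    err_*_real: denoising error on the real image for a prompt
    err_*_gen:  denoising error on an image the previous model generated
    cur / prev: current model vs. the frozen earlier version
    """
    # How much the current model improved on real data, minus how much it
    # "improved" on the previous model's own generations.
    margin = (err_prev_real - err_cur_real) - (err_prev_gen - err_cur_gen)
    x = beta * margin
    # Numerically stable log(1 + exp(-x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# The "loser" image comes from the earlier model, so only one real image
# per prompt is needed in this sketch.
baseline = self_play_loss(1.0, 1.0, 1.0, 1.0)   # current == previous
improved = self_play_loss(0.5, 1.0, 1.5, 1.0)   # fits real data better
print(improved < baseline)
```

In an iterative setup one would presumably freeze the trained model as the new "previous" version and repeat, which matches the abstract's description of competing with earlier versions.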

Abstract (original)

Fine-tuning Diffusion Models remains an underexplored frontier in generative artificial intelligence (GenAI), especially when compared with the remarkable progress made in fine-tuning Large Language Models (LLMs). While cutting-edge diffusion models such as Stable Diffusion (SD) and SDXL rely on supervised fine-tuning, their performance inevitably plateaus after seeing a certain volume of data. Recently, reinforcement learning (RL) has been employed to fine-tune diffusion models with human preference data, but it requires at least two images (‘winner’ and ‘loser’ images) for each text prompt. In this paper, we introduce an innovative technique called self-play fine-tuning for diffusion models (SPIN-Diffusion), where the diffusion model engages in competition with its earlier versions, facilitating an iterative self-improvement process. Our approach offers an alternative to conventional supervised fine-tuning and RL strategies, significantly improving both model performance and alignment. Our experiments on the Pick-a-Pic dataset reveal that SPIN-Diffusion outperforms the existing supervised fine-tuning method in aspects of human preference alignment and visual appeal right from its first iteration. By the second iteration, it exceeds the performance of RLHF-based methods across all metrics, achieving these results with less data.

arXiv information

Authors	Huizhuo Yuan, Zixiang Chen, Kaixuan Ji, Quanquan Gu
Published	2024-02-15 18:59:18+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
