Flow-GRPO: Training Flow Matching Models via Online RL

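Summary

We propose Flow-GRPO, the first method to integrate online reinforcement learning (RL) into flow matching models.
Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic ordinary differential equation (ODE) into an equivalent stochastic differential equation (SDE) matching the original model's marginal distribution at every timestep, enabling statistical sampling for RL exploration; and
(2) a Denoising Reduction strategy that reduces the number of denoising steps used in training while keeping the original number of inference timesteps, significantly improving sampling efficiency without degrading performance.
Empirically, Flow-GRPO is effective across multiple text-to-image tasks.
For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, raising GenEval accuracy from $63\%$ to $95\%$.
In visual text rendering, accuracy improves from $59\%$ to $92\%$, significantly enhancing text generation.
Flow-GRPO also achieves substantial gains in human preference alignment.
Notably, little to no reward hacking occurred: rewards did not increase at the cost of image quality or diversity, both of which remained stable in the experiments.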

Summary (original)

We propose Flow-GRPO, the first method integrating online reinforcement learning (RL) into flow matching models. Our approach uses two key strategies: (1) an ODE-to-SDE conversion that transforms a deterministic Ordinary Differential Equation (ODE) into an equivalent Stochastic Differential Equation (SDE) that matches the original model’s marginal distribution at all timesteps, enabling statistical sampling for RL exploration; and (2) a Denoising Reduction strategy that reduces training denoising steps while retaining the original inference timestep number, significantly improving sampling efficiency without performance degradation. Empirically, Flow-GRPO is effective across multiple text-to-image tasks. For complex compositions, RL-tuned SD3.5 generates nearly perfect object counts, spatial relations, and fine-grained attributes, boosting GenEval accuracy from $63\%$ to $95\%$. In visual text rendering, its accuracy improves from $59\%$ to $92\%$, significantly enhancing text generation. Flow-GRPO also achieves substantial gains in human preference alignment. Notably, little to no reward hacking occurred, meaning rewards did not increase at the cost of image quality or diversity, and both remained stable in our experiments.
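The ODE-to-SDE idea in the abstract can be illustrated on a toy problem. The sketch below is not the paper's implementation. It assumes a 1D Gaussian data distribution, the interpolation `x_t = (1 - t) * x_0 + t * eps`, and a constant noise level `sigma`, all chosen here only so that the velocity and score are available in closed form. Under those assumptions, adding the drift correction `-0.5 * sigma**2 * score` together with injected noise should leave the marginals of the deterministic sampler unchanged. The names `gaussian_flow` and `sample` are hypothetical.

```python
import numpy as np

def gaussian_flow(m, s):
    """Closed-form velocity and score for x_t = (1 - t) * x_0 + t * eps,
    with x_0 ~ N(m, s^2) and eps ~ N(0, 1). Toy assumption, not from the paper."""
    def mean(t):
        return (1.0 - t) * m

    def var(t):
        return (1.0 - t) ** 2 * s ** 2 + t ** 2

    def velocity(x, t):
        # E[eps - x_0 | x_t = x] for jointly Gaussian variables
        gain = (t - (1.0 - t) * s ** 2) / var(t)
        return -m + gain * (x - mean(t))

    def score(x, t):
        # gradient of log N(x; mean(t), var(t))
        return -(x - mean(t)) / var(t)

    return velocity, score

def sample(velocity, score, n, steps, sigma, rng):
    """Integrate from t = 1 (noise) down to t = 0 (data).
    sigma = 0 gives the deterministic ODE sampler;
    sigma > 0 gives an SDE sampler with the same marginals (Euler-Maruyama)."""
    x = rng.standard_normal(n)
    dt = -1.0 / steps  # negative: time runs backward
    for k in range(steps):
        t = 1.0 + k * dt
        drift = velocity(x, t) - 0.5 * sigma ** 2 * score(x, t)
        noise = sigma * np.sqrt(-dt) * rng.standard_normal(n)
        x = x + drift * dt + noise
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    velocity, score = gaussian_flow(m=2.0, s=0.5)
    x_ode = sample(velocity, score, n=20000, steps=200, sigma=0.0, rng=rng)
    x_sde = sample(velocity, score, n=20000, steps=200, sigma=0.5, rng=rng)
    print("ODE mean/std:", x_ode.mean(), x_ode.std())
    print("SDE mean/std:", x_sde.mean(), x_sde.std())
```

Both samplers should end near the target mean 2.0 and standard deviation 0.5, but only the SDE version produces different outputs from the same starting noise, which is the kind of randomness an RL method can use for exploration.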

arXiv information

Authors	Jie Liu, Gongye Liu, Jiajun Liang, Yangguang Li, Jiaheng Liu, Xintao Wang, Pengfei Wan, Di Zhang, Wanli Ouyang
Published	2025-05-08 17:58:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
