Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis

Abstract

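Diffusion models are a promising approach to image generation and have been applied to Pose-Guided Person Image Synthesis (PGPIS) with competitive performance.
Existing methods simply align the person's appearance to the target pose, and they are prone to overfitting because they lack a high-level semantic understanding of the source person image.
In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS.
In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm based purely on images to control the generation process of a pre-trained text-to-image diffusion model.
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract a semantic understanding of person images as a coarse-grained prompt.
This decouples the fine-grained appearance and pose information controls at different stages and thus circumvents the potential overfitting problem.
To generate more realistic texture details, a hybrid-granularity attention module is proposed that encodes multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt.
Quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the art for PGPIS.
Code is available at https://github.com/YanzuoLu/CFLD.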
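
The idea of progressively refining a set of learnable queries against image features can be illustrated with a small sketch. This is not the authors' implementation (see the repository above for that). It assumes plain dot-product cross-attention without learned projections, a residual update per feature level, and random stand-in data. The names `cross_attention`, `refine_queries`, and `feature_pyramid` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Each query attends over all feature tokens (assumption: no learned projections)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(d) for f in features]
        weights = softmax(scores)
        out.append([sum(w * f[j] for w, f in zip(weights, features)) for j in range(d)])
    return out


def refine_queries(queries, feature_pyramid):
    """Refine the queries one feature level at a time with a residual update."""
    for features in feature_pyramid:
        attended = cross_attention(queries, features)
        queries = [[qi + ai for qi, ai in zip(q, a)] for q, a in zip(queries, attended)]
    return queries


random.seed(0)
dim, num_queries = 8, 4
# Hypothetical stand-ins: random "learnable" queries and three levels of image features.
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_queries)]
feature_pyramid = [
    [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)] for n in (16, 8, 4)
]
coarse_prompt = refine_queries(queries, feature_pyramid)
print(len(coarse_prompt), len(coarse_prompt[0]))  # prints "4 8"
```

The output keeps the shape of the query set (one vector per query), which is what lets it stand in for a text prompt embedding when conditioning a text-to-image model. In a trained system the queries and attention projections would be learned parameters rather than random values.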

Abstract (original)

Diffusion model is a promising approach to image generation and has been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose, they are prone to overfitting due to the lack of a high-level semantic understanding on the source person image. In this paper, we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts, we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages, and thus circumventing the potential overfitting problem. To generate more realistic texture details, a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the arts for PGPIS. Code is available at https://github.com/YanzuoLu/CFLD.

arXiv information

Authors	Yanzuo Lu, Manlin Zhang, Andy J Ma, Xiaohua Xie, Jian-Huang Lai
Published	2024-04-09 14:12:02+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
