Controlling Human Shape and Pose in Text-to-Image Diffusion Models via Domain Adaptation

Summary

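We present a method for conditionally controlling human shape and pose in pretrained text-to-image diffusion models using a 3D parametric human model (SMPL).
Fine-tuning these diffusion models to follow a new condition requires large datasets with high-quality annotations, which can be obtained more cost-effectively through synthetic data generation than from real-world data.
However, the domain gap and low scene diversity of synthetic data can degrade the visual fidelity of the pretrained model.
We propose a domain adaptation technique that preserves image quality by isolating the synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network that adapts the generated images to the input domain.
To achieve SMPL control, we fine-tune a ControlNet-based architecture on SURREAL, a synthetic dataset of rendered humans, and apply the domain adaptation at generation time.
Experiments show that the model achieves greater shape and pose diversity than the 2D pose-based ControlNet while maintaining visual fidelity and improving stability, which demonstrates its usefulness for downstream tasks such as human animation.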
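
The idea of isolating conditional information in the classifier-free guidance vector can be illustrated with a small numerical sketch. This is only one plausible reading of the summary, not the authors' exact formulation: the function names, the two guidance scales, and the way the terms are summed are all assumptions made for illustration. The arrays stand in for the noise predictions a diffusion model would produce at a single denoising step.

```python
import numpy as np


def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def composed_guidance(eps_domain_uncond, eps_domain_text,
                      eps_synth_uncond, eps_synth_cond,
                      text_scale, control_scale):
    """Hypothetical composition (an assumption, not the paper's formula).

    eps_domain_*: predictions from a branch that stays in the target
                  image domain (without / with the text prompt).
    eps_synth_*:  predictions from the synthetically trained control
                  branch (without / with the shape-and-pose condition).
    """
    # The difference isolates what the condition contributes, so the
    # synthetic look shared by both synthetic predictions cancels out.
    control_direction = eps_synth_cond - eps_synth_uncond
    # Add that direction on top of ordinary text guidance computed in
    # the target domain.
    return cfg(eps_domain_uncond, eps_domain_text, text_scale) \
        + control_scale * control_direction


# Toy predictions shaped like a small latent (channels, height, width).
shape = (4, 8, 8)
eps_domain_uncond = np.zeros(shape)
eps_domain_text = np.ones(shape)
eps_synth_uncond = np.full(shape, 0.5)
eps_synth_cond = np.full(shape, 1.5)

eps = composed_guidance(eps_domain_uncond, eps_domain_text,
                        eps_synth_uncond, eps_synth_cond,
                        text_scale=2.0, control_scale=3.0)
```

Setting `control_scale` to zero reduces the sketch to ordinary text guidance, which is a convenient sanity check that the control term is purely additive.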

Abstract (original)

We present a methodology for conditional control of human shape and pose in pretrained text-to-image diffusion models using a 3D human parametric model (SMPL). Fine-tuning these diffusion models to adhere to new conditions requires large datasets and high-quality annotations, which can be more cost-effectively acquired through synthetic data generation rather than real-world data. However, the domain gap and low scene diversity of synthetic data can compromise the pretrained model’s visual fidelity. We propose a domain-adaptation technique that maintains image quality by isolating synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network to adapt the generated images to the input domain. To achieve SMPL control, we fine-tune a ControlNet-based architecture on the synthetic SURREAL dataset of rendered humans and apply our domain adaptation at generation time. Experiments demonstrate that our model achieves greater shape and pose diversity than the 2d pose-based ControlNet, while maintaining the visual fidelity and improving stability, proving its usefulness for downstream tasks such as human animation.

arXiv information

Authors	Benito Buchheim, Max Reimann, Jürgen Döllner
Published	2024-11-07 14:02:41+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
