Detailed Human-Centric Text Description-Driven Large Scene Synthesis

Summary

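Diffusion models have greatly advanced text-driven synthesis of large scene images, but the output remains hard to control.
Adding spatial controls paired with corresponding text has improved the controllability of large scene synthesis, yet faithfully reflecting a detailed text description without user-provided controls is still difficult.
This work proposes DetText2Scene, a text-driven large-scale image synthesis method for detailed human-centric text descriptions that offers high faithfulness, controllability, and naturalness in a global context.
DetText2Scene consists of 1) hierarchical keypoint-box layout generation from the detailed description using a large language model (LLM), 2) a view-wise conditioned joint diffusion process that synthesizes a large scene from the detailed text and the LLM-generated grounded keypoint-box layout, and 3) pixel perturbation-based pyramidal interpolation that progressively refines the large scene for global coherence.
DetText2Scene significantly outperforms prior work in text-to-large-scene synthesis both qualitatively and quantitatively, showing strong faithfulness to detailed descriptions, superior controllability, and excellent naturalness in a global context.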

Abstract (original)

Text-driven large scene image synthesis has made significant progress with diffusion models, but controlling it is challenging. While using additional spatial controls with corresponding texts has improved the controllability of large scene synthesis, it is still challenging to faithfully reflect detailed text descriptions without user-provided controls. Here, we propose DetText2Scene, a novel text-driven large-scale image synthesis with high faithfulness, controllability, and naturalness in a global context for the detailed human-centric text description. Our DetText2Scene consists of 1) hierarchical keypoint-box layout generation from the detailed description by leveraging large language model (LLM), 2) view-wise conditioned joint diffusion process to synthesize a large scene from the given detailed text with LLM-generated grounded keypoint-box layout and 3) pixel perturbation-based pyramidal interpolation to progressively refine the large scene for global coherence. Our DetText2Scene significantly outperforms prior arts in text-to-large scene synthesis qualitatively and quantitatively, demonstrating strong faithfulness with detailed descriptions, superior controllability, and excellent naturalness in a global context.
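The abstract gives no implementation details for the view-wise joint diffusion step. As a generic illustration only, the sketch below shows one common way outputs computed on overlapping views of a large canvas can be fused: each view is processed independently and overlapping pixels are averaged. All names here (`view_origins`, `fuse_views`, `step`) are hypothetical, and `step` merely stands in for a conditioned per-view denoising update; this is an assumption about the general idea, not the authors' method.

```python
from typing import Callable, List, Tuple

Grid = List[List[float]]


def view_origins(size: int, window: int, stride: int) -> List[int]:
    """Start offsets of windows of length `window` that cover [0, size)."""
    if stride < 1 or stride > window:
        raise ValueError("stride must be in [1, window] so views leave no gaps")
    if window >= size:
        return [0]
    starts = list(range(0, size - window + 1, stride))
    if starts[-1] != size - window:
        starts.append(size - window)  # make the last window reach the edge
    return starts


def fuse_views(
    canvas: Grid,
    window: int,
    stride: int,
    step: Callable[[Grid, Tuple[int, int]], Grid],
) -> Grid:
    """Apply `step` to every overlapping view and average the overlaps.

    `step` is a hypothetical stand-in for a per-view update (for example a
    denoising step conditioned on that view's text and layout). It receives
    the cropped view and its (top, left) origin and returns a same-size grid.
    """
    height, width = len(canvas), len(canvas[0])
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top in view_origins(height, window, stride):
        for left in view_origins(width, window, stride):
            crop = [row[left:left + window] for row in canvas[top:top + window]]
            updated = step(crop, (top, left))
            for dy, row in enumerate(updated):
                for dx, value in enumerate(row):
                    total[top + dy][left + dx] += value
                    count[top + dy][left + dx] += 1
    # Every pixel is covered by at least one view, so count is never zero.
    return [[total[y][x] / count[y][x] for x in range(width)] for y in range(height)]
```

Averaging keeps neighbouring views consistent where they overlap. In a real diffusion pipeline this fusion would run once per denoising timestep, on latents rather than on a plain list of floats.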

arXiv information

Authors	Gwanghyun Kim, Dong Un Kang, Hoigi Seo, Hayeon Kim, Se Young Chun
Published	2023-11-30 16:04:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
