LayoutGPT: Compositional Visual Planning and Generation with Large Language Models

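Summary

Achieving a high degree of user controllability in visual generation often requires intricate, fine-grained inputs such as layouts.
Compared with simple text inputs, however, such inputs place a substantial burden on users.
To address this, we study how Large Language Models (LLMs) can act as visual planners by generating layouts from text conditions, and thereby collaborate with visual generative models.
We propose LayoutGPT, a method that composes in-context visual demonstrations in style sheet language to strengthen the visual planning skills of LLMs.
LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes.
LayoutGPT also performs strongly at converting challenging language concepts, such as numerical and spatial relations, into layout arrangements for faithful text-to-image generation.
When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and performs comparably to human users in designing visual layouts for numerical and spatial correctness.
Finally, LayoutGPT performs comparably to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential across multiple visual domains.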
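
To make the idea of "in-context visual demonstrations in style sheet language" more concrete, the following is a minimal Python sketch of how a layout might be serialized into CSS-like rules and assembled into a few-shot prompt. It is an illustration only, not the authors' implementation: the `Box` class, the `box_to_css` and `build_prompt` helpers, the pixel units, the `Prompt:`/`Layout:` labels, and the example captions and coordinates are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """One object in a 2D layout (hypothetical; pixel units are assumed)."""
    name: str
    left: int
    top: int
    width: int
    height: int


def box_to_css(box: Box) -> str:
    """Serialize one box as a single CSS-like rule (assumed format)."""
    return (
        f"{box.name} {{ left: {box.left}px; top: {box.top}px; "
        f"width: {box.width}px; height: {box.height}px }}"
    )


def build_prompt(demos: list[tuple[str, list[Box]]], query: str) -> str:
    """Concatenate (caption, layout) demonstrations, then the open query."""
    parts = []
    for caption, boxes in demos:
        rules = "\n".join(box_to_css(b) for b in boxes)
        parts.append(f"Prompt: {caption}\nLayout:\n{rules}")
    # The final entry leaves the layout empty for the LLM to complete.
    parts.append(f"Prompt: {query}\nLayout:")
    return "\n\n".join(parts)


# Made-up demonstration data, for illustration only.
demos = [
    (
        "two apples on a table",
        [Box("apple", 10, 20, 16, 16), Box("apple", 36, 22, 16, 16)],
    )
]
prompt = build_prompt(demos, "three cats to the left of a dog")
print(prompt)
```

In this sketch the resulting string would be sent to an LLM, and the CSS-like rules it returns would be parsed back into boxes for a downstream layout-conditioned image generator. How demonstrations are selected and how the output is parsed are left out here, since the summary does not describe them.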

Summary (original)

Attaining a high degree of user controllability in visual generation often requires intricate, fine-grained inputs like layouts. However, such inputs impose a substantial burden on users when compared to simple text inputs. To address the issue, we study how Large Language Models (LLMs) can serve as visual planners by generating layouts from text conditions, and thus collaborate with visual generative models. We propose LayoutGPT, a method to compose in-context visual demonstrations in style sheet language to enhance the visual planning skills of LLMs. LayoutGPT can generate plausible layouts in multiple domains, ranging from 2D images to 3D indoor scenes. LayoutGPT also shows superior performance in converting challenging language concepts like numerical and spatial relations to layout arrangements for faithful text-to-image generation. When combined with a downstream image generation model, LayoutGPT outperforms text-to-image models/systems by 20-40% and achieves comparable performance as human users in designing visual layouts for numerical and spatial correctness. Lastly, LayoutGPT achieves comparable performance to supervised methods in 3D indoor scene synthesis, demonstrating its effectiveness and potential in multiple visual domains.

arXiv information

Authors	Weixi Feng, Wanrong Zhu, Tsu-jui Fu, Varun Jampani, Arjun Akula, Xuehai He, Sugato Basu, Xin Eric Wang, William Yang Wang
Published	2023-05-24 17:56:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
