MIDI: Multi-Instance Diffusion for Single Image to 3D Scene Generation

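Abstract

This paper introduces MIDI, a new paradigm for compositional 3D scene generation from a single image. Existing methods rely on reconstruction or retrieval, and more recent approaches generate objects one at a time in multiple stages. MIDI instead extends a pre-trained image-to-3D object generation model into a multi-instance diffusion model, so that multiple 3D instances are generated simultaneously with accurate spatial relationships and strong generalization. At its core is a new multi-instance attention mechanism that captures inter-object interactions and spatial coherence directly within the generation process, with no need for a complex multi-step pipeline. The method takes partial object images and global scene context as input and models object completion directly during 3D generation. During training, a limited amount of scene-level data supervises the interactions between 3D instances, while single-object data is used for regularization, which preserves the pre-trained model's generalization ability. MIDI achieves state-of-the-art performance in image-to-scene generation, as validated on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.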

Abstract (original)

This paper introduces MIDI, a novel paradigm for compositional 3D scene generation from a single image. Unlike existing methods that rely on reconstruction or retrieval techniques or recent approaches that employ multi-stage object-by-object generation, MIDI extends pre-trained image-to-3D object generation models to multi-instance diffusion models, enabling the simultaneous generation of multiple 3D instances with accurate spatial relationships and high generalizability. At its core, MIDI incorporates a novel multi-instance attention mechanism that effectively captures inter-object interactions and spatial coherence directly within the generation process, without the need for complex multi-step processes. The method utilizes partial object images and global scene context as inputs, directly modeling object completion during 3D generation. During training, we effectively supervise the interactions between 3D instances using a limited amount of scene-level data, while incorporating single-object data for regularization, thereby maintaining the pre-trained generalization ability. MIDI demonstrates state-of-the-art performance in image-to-scene generation, validated through evaluations on synthetic data, real-world scene data, and stylized scene images generated by text-to-image diffusion models.
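
The abstract does not spell out how the multi-instance attention mechanism is built. As a rough illustration only, the sketch below contrasts ordinary per-object self-attention with one plausible reading of attention across instances, in which the tokens of each object also attend to the tokens of every other object in the scene. The function names, the tensor layout (instances, tokens, channels), and the use of plain single-head dot-product attention in NumPy are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_instance_attention(q, k, v):
    # Hypothetical baseline: each instance attends only to its own tokens.
    # q, k, v have shape (N instances, T tokens, D channels).
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (N, T, T)
    return softmax(scores) @ v                       # (N, T, D)

def multi_instance_attention(q, k, v):
    # Assumed variant: keys and values from all instances are pooled,
    # so every token can attend to tokens of every instance in the scene.
    n, t, d = q.shape
    k_all = k.reshape(n * t, d)                      # (N*T, D)
    v_all = v.reshape(n * t, d)                      # (N*T, D)
    scores = q @ k_all.T / np.sqrt(d)                # (N, T, N*T)
    return softmax(scores) @ v_all                   # (N, T, D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(3, 4, 8))              # 3 objects, 4 tokens each
    out = multi_instance_attention(tokens, tokens, tokens)
    print(out.shape)
```

In this toy version, changing the tokens of one object changes the output for every other object, which is the kind of inter-object coupling the abstract describes; with a single instance the two functions coincide.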

arXiv information

Authors	Zehuan Huang, Yuan-Chen Guo, Xingqiao An, Yunhan Yang, Yangguang Li, Zi-Xin Zou, Ding Liang, Xihui Liu, Yan-Pei Cao, Lu Sheng
Published	2024-12-04 18:52:40+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
