Scenethesis: A Language and Vision Agentic Framework for 3D Scene Generation

Abstract

Synthesizing interactive 3D scenes from text is essential for gaming, virtual reality, and embodied AI. However, existing methods face several challenges. Learning-based approaches depend on small-scale indoor datasets, limiting the scene diversity and layout complexity. While large language models (LLMs) can leverage diverse text-domain knowledge, they struggle with spatial realism, often producing unnatural object placements that fail to respect common sense. Our key insight is that vision perception can bridge this gap by providing realistic spatial guidance that LLMs lack. To this end, we introduce Scenethesis, a training-free agentic framework that integrates LLM-based scene planning with vision-guided layout refinement. Given a text prompt, Scenethesis first employs an LLM to draft a coarse layout. A vision module then refines it by generating an image guidance and extracting scene structure to capture inter-object relations. Next, an optimization module iteratively enforces accurate pose alignment and physical plausibility, preventing artifacts like object penetration and instability. Finally, a judge module verifies spatial coherence. Comprehensive experiments show that Scenethesis generates diverse, realistic, and physically plausible 3D interactive scenes, making it valuable for virtual content creation, simulation environments, and embodied AI research.
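
The four stages named in the abstract (LLM planning, vision-guided refinement, optimization, judging) can be read as a simple linear pipeline. The sketch below is only a toy illustration of that control flow, not the authors' implementation: the abstract does not specify any interfaces, so `Box`, `generate_scene`, `resolve_penetrations`, and `judge` are hypothetical names, objects are reduced to 2D axis-aligned footprints, and the planner and vision stages are caller-supplied stand-ins. The optimization stage here only pushes overlapping footprints apart, which is far simpler than the pose alignment and stability checks the paper describes.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned footprint of an object on the floor plane."""
    name: str
    x: float  # center along x
    y: float  # center along y
    w: float  # extent along x
    d: float  # extent along y


def overlap(a: Box, b: Box) -> Tuple[float, float]:
    """Penetration depth along x and y; both are positive only if the boxes intersect."""
    ox = (a.w + b.w) / 2 - abs(a.x - b.x)
    oy = (a.d + b.d) / 2 - abs(a.y - b.y)
    return ox, oy


def penetrates(a: Box, b: Box, eps: float = 1e-9) -> bool:
    ox, oy = overlap(a, b)
    return ox > eps and oy > eps


def resolve_penetrations(boxes: List[Box], max_iters: int = 100) -> List[Box]:
    """Toy stand-in for the optimization stage: iteratively push overlapping pairs apart.

    Convergence is not guaranteed for crowded layouts; max_iters bounds the loop.
    """
    boxes = list(boxes)
    for _ in range(max_iters):
        moved = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if not penetrates(a, b):
                    continue
                ox, oy = overlap(a, b)
                # Separate along the axis of least penetration, half the distance each.
                if ox <= oy:
                    s = 1.0 if a.x <= b.x else -1.0
                    boxes[i] = replace(a, x=a.x - s * ox / 2)
                    boxes[j] = replace(b, x=b.x + s * ox / 2)
                else:
                    s = 1.0 if a.y <= b.y else -1.0
                    boxes[i] = replace(a, y=a.y - s * oy / 2)
                    boxes[j] = replace(b, y=b.y + s * oy / 2)
                moved = True
        if not moved:
            break
    return boxes


def judge(boxes: List[Box]) -> bool:
    """Toy stand-in for the judge stage: accept only penetration-free layouts."""
    return all(
        not penetrates(a, b)
        for i, a in enumerate(boxes)
        for b in boxes[i + 1:]
    )


def generate_scene(
    prompt: str,
    plan: Callable[[str], List[Box]],          # stand-in for the LLM planner
    refine: Callable[[List[Box]], List[Box]],  # stand-in for the vision module
) -> Tuple[List[Box], bool]:
    coarse = plan(prompt)                       # 1. draft a coarse layout
    refined = refine(coarse)                    # 2. vision-guided refinement
    optimized = resolve_penetrations(refined)   # 3. enforce physical plausibility
    return optimized, judge(optimized)          # 4. verify the result
```

In this sketch a caller would pass its own planner and refiner, for example `generate_scene("a small study", my_plan, my_refine)`, where `my_plan` and `my_refine` are placeholders for whatever LLM and vision components are available.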

arXiv information

Authors	Lu Ling, Chen-Hsuan Lin, Tsung-Yi Lin, Yifan Ding, Yu Zeng, Yichen Sheng, Yunhao Ge, Ming-Yu Liu, Aniket Bera, Zhaoshuo Li
Published	2025-05-05 17:59:58+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
