VideoScene: Distilling Video Diffusion Model to Generate 3D Scenes in One Step

Summary

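Recovering 3D scenes from sparse views is challenging because the problem is inherently ill-posed.
Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic models) to mitigate the issue.
However, their performance still degrades when the input views overlap only minimally and provide insufficient visual information.
Fortunately, recent video generative models show promise in addressing this challenge, since they can generate video clips with plausible 3D structure.
Powered by large pretrained video diffusion models, some pioneering studies have begun to explore the video generative prior and create 3D scenes from sparse views.
Despite impressive improvements, they are limited by slow inference and the lack of 3D constraints, leading to inefficiency and to reconstruction artifacts that do not align with real-world geometry.
In this paper, we propose VideoScene, which distills the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool that bridges the gap from video to 3D.
Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information, and we train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference.
Extensive experiments show that VideoScene achieves faster and superior 3D scene generation than previous video diffusion models, highlighting its potential as an efficient tool for future video-to-3D applications.
Project page: https://hanyang-21.github.io/VideoScene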
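
The summary does not spell out the formulation, so the following is only a toy sketch of the "leap" idea under stated assumptions, not the authors' implementation. It assumes a linear flow-style interpolation x_t = (1 - t) * x0 + t * eps, a single Euler step from the leap timestep t to 0, and a hand-written rule standing in for the trained denoising policy network. The names `add_noise`, `one_step_denoise`, `choose_leap_timestep`, `leap_generate`, the `confidence` score, and the bounds `t_min` and `t_max` are all hypothetical.

```python
import random


def add_noise(x0, eps, t):
    """Assumed flow-style interpolation between a clean signal and noise."""
    return [(1.0 - t) * a + t * e for a, e in zip(x0, eps)]


def one_step_denoise(x_t, t, velocity_fn):
    """Single Euler step from timestep t straight to t = 0."""
    v = velocity_fn(x_t, t)
    return [a - t * b for a, b in zip(x_t, v)]


def choose_leap_timestep(confidence, t_min=0.2, t_max=0.9):
    """Hypothetical stand-in for a learned policy network:
    the less the coarse input is trusted, the larger the leap timestep."""
    return min(t_max, max(t_min, 1.0 - confidence))


def leap_generate(coarse, eps, confidence, velocity_fn):
    """Start from a noised coarse estimate instead of pure noise (t = 1)."""
    t = choose_leap_timestep(confidence)
    x_t = add_noise(coarse, eps, t)
    return t, one_step_denoise(x_t, t, velocity_fn)


# Toy data: a "true" signal and a coarse, imperfect estimate of it.
rng = random.Random(0)
target = [1.0, 2.0, 3.0, 4.0]
coarse = [1.5, 1.5, 3.5, 3.5]
eps = [rng.gauss(0.0, 1.0) for _ in target]


def oracle_velocity(x_t, t):
    """Idealized velocity (eps - target); a real model would predict this."""
    return [e - c for e, c in zip(eps, target)]


t, out = leap_generate(coarse, eps, confidence=0.4, velocity_fn=oracle_velocity)
```

With this idealized velocity the result works out algebraically to (1 - t) * coarse + t * target, so in the toy setting the leap timestep controls how much of the coarse input is retained versus how much is replaced by the generative prior. That trade-off is one plausible reason a per-input choice of timestep could matter.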

Summary (original)

Recovering 3D scenes from sparse views is a challenging task due to its inherent ill-posed problem. Conventional methods have developed specialized solutions (e.g., geometry regularization or feed-forward deterministic model) to mitigate the issue. However, they still suffer from performance degradation by minimal overlap across input views with insufficient visual information. Fortunately, recent video generative models show promise in addressing this challenge as they are capable of generating video clips with plausible 3D structures. Powered by large pretrained video diffusion models, some pioneering research start to explore the potential of video generative prior and create 3D scenes from sparse views. Despite impressive improvements, they are limited by slow inference time and the lack of 3D constraint, leading to inefficiencies and reconstruction artifacts that do not align with real-world geometry structure. In this paper, we propose VideoScene to distill the video diffusion model to generate 3D scenes in one step, aiming to build an efficient and effective tool to bridge the gap from video to 3D. Specifically, we design a 3D-aware leap flow distillation strategy to leap over time-consuming redundant information and train a dynamic denoising policy network to adaptively determine the optimal leap timestep during inference. Extensive experiments demonstrate that our VideoScene achieves faster and superior 3D scene generation results than previous video diffusion models, highlighting its potential as an efficient tool for future video to 3D applications. Project Page: https://hanyang-21.github.io/VideoScene

arXiv information

Authors	Hanyang Wang, Fangfu Liu, Jiawei Chi, Yueqi Duan
Published	2025-04-02 17:59:21+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
