Depth Any Video with Scalable Synthetic Data

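Summary

Video depth estimation has long been hindered by the scarcity of consistent, scalable ground-truth data, which has led to inconsistent and unreliable results.
This paper introduces Depth Any Video, a model that addresses this challenge through two key innovations.
First, the authors develop a scalable synthetic data pipeline that captures real-time video depth data from diverse synthetic environments, yielding 40,000 five-second video clips, each with precise depth annotations.
Second, they leverage the strong priors of generative video diffusion models to handle real-world videos effectively, integrating techniques such as rotary position encoding and flow matching to further improve flexibility and efficiency.
Unlike previous models, which are limited to fixed-length video sequences, the approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames.
At inference, they propose a depth interpolation method that lets the model infer high-resolution video depth across sequences of up to 150 frames.
The model outperforms all previous generative depth models in spatial accuracy and temporal consistency.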

Summary (original)

Video depth estimation has long been hindered by the scarcity of consistent and scalable ground truth data, leading to inconsistent and unreliable results. In this paper, we introduce Depth Any Video, a model that tackles the challenge through two key innovations. First, we develop a scalable synthetic data pipeline, capturing real-time video depth data from diverse synthetic environments, yielding 40,000 video clips of 5-second duration, each with precise depth annotations. Second, we leverage the powerful priors of generative video diffusion models to handle real-world videos effectively, integrating advanced techniques such as rotary position encoding and flow matching to further enhance flexibility and efficiency. Unlike previous models, which are limited to fixed-length video sequences, our approach introduces a novel mixed-duration training strategy that handles videos of varying lengths and performs robustly across different frame rates, even on single frames. At inference, we propose a depth interpolation method that enables our model to infer high-resolution video depth across sequences of up to 150 frames. Our model outperforms all previous generative depth models in terms of spatial accuracy and temporal consistency.
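
The abstract names flow matching but does not say which formulation the model uses. As a rough illustration only, the sketch below assumes the common straight-line (rectified-flow style) variant: a noise sample and a data sample are linearly interpolated, the regression target is their difference, and sampling integrates the learned velocity with a few Euler steps. The function names, the toy three-element "depth latent", and the oracle velocity are all hypothetical and are not taken from the paper.

```python
import random


def flow_matching_pair(x0, x1, t):
    """Build one training pair on a straight path from noise x0 to data x1.

    Returns the interpolated point x_t = (1 - t) * x0 + t * x1 and the
    constant target velocity x1 - x0 that a network would regress to.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def flow_matching_loss(predicted, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    rng = random.Random(0)
    depth_latent = [0.2, 0.5, 0.9]                 # toy stand-in for a depth latent
    noise = [rng.gauss(0.0, 1.0) for _ in depth_latent]

    # One training pair at a random time step.
    x_t, target = flow_matching_pair(noise, depth_latent, rng.random())

    # An oracle that already knows the true velocity stands in for a trained network.
    recovered = euler_sample(lambda x, t: target, noise, num_steps=4)
    print("loss of oracle:", flow_matching_loss(target, target))
    print("recovered latent:", recovered)
```

Because the assumed path is a straight line, the true velocity is constant in time, which is why a handful of Euler steps is enough in this toy setting; a trained network only approximates that velocity, so the step count becomes a speed versus quality trade-off.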

arXiv information

Authors	Honghui Yang, Di Huang, Wei Yin, Chunhua Shen, Haifeng Liu, Xiaofei He, Binbin Lin, Wanli Ouyang, Tong He
Published	2024-10-14 17:59:46+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
