TesserAct: Learning 4D Embodied World Models

Summary

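This paper presents an effective approach for learning a new kind of 4D embodied world model, which predicts how a 3D scene evolves over time in response to an embodied agent's actions while remaining spatially and temporally consistent.
We propose learning the 4D world model by training on RGB-DN (RGB, depth, and surface normal) videos.
Because its predictions incorporate detailed shape, configuration, and temporal change, the model not only surpasses traditional 2D models but also makes it possible to learn accurate inverse dynamics models for an embodied agent.
Specifically, we first extend existing robotic manipulation video datasets with depth and normal information obtained from off-the-shelf models.
Next, we fine-tune a video generation model on this annotated dataset so that it jointly predicts RGB-DN (RGB, depth, and normal) for each frame.
We then present an algorithm that converts the generated RGB, depth, and normal videos directly into a high-quality 4D scene of the world.
Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and supports policy learning that significantly outperforms policies derived from prior video-based world models.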
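
To give a rough sense of what turning depth video into a scene over time involves, here is a minimal sketch of standard pinhole back-projection applied frame by frame. It is not the conversion algorithm from the paper, which the summary does not detail. The function names and the intrinsics `fx`, `fy`, `cx`, `cy` are illustrative assumptions, and the sketch ignores normals, camera motion, and any cross-frame consistency step.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]
DepthMap = List[List[float]]  # rows of per-pixel depth values


def backproject_depth(depth: DepthMap, fx: float, fy: float,
                      cx: float, cy: float) -> List[Point]:
    """Lift one depth map to 3D points in the camera frame.

    Assumes a pinhole camera (hypothetical intrinsics): a pixel (u, v)
    with depth z maps to x = (u - cx) * z / fx, y = (v - cy) * z / fy.
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def depth_video_to_point_clouds(depth_video: List[DepthMap], fx: float,
                                fy: float, cx: float,
                                cy: float) -> List[List[Point]]:
    """Back-project every frame independently: one point cloud per time step."""
    return [backproject_depth(frame, fx, fy, cx, cy) for frame in depth_video]


if __name__ == "__main__":
    # Two tiny 2x2 depth frames; the zero depth marks an invalid pixel.
    video = [
        [[1.0, 2.0], [0.0, 4.0]],
        [[1.0, 1.0], [1.0, 1.0]],
    ]
    clouds = depth_video_to_point_clouds(video, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
    for t, cloud in enumerate(clouds):
        print(f"frame {t}: {len(cloud)} points")
```

A sequence of per-frame point clouds like this is only a starting point: independently predicted depth maps can drift in scale or flicker between frames, which is the kind of inconsistency that predicted normals and a dedicated reconstruction step would be expected to address.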

Summary (Original)

This paper presents an effective approach for learning novel 4D embodied world models, which predict the dynamic evolution of 3D scenes over time in response to an embodied agent’s actions, providing both spatial and temporal consistency. We propose to learn a 4D world model by training on RGB-DN (RGB, Depth, and Normal) videos. This not only surpasses traditional 2D models by incorporating detailed shape, configuration, and temporal changes into their predictions, but also allows us to effectively learn accurate inverse dynamic models for an embodied agent. Specifically, we first extend existing robotic manipulation video datasets with depth and normal information leveraging off-the-shelf models. Next, we fine-tune a video generation model on this annotated dataset, which jointly predicts RGB-DN (RGB, Depth, and Normal) for each frame. We then present an algorithm to directly convert generated RGB, Depth, and Normal videos into a high-quality 4D scene of the world. Our method ensures temporal and spatial coherence in 4D scene predictions from embodied scenarios, enables novel view synthesis for embodied environments, and facilitates policy learning that significantly outperforms those derived from prior video-based world models.

arXiv Information

Authors	Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, Chuang Gan
Published	2025-04-29 17:59:30+00:00
arXiv site	arxiv_id(pdf)

Sources and Services Used

arxiv.jp, Google
