From Single Images to Motion Policies via Video-Generation Environment Representations

Summary

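Autonomous robots typically need to build a representation of their surroundings and adapt their motion to the geometry of the environment.
Here, we address the problem of building a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image.
Extracting 3D structure from a single image often involves monocular depth estimation.
Advances in depth estimation have produced large pre-trained models such as DepthAnything.
However, using the outputs of these models for downstream motion generation is difficult because of the frustum-shaped errors that arise.
Instead, we propose a framework called Video-Generation Environment Representation (VGER), which leverages advances in large-scale video generation models to generate a moving-camera video conditioned on the input image.
The frames of this video, which form a multiview dataset, are fed into a pre-trained 3D foundation model to produce a dense point cloud.
We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of that representation.
We evaluate VGER extensively on a diverse set of indoor and outdoor environments.
We demonstrate its ability to produce smooth motions that account for the captured geometry of the scene, all from a single RGB input image.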
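
The frustum-shaped errors mentioned above follow from how a depth map is lifted to 3D. The sketch below is a minimal illustration of pinhole back-projection, not code from the paper: the function name `backproject`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the synthetic flat depth map are all assumptions chosen for the example. Each pixel is lifted along its own camera ray, so a depth error slides the point along that ray and the displaced points fan out from the camera center.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H*W, 3) point cloud in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column / row indices
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Hypothetical intrinsics and a synthetic wall 2 m in front of the camera.
fx = fy = 50.0
cx, cy = 32.0, 24.0
true_depth = np.full((48, 64), 2.0)
scaled_depth = true_depth * 1.5  # a 50% depth overestimate, for illustration only

pts_true = backproject(true_depth, fx, fy, cx, cy)
pts_bad = backproject(scaled_depth, fx, fy, cx, cy)

# The lateral extent of the cloud grows with the depth error,
# because every point moves outward along its own viewing ray.
width_true = np.ptp(pts_true[:, 0])
width_bad = np.ptp(pts_bad[:, 0])
```

In this toy case the erroneous cloud is wider by the same factor as the depth error, which is the frustum-like spreading that makes such point clouds awkward for collision checking.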

Abstract (original)

Autonomous robots typically need to construct representations of their surroundings and adapt their motions to the geometry of their environment. Here, we tackle the problem of constructing a policy model for collision-free motion generation, consistent with the environment, from a single input RGB image. Extracting 3D structures from a single image often involves monocular depth estimation. Developments in depth estimation have given rise to large pre-trained models such as DepthAnything. However, using outputs of these models for downstream motion generation is challenging due to frustum-shaped errors that arise. Instead, we propose a framework known as Video-Generation Environment Representation (VGER), which leverages the advances of large-scale video generation models to generate a moving camera video conditioned on the input image. Frames of this video, which form a multiview dataset, are then input into a pre-trained 3D foundation model to produce a dense point cloud. We then introduce a multi-scale noise approach to train an implicit representation of the environment structure and build a motion generation model that complies with the geometry of the representation. We extensively evaluate VGER over a diverse set of indoor and outdoor environments. We demonstrate its ability to produce smooth motions that account for the captured geometry of a scene, all from a single RGB input image.
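
The abstract does not spell out the multi-scale noise approach, so the following is only one plausible reading, offered as an assumption rather than the authors' procedure: perturb points of the dense cloud with Gaussian noise at several scales and pair each perturbed location with its distance to the nearest cloud point, giving regression targets for an implicit distance-like model. The function `multiscale_samples`, the `sigmas` values, and the planar toy cloud are all hypothetical.

```python
import numpy as np

def multiscale_samples(points, sigmas, rng):
    """Perturb surface points with Gaussian noise at several scales.

    Returns query locations and their distance to the nearest cloud point,
    which could serve as regression targets for an implicit distance model.
    """
    queries = np.concatenate(
        [points + s * rng.standard_normal(points.shape) for s in sigmas]
    )
    # Brute-force nearest neighbour; fine for a small illustrative cloud.
    diffs = queries[:, None, :] - points[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1).min(axis=1)
    return queries, dists

rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(200, 3))
cloud[:, 2] = 0.0  # hypothetical planar surface at z = 0
sigmas = [0.01, 0.1, 1.0]  # assumed noise scales, fine to coarse

queries, dists = multiscale_samples(cloud, sigmas, rng)
```

Small scales place samples close to the surface, where geometric detail matters most, while large scales cover free space farther away. A real pipeline would replace the brute-force search with a spatial index and feed the pairs to a learned model.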

arXiv information

Authors	Weiming Zhi, Ziyong Ma, Tianyi Zhang, Matthew Johnson-Roberson
Published	2025-05-25 20:30:25+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
