Can Generative Video Models Help Pose Estimation?

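Summary

Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision.
Existing methods, even those trained on large-scale datasets, struggle in these scenarios because identifiable correspondences and visual overlap are missing.
Inspired by the human ability to infer spatial relationships from diverse scenes, we propose InterPose, a novel approach that leverages the rich priors encoded in pre-trained generative video models.
We use a video model to hallucinate intermediate frames between the two input images, creating a dense visual transition that significantly simplifies the pose estimation problem.
Because current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates how consistent the pose predictions from sampled videos are.
We demonstrate that our approach generalizes across three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R on four diverse datasets covering indoor, outdoor, and object-centric scenes.
Our findings suggest a promising avenue for improving pose estimation models: leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data.
See the project page for results: https://inter-pose.github.io/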

Summary (original)

Pairwise pose estimation from images with little or no overlap is an open challenge in computer vision. Existing methods, even those trained on large-scale datasets, struggle in these scenarios due to the lack of identifiable correspondences or visual overlap. Inspired by the human ability to infer spatial relationships from diverse scenes, we propose a novel approach, InterPose, that leverages the rich priors encoded within pre-trained generative video models. We propose to use a video model to hallucinate intermediate frames between two input images, effectively creating a dense, visual transition, which significantly simplifies the problem of pose estimation. Since current video models can still produce implausible motion or inconsistent geometry, we introduce a self-consistency score that evaluates the consistency of pose predictions from sampled videos. We demonstrate that our approach generalizes among three state-of-the-art video models and show consistent improvements over the state-of-the-art DUSt3R on four diverse datasets encompassing indoor, outdoor, and object-centric scenes. Our findings suggest a promising avenue for improving pose estimation models by leveraging large generative models trained on vast amounts of video data, which is more readily available than 3D data. See our project page for results: https://inter-pose.github.io/.
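
The abstract does not spell out how the self-consistency score is computed. As a rough illustration only, one plausible reading is to estimate a relative rotation from each sampled video and keep the estimate that agrees best with the others (a medoid under geodesic rotation distance). The function names, the use of mean pairwise angle as the score, and the toy rotations below are all assumptions made for this sketch, not the authors' method.

```python
import numpy as np


def rotation_angle_deg(r_a, r_b):
    """Geodesic distance in degrees between two 3x3 rotation matrices."""
    cos_theta = (np.trace(r_a.T @ r_b) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))


def select_most_consistent(rotations):
    """Return (index, score) of the candidate closest on average to the rest.

    `rotations` holds one hypothetical relative-rotation estimate per sampled
    video. The score is the mean angular distance to the other candidates,
    so lower means more consistent. This scoring rule is an assumption.
    """
    n = len(rotations)
    if n == 0:
        raise ValueError("need at least one candidate rotation")
    if n == 1:
        return 0, 0.0
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = rotation_angle_deg(rotations[i], rotations[j])
            dist[i, j] = dist[j, i] = d
    scores = dist.sum(axis=1) / (n - 1)
    best = int(np.argmin(scores))
    return best, float(scores[best])


def rot_z(deg):
    """Rotation about the z-axis by `deg` degrees (toy data helper)."""
    t = np.radians(deg)
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


if __name__ == "__main__":
    # Four roughly agreeing estimates and one outlier from a "bad" video.
    candidates = [rot_z(a) for a in (28, 30, 31, 35, 120)]
    index, score = select_most_consistent(candidates)
    print(f"selected candidate {index}, mean distance {score:.2f} deg")
```

A medoid is used here rather than an average because it always returns one of the actual predictions and is not pulled toward an outlier produced by a geometrically inconsistent video.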

arXiv information

Authors	Ruojin Cai, Jason Y. Zhang, Philipp Henzler, Zhengqi Li, Noah Snavely, Ricardo Martin-Brualla
Published	2024-12-20 18:58:24+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
