RayTran: 3D pose estimation and shape reconstruction of multiple objects from videos with ray-traced transformers

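Abstract

We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos.
It represents its knowledge in two alternative ways: as a global 3D grid of features and as an array of view-specific 2D grids.
A dedicated bidirectional attention mechanism progressively exchanges information between the two.
We exploit knowledge of the image formation process to significantly sparsify the attention weight matrix, which makes the architecture feasible on current hardware in terms of both memory and computation.
To detect the objects in the scene and predict their 3D pose and 3D shape, we attach a DETR-style head on top of the 3D feature grid.
Compared to previous methods, our architecture is single-stage and end-to-end trainable, and it can reason holistically about a scene from multiple video frames without a brittle tracking step.
We evaluate the method on the challenging Scan2CAD dataset, where it outperforms (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos, and
(2) a strong alternative method that combines multi-view stereo with RGB-D CAD alignment.
We plan to release our source code.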

Abstract (original)

We propose a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. We evaluate our method on the challenging Scan2CAD dataset, where we outperform (1) recent state-of-the-art methods for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo with RGB-D CAD alignment. We plan to release our source code.
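The sparsification idea in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes a pinhole camera with known intrinsics `K` and pose `(R, t)`, and it uses a deliberately simplified rule in which each voxel may exchange attention only with the single 2D feature cell its center projects into. The function names (`voxel_centers`, `sparse_attention_pairs`) and all parameter values are hypothetical.

```python
import numpy as np


def voxel_centers(grid_shape, voxel_size, origin):
    """Return (N, 3) world-space centers of a regular voxel grid."""
    axes = [np.arange(n) for n in grid_shape]
    idx = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    return np.asarray(origin, dtype=float) + (idx + 0.5) * voxel_size


def sparse_attention_pairs(points, K, R, t, image_hw, feat_hw):
    """Return (voxel_index, cell_index) pairs that are allowed to attend.

    Assumption for this sketch: a voxel is connected only to the 2D feature
    cell that its center projects into; everything else is masked out.
    """
    cam = points @ R.T + t                     # world -> camera coordinates
    depth = cam[:, 2]
    in_front = depth > 1e-6                    # discard points behind the camera
    safe_depth = np.where(in_front, depth, 1.0)
    uv = (cam @ K.T)[:, :2] / safe_depth[:, None]  # pinhole projection to pixels

    H, W = image_hw
    h, w = feat_hw
    col = np.floor(uv[:, 0] * w / W)           # pixel -> feature-grid column
    row = np.floor(uv[:, 1] * h / H)           # pixel -> feature-grid row
    valid = in_front & (col >= 0) & (col < w) & (row >= 0) & (row < h)

    vox = np.nonzero(valid)[0]
    cell = (row[valid] * w + col[valid]).astype(int)
    return vox, cell


# Toy setup: a 4x4x4 voxel grid in front of an identity-pose camera.
points = voxel_centers((4, 4, 4), voxel_size=0.5, origin=(-1.0, -1.0, 1.0))
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 64.0],
              [0.0, 0.0, 1.0]])
vox, cell = sparse_attention_pairs(points, K, np.eye(3), np.zeros(3),
                                   image_hw=(128, 128), feat_hw=(8, 8))

dense_pairs = len(points) * 8 * 8              # every voxel with every 2D cell
print(f"dense pairs: {dense_pairs}, sparse pairs: {len(vox)}")
```

In this toy setup the dense attention matrix would hold one weight for every voxel and feature-cell combination, while the projection-based mask keeps at most one entry per voxel and view. The method described in the abstract may connect voxels and pixels differently; the sketch only shows why camera geometry makes most entries of the attention matrix unnecessary.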

arXiv information

Authors	Michał J. Tyszkiewicz, Kevis-Kokitsi Maninis, Stefan Popov, Vittorio Ferrari
Published	2022-08-26 08:18:52+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
