MOTPose: Multi-object 6D Pose Estimation for Dynamic Video Sequences using Attention-based Temporal Fusion

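Summary

Cluttered bin-picking environments are challenging for pose estimation models.
Despite the impressive progress enabled by deep learning, single-view RGB pose estimation models perform poorly in cluttered, dynamic environments.
Incorporating the rich temporal information contained in videos of a scene has the potential to strengthen a model's ability to cope with the adverse effects of occlusion and the dynamic nature of the environment.
Moreover, joint object detection and pose estimation models are well suited to exploiting the co-dependent nature of the two tasks to improve the accuracy of both.
To this end, we propose attention-based temporal fusion for multi-object 6D pose estimation, which accumulates information across multiple frames of a video sequence.
The MOTPose method takes a sequence of images as input and performs joint object detection and pose estimation for all objects in a single forward pass.
It learns to aggregate both object embeddings and object parameters over multiple time steps using cross-attention-based fusion modules.
We evaluate the method on the physically realistic cluttered bin-picking dataset SynPick and on the YCB-Video dataset, and demonstrate improved pose estimation accuracy as well as improved object detection accuracy.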
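
The cross-attention-based aggregation over time steps mentioned above can be illustrated with a small sketch. This is not the authors' architecture, which this page does not describe in detail; it is a generic single-head cross-attention in NumPy, assuming the current frame's object embeddings act as queries and the embeddings from earlier frames act as keys and values. The function name `cross_attention_fuse`, the residual connection, the tensor shapes, and the random projection matrices (stand-ins for learned weights) are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention_fuse(current, history, w_q, w_k, w_v):
    """Fuse current-frame object embeddings with earlier-frame embeddings.

    current: (num_objects, dim) embeddings of the current frame (queries).
    history: (num_frames, num_objects, dim) embeddings of earlier frames.
    w_q, w_k, w_v: (dim, dim) projections; learned in a real model,
        random stand-ins in this sketch (an assumption).
    Returns (fused, weights), where fused has shape (num_objects, dim) and
    weights has shape (num_objects, num_frames * num_objects).
    """
    dim = current.shape[-1]
    # Flatten all earlier frames into one memory of key/value tokens.
    memory = history.reshape(-1, dim)
    q = current @ w_q
    k = memory @ w_k
    v = memory @ w_v
    # Scaled dot-product attention of each current object over the memory.
    weights = softmax(q @ k.T / np.sqrt(dim), axis=-1)
    # Residual connection (an assumption): keep the current-frame
    # embedding and add the attended temporal context.
    fused = current + weights @ v
    return fused, weights


rng = np.random.default_rng(0)
num_frames, num_objects, dim = 3, 4, 8
history = rng.normal(size=(num_frames, num_objects, dim))
current = rng.normal(size=(num_objects, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))

fused, weights = cross_attention_fuse(current, history, w_q, w_k, w_v)
print(fused.shape, weights.shape)  # (4, 8) (4, 12)
```

Each row of `weights` sums to one, so every fused embedding is the current embedding plus a convex combination of projected embeddings from the earlier frames. A trained model would learn the projections end to end and would typically use multiple heads.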

Abstract (original)

Cluttered bin-picking environments are challenging for pose estimation models. Despite the impressive progress enabled by deep learning, single-view RGB pose estimation models perform poorly in cluttered dynamic environments. Imbuing the rich temporal information contained in the video of scenes has the potential to enhance models' ability to deal with the adverse effects of occlusion and the dynamic nature of the environments. Moreover, joint object detection and pose estimation models are better suited to leverage the co-dependent nature of the tasks for improving the accuracy of both tasks. To this end, we propose attention-based temporal fusion for multi-object 6D pose estimation that accumulates information across multiple frames of a video sequence. Our MOTPose method takes a sequence of images as input and performs joint object detection and pose estimation for all objects in one forward pass. It learns to aggregate both object embeddings and object parameters over multiple time steps using cross-attention-based fusion modules. We evaluate our method on the physically-realistic cluttered bin-picking dataset SynPick and the YCB-Video dataset and demonstrate improved pose estimation accuracy as well as better object detection accuracy.

arXiv information

Authors	Arul Selvam Periyasamy, Sven Behnke
Published	2024-03-14 11:59:32+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
