Efficient Multi-Object Pose Estimation using Multi-Resolution Deformable Attention and Query Aggregation

Abstract

Object pose estimation is a long-standing problem in computer vision. Recently, attention-based vision transformer models have achieved state-of-the-art results in many computer vision applications. Exploiting the permutation-invariant nature of the attention mechanism, a family of vision transformer models formulate multi-object pose estimation as a set prediction problem. However, existing vision transformer models for multi-object pose estimation rely exclusively on the attention mechanism. Convolutional neural networks, on the other hand, hard-wire various inductive biases into their architecture. In this paper, we investigate incorporating inductive biases in vision transformer models for multi-object pose estimation, which facilitates learning long-range dependencies while circumventing the costly global attention. In particular, we use multi-resolution deformable attention, where the attention operation is performed only between a few deformed reference points. Furthermore, we propose a query aggregation mechanism that enables increasing the number of object queries without increasing the computational complexity. We evaluate the proposed model on the challenging YCB-Video dataset and report state-of-the-art results.
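
To make the idea of attending only to a few deformed reference points concrete, the following is a minimal sketch of multi-resolution deformable attention for a single query and a single feature channel. It is not the authors' implementation: the function names (`bilinear_sample`, `deformable_attention`), the pixel-unit offsets, and the choice to pass offsets and attention scores in as plain arguments are all assumptions made for illustration. In a trained model these values would typically be predicted from the query by learned layers.

```python
import math


def bilinear_sample(feature_map, x, y):
    """Sample a 2D grid (list of rows) at continuous pixel coordinates.

    Coordinates outside the grid are clamped to the border.
    """
    height, width = len(feature_map), len(feature_map[0])
    x = min(max(x, 0.0), width - 1.0)
    y = min(max(y, 0.0), height - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    wx, wy = x - x0, y - y0
    top = (1 - wx) * feature_map[y0][x0] + wx * feature_map[y0][x1]
    bottom = (1 - wx) * feature_map[y1][x0] + wx * feature_map[y1][x1]
    return (1 - wy) * top + wy * bottom


def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def deformable_attention(feature_maps, reference_point, offsets, scores):
    """Aggregate a few sampled values per resolution level for one query.

    feature_maps:    list of 2D grids, one per resolution level
    reference_point: (x, y) normalized to [0, 1], shared across levels
    offsets:         offsets[level][k] = (dx, dy) in pixels of that level
                     (hypothetical inputs; normally predicted from the query)
    scores:          scores[level][k] = unnormalized attention score
    """
    # Normalize attention weights jointly over all levels and sample points.
    weights = softmax([s for level_scores in scores for s in level_scores])
    output, index = 0.0, 0
    for level, feature_map in enumerate(feature_maps):
        height, width = len(feature_map), len(feature_map[0])
        # Map the normalized reference point into this level's pixel grid.
        ref_x = reference_point[0] * (width - 1)
        ref_y = reference_point[1] * (height - 1)
        for dx, dy in offsets[level]:
            value = bilinear_sample(feature_map, ref_x + dx, ref_y + dy)
            output += weights[index] * value
            index += 1
    return output
```

In this sketch the work per query scales with the number of sampled points (levels times points per level) rather than with the number of pixels in the feature maps, which is the property that lets deformable attention avoid the cost of global attention.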

arXiv information

Authors	Arul Selvam Periyasamy, Vladimir Tsaturyan, Sven Behnke
Published	2023-12-13 16:30:00+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
