State Space Model Meets Transformer: A New Paradigm for 3D Object Detection

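Summary

DETR-based methods, which iteratively refine object queries with multi-layer transformer decoders, have shown promising performance in 3D indoor object detection.
However, the scene point features in the transformer decoder stay fixed, so later decoder layers contribute little, which limits further performance gains.
Recently, State Space Models (SSMs) have shown efficient context modeling with linear complexity through iterative interactions between system states and inputs.
Inspired by SSMs, we propose DEST, a new 3D object DEtection paradigm built on an interactive STate space model.
In the interactive SSM, we design a novel state-dependent SSM parameterization that lets system states serve effectively as queries in 3D indoor detection tasks.
In addition, we introduce four key designs tailored to the characteristics of point clouds and SSMs. The serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM.
The inter-state attention mechanism models the relationships between state points, and the gated feed-forward network enhances inter-channel correlations.
To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, so that scene point features and query features are updated simultaneously with linear complexity.
Extensive experiments on two challenging datasets demonstrate the effectiveness of the DEST-based method.
Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets.
Based on the VDETR baseline, our method sets a new SOTA on the ScanNet V2 and SUN RGB-D datasets.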
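
The phrase "iterative interactions between system states and inputs" can be made concrete with the standard discrete linear SSM recurrence, h_t = A h_{t-1} + B x_t and y_t = C h_t. The sketch below is only that textbook recurrence with fixed matrices, not the state-dependent parameterization proposed in DEST; the function name `ssm_scan` and the toy matrices are hypothetical and chosen for illustration. It shows why the cost grows linearly with sequence length: each input is visited once.

```python
import numpy as np

def ssm_scan(A, B, C, inputs, h0=None):
    """Run h_t = A h_{t-1} + B x_t and y_t = C h_t over a sequence.

    Textbook time-invariant SSM (an assumption for illustration);
    DEST's state-dependent parameterization is not reproduced here.
    """
    h = np.zeros(A.shape[0]) if h0 is None else np.asarray(h0, dtype=float)
    outputs = []
    for x in inputs:  # a single pass over the sequence: O(L) steps
        h = A @ h + B @ x       # the state absorbs the current input
        outputs.append(C @ h)   # read out from the updated state
    return np.stack(outputs), h

# Toy example: a 2-dimensional state that decays by half at each step.
A = 0.5 * np.eye(2)
B = np.eye(2)
C = np.ones((1, 2))
xs = np.ones((3, 2))
ys, h_final = ssm_scan(A, B, C, xs)
```

In the framing of the summary, `h` plays the role of the system state (the queries) and each `x` is one scene point feature fed in as a system input.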

Abstract (original)

DETR-based methods, which use multi-layer transformer decoders to refine object queries iteratively, have shown promising performance in 3D indoor object detection. However, the scene point features in the transformer decoder remain fixed, leading to minimal contributions from later decoder layers, thereby limiting performance improvement. Recently, State Space Models (SSMs) have shown efficient context modeling ability with linear complexity through iterative interactions between system states and inputs. Inspired by SSMs, we propose a new 3D object DEtection paradigm with an interactive STate space model (DEST). In the interactive SSM, we design a novel state-dependent SSM parameterization method that enables system states to effectively serve as queries in 3D indoor detection tasks. In addition, we introduce four key designs tailored to the characteristics of point clouds and SSMs: the serialization and bidirectional scanning strategies enable bidirectional feature interaction among scene points within the SSM. The inter-state attention mechanism models the relationships between state points, while the gated feed-forward network enhances inter-channel correlations. To the best of our knowledge, this is the first method to model queries as system states and scene points as system inputs, which can simultaneously update scene point features and query features with linear complexity. Extensive experiments on two challenging datasets demonstrate the effectiveness of our DEST-based method. Our method improves the GroupFree baseline in terms of AP50 on the ScanNet V2 (+5.3) and SUN RGB-D (+3.2) datasets. Based on the VDETR baseline, our method sets a new SOTA on the ScanNet V2 and SUN RGB-D datasets.
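
The serialization and bidirectional scanning idea can be sketched as follows: an unordered point cloud is first turned into a sequence, then scanned in both directions so every point receives context from both sides. The abstract does not say which ordering or scan DEST uses, so everything here is an assumption for illustration: the Z-order (Morton) sort, the scalar-decay scan, and the names `morton_code`, `serialize`, `decay_scan`, and `bidirectional_scan` are all hypothetical.

```python
import numpy as np

def morton_code(grid, bits=10):
    """Interleave the bits of integer (x, y, z) coordinates (Z-order)."""
    code = np.zeros(len(grid), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            code |= ((grid[:, axis] >> b) & 1) << (3 * b + axis)
    return code

def serialize(points, voxel=1.0, bits=10):
    """Return an ordering of the points along a Z-order curve (assumed)."""
    grid = np.floor((points - points.min(axis=0)) / voxel).astype(np.int64)
    grid = np.clip(grid, 0, 2**bits - 1)
    return np.argsort(morton_code(grid, bits), kind="stable")

def decay_scan(feats, decay):
    """Causal scan h_t = decay * h_{t-1} + x_t (a scalar-decay SSM)."""
    feats = np.asarray(feats, dtype=float)
    out = np.empty_like(feats)
    h = np.zeros(feats.shape[1])
    for t, x in enumerate(feats):
        h = decay * h + x
        out[t] = h
    return out

def bidirectional_scan(points, feats, decay=0.5, voxel=1.0):
    """Scan the serialized sequence forward and backward, then merge."""
    order = serialize(points, voxel)
    seq = np.asarray(feats, dtype=float)[order]
    fwd = decay_scan(seq, decay)
    bwd = decay_scan(seq[::-1], decay)[::-1]
    mixed = fwd + bwd - seq  # each point's own feature is counted once
    out = np.empty_like(mixed)
    out[order] = mixed       # scatter back to the original point order
    return out

# Toy example: three points on the x-axis, given out of order.
points = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
feats = np.array([[4.0], [1.0], [2.0]])
order = serialize(points)
mixed = bidirectional_scan(points, feats)
```

A single causal scan would only let a point see the points that precede it in the serialized order; summing a forward and a backward pass is one simple way to give every point context from both directions while keeping the cost linear in the number of points.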

arXiv information

Authors	Chuxin Wang, Wenfei Yang, Xiang Liu, Tianzhu Zhang
Published	2025-03-19 14:10:18+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
