CAVER: Cross-Modal View-Mixed Transformer for Bi-Modal Salient Object Detection

Abstract

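Most existing bi-modal (RGB-D and RGB-T) salient object detection (SOD) methods rely on convolution and build complex interwoven fusion structures to integrate cross-modal information.
The inherent local connectivity of convolution places a ceiling on the performance of these convolution-based methods.
This work rethinks these tasks from the perspective of global information alignment and transformation.
Specifically, the proposed \underline{c}ross-mod\underline{a}l \underline{v}iew-mixed transform\underline{er} (CAVER) cascades several cross-modal integration units to construct a top-down, transformer-based information propagation path.
CAVER treats multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism.
In addition, because attention has quadratic complexity with respect to the number of input tokens, the authors design a parameter-free, patch-wise token re-embedding strategy to simplify operations.
Extensive experiments on RGB-D and RGB-T SOD datasets show that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when equipped with the proposed components.
Code and pretrained models are available at \href{https://github.com/lartpang/CAVER}{the link}.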

Abstract (original)

Most of the existing bi-modal (RGB-D and RGB-T) salient object detection methods utilize the convolution operation and construct complex interweave fusion structures to achieve cross-modal information integration. The inherent local connectivity of the convolution operation constrains the performance of the convolution-based methods to a ceiling. In this work, we rethink these tasks from the perspective of global information alignment and transformation. Specifically, the proposed \underline{c}ross-mod\underline{a}l \underline{v}iew-mixed transform\underline{er} (CAVER) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path. CAVER treats the multi-scale and multi-modal feature integration as a sequence-to-sequence context propagation and update process built on a novel view-mixed attention mechanism. Besides, considering the quadratic complexity w.r.t. the number of input tokens, we design a parameter-free patch-wise token re-embedding strategy to simplify operations. Extensive experimental results on RGB-D and RGB-T SOD datasets demonstrate that such a simple two-stream encoder-decoder framework can surpass recent state-of-the-art methods when it is equipped with the proposed components. Code and pretrained models will be available at \href{https://github.com/lartpang/CAVER}{the link}.
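
The abstract gives no detail on the re-embedding strategy, so the following is only a minimal sketch of the general idea it names: regrouping tokens patch by patch, without any learned weights, so that fewer tokens enter the quadratic attention step. The function names `patchwise_reembed` and `score_count`, the nested-list feature layout, and the choice to concatenate the channels of each patch are all assumptions made for illustration; this is not the authors' implementation, which is in the linked repository.

```python
def patchwise_reembed(feature_map, patch):
    """Merge each patch x patch block of tokens into one longer token.

    feature_map is a list of H rows, each a list of W tokens, each token a
    list of C floats. The result has (H/patch) * (W/patch) tokens of length
    C * patch * patch. Only concatenation is used, so no parameters are
    introduced (an illustrative assumption, not the paper's exact scheme).
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            merged = []
            for dy in range(patch):
                for dx in range(patch):
                    merged.extend(feature_map[top + dy][left + dx])
            tokens.append(merged)
    return tokens


def score_count(height, width, patch=1):
    """Number of pairwise attention scores after patch-wise regrouping."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    n_tokens = (height // patch) * (width // patch)
    return n_tokens * n_tokens  # quadratic in the number of tokens


if __name__ == "__main__":
    # Toy 4 x 4 feature map with 2 channels per token: token at (y, x) is [y, x].
    fmap = [[[y, x] for x in range(4)] for y in range(4)]
    tokens = patchwise_reembed(fmap, patch=2)
    print(len(tokens), len(tokens[0]))
    print(score_count(4, 4, patch=1), score_count(4, 4, patch=2))
```

In this toy setting, a patch size of p cuts the token count by a factor of p squared, and therefore the number of pairwise scores by a factor of p to the fourth power, while each token grows by a factor of p squared in length.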

arXiv information

Authors	Youwei Pang, Xiaoqi Zhao, Lihe Zhang, Huchuan Lu
Published	2023-02-16 13:19:12+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
