Unleashing the Power of CNN and Transformer for Balanced RGB-Event Video Recognition

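Summary

Pattern recognition based on RGB-Event data is an emerging research topic, and previous works usually learn their features using either a CNN or a Transformer.
CNNs capture local features well, while cascaded self-attention mechanisms are good at extracting long-range global relations.
Combining the two for high-performance RGB-Event based video recognition is intuitive, but existing works fail to achieve a good balance between accuracy and the number of model parameters, as shown in Fig.~\ref{firstimage}.
This work proposes TSCFormer, a novel RGB-Event based recognition framework built as a relatively lightweight CNN-Transformer model.
Specifically, a CNN serves as the main backbone network and first encodes both the RGB and the Event data.
Meanwhile, global tokens are initialized as the input and fused with the RGB and Event features using the BridgeFormer module.
This captures the global long-range relations between the two modalities well while keeping the overall model architecture simple.
The enhanced features are then projected and fused into the RGB and Event CNN blocks, respectively, in an interactive manner using the F2E and F2V modules.
Similar operations are applied to the other CNN blocks to achieve adaptive fusion and local-global feature enhancement at different resolutions.
Finally, these three features are concatenated and fed into the classification head for pattern recognition.
Extensive experiments on two large-scale RGB-Event benchmark datasets (PokerEvent and HARDVS) validate the effectiveness of the proposed TSCFormer.
The source code and pre-trained models will be released at https://github.com/Event-AHU/TSCFormer.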
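
The data flow described above (global tokens fused with both modalities, enhanced features projected back into each CNN stream, the same step repeated at several resolutions, then concatenation of three features for classification) can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`cross_attention`, `bridge_step`, `downsample`, `forward`), the single-head attention without learned projections, the pair-averaging stand-in for a CNN block, and all shapes are hypothetical and do not come from the paper or its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context):
    """Single-head scaled dot-product attention with no learned weights
    (an assumption made to keep the sketch short)."""
    d = queries.shape[-1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context


def downsample(feat):
    """Hypothetical stand-in for a CNN block: halves the number of
    spatial positions by averaging adjacent pairs."""
    n, d = feat.shape
    return feat.reshape(n // 2, 2, d).mean(axis=1)


def bridge_step(tokens, rgb_feat, evt_feat):
    """One fusion step: tokens read from both modalities, then each
    modality reads back from the enhanced tokens."""
    # Global tokens attend over RGB and Event positions jointly.
    context = np.concatenate([rgb_feat, evt_feat], axis=0)
    tokens = tokens + cross_attention(tokens, context)
    # Stand-ins for the F2E / F2V projection modules (details assumed).
    rgb_feat = rgb_feat + cross_attention(rgb_feat, tokens)
    evt_feat = evt_feat + cross_attention(evt_feat, tokens)
    return tokens, rgb_feat, evt_feat


def forward(tokens, rgb_feat, evt_feat, head_weight, num_stages=3):
    """Repeat the fusion at several resolutions, then classify."""
    for _ in range(num_stages):
        rgb_feat, evt_feat = downsample(rgb_feat), downsample(evt_feat)
        tokens, rgb_feat, evt_feat = bridge_step(tokens, rgb_feat, evt_feat)
    # Concatenate the three pooled features and apply a linear head.
    pooled = np.concatenate(
        [tokens.mean(axis=0), rgb_feat.mean(axis=0), evt_feat.mean(axis=0)]
    )
    return softmax(pooled @ head_weight), tokens, rgb_feat, evt_feat


rng = np.random.default_rng(0)
T, D, N, NUM_CLASSES = 4, 16, 64, 5  # illustrative sizes only
probs, tokens, rgb, evt = forward(
    tokens=rng.normal(size=(T, D)),
    rgb_feat=rng.normal(size=(N, D)),
    evt_feat=rng.normal(size=(N, D)),
    head_weight=rng.normal(size=(3 * D, NUM_CLASSES)) * 0.1,
)
print(probs.shape, rgb.shape, evt.shape)
```

The point of the sketch is the cost structure: the token set stays small (here 4 tokens), so attention cost grows linearly with the number of spatial positions rather than quadratically, which is one plausible reading of how a CNN-Transformer hybrid can stay relatively lightweight.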

Summary (original)

Pattern recognition based on RGB-Event data is a newly arising research topic and previous works usually learn their features using CNN or Transformer. As we know, CNN captures the local features well and the cascaded self-attention mechanisms are good at extracting the long-range global relations. It is intuitive to combine them for high-performance RGB-Event based video recognition, however, existing works fail to achieve a good balance between the accuracy and model parameters, as shown in Fig.~\ref{firstimage}. In this work, we propose a novel RGB-Event based recognition framework termed TSCFormer, which is a relatively lightweight CNN-Transformer model. Specifically, we mainly adopt the CNN as the backbone network to first encode both RGB and Event data. Meanwhile, we initialize global tokens as the input and fuse them with RGB and Event features using the BridgeFormer module. It captures the global long-range relations well between both modalities and maintains the simplicity of the whole model architecture at the same time. The enhanced features will be projected and fused into the RGB and Event CNN blocks, respectively, in an interactive manner using F2E and F2V modules. Similar operations are conducted for other CNN blocks to achieve adaptive fusion and local-global feature enhancement under different resolutions. Finally, we concatenate these three features and feed them into the classification head for pattern recognition. Extensive experiments on two large-scale RGB-Event benchmark datasets (PokerEvent and HARDVS) fully validated the effectiveness of our proposed TSCFormer. The source code and pre-trained models will be released at https://github.com/Event-AHU/TSCFormer.

arXiv information

Authors	Xiao Wang,Yao Rong,Shiao Wang,Yuan Chen,Zhe Wu,Bo Jiang,Yonghong Tian,Jin Tang
Published	2025-04-18 10:03:55+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
