SpikeVideoFormer: An Efficient Spike-Driven Video Transformer with Hamming Attention and $\mathcal{O}(T)$ Complexity

Abstract

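Spiking Neural Networks (SNNs) have shown performance competitive with Artificial Neural Networks (ANNs) on a range of vision tasks while offering superior energy efficiency.
However, existing SNN-based Transformers focus mainly on single-image tasks: they emphasize spatial features and do not effectively exploit the efficiency of SNNs in video-based vision tasks.
This paper introduces SpikeVideoFormer, an efficient spike-driven video Transformer with linear temporal complexity $\mathcal{O}(T)$.
Specifically, we design spike-driven Hamming attention (SDHA), a theoretically guided adaptation of traditional real-valued attention to spike-driven attention.
Building on SDHA, we further analyze several spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance on video tasks while maintaining only linear temporal complexity.
The generalization ability and efficiency of the model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation.
Empirical results show that our method achieves state-of-the-art (SOTA) performance compared with existing SNN approaches, with over 15% improvement on the latter two tasks.
Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$, and $\times 5$ improvements on the three tasks.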
https://github.com/JimmyZou/SpikeVideoFormer
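
The abstract names Hamming attention but does not give its formula, so the following is only a minimal sketch of the underlying idea, not the paper's SDHA definition. It assumes queries and keys are binary spike vectors and compares a plain dot product with a normalized Hamming similarity (the fraction of positions at which two vectors agree). All function names here are hypothetical and chosen for illustration.

```python
from typing import Sequence


def dot_similarity(q: Sequence[int], k: Sequence[int]) -> int:
    """Count positions where both binary spike vectors fire (value 1)."""
    return sum(a * b for a, b in zip(q, k))


def hamming_similarity(q: Sequence[int], k: Sequence[int]) -> float:
    """Fraction of positions at which two binary spike vectors agree."""
    if len(q) != len(k) or not q:
        raise ValueError("spike vectors must be non-empty and equally long")
    matches = sum(1 for a, b in zip(q, k) if a == b)
    return matches / len(q)


def hamming_similarity_via_dot(q: Sequence[int], k: Sequence[int]) -> float:
    """Same quantity written with dot products only: q.k + (1-q).(1-k)."""
    both_fire = sum(a * b for a, b in zip(q, k))
    both_silent = sum((1 - a) * (1 - b) for a, b in zip(q, k))
    return (both_fire + both_silent) / len(q)


# Illustrative data (an assumption, not taken from the paper).
query = [1, 0, 1, 1, 0, 0, 0, 0]
key_identical = [1, 0, 1, 1, 0, 0, 0, 0]
key_all_ones = [1, 1, 1, 1, 1, 1, 1, 1]

if __name__ == "__main__":
    for name, key in [("identical", key_identical), ("all ones", key_all_ones)]:
        print(name, dot_similarity(query, key), hamming_similarity(query, key))
```

On this toy data the dot product scores both keys as 3, so it cannot tell an exact match from an always-firing key, while the Hamming similarity is 1.0 for the identical key and 0.375 for the all-ones key. This is one plausible reason to prefer a Hamming-based score for binary spikes; the abstract itself only states that SDHA is a theoretically guided adaptation.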

Abstract (original)

Spiking Neural Networks (SNNs) have shown competitive performance to Artificial Neural Networks (ANNs) in various vision tasks, while offering superior energy efficiency. However, existing SNN-based Transformers primarily focus on single-image tasks, emphasizing spatial features while not effectively leveraging SNNs’ efficiency in video-based vision tasks. In this paper, we introduce SpikeVideoFormer, an efficient spike-driven video Transformer, featuring linear temporal complexity $\mathcal{O}(T)$. Specifically, we design a spike-driven Hamming attention (SDHA) which provides a theoretically guided adaptation from traditional real-valued attention to spike-driven attention. Building on SDHA, we further analyze various spike-driven space-time attention designs and identify an optimal scheme that delivers appealing performance for video tasks, while maintaining only linear temporal complexity. The generalization ability and efficiency of our model are demonstrated across diverse downstream video tasks, including classification, human pose tracking, and semantic segmentation. Empirical results show our method achieves state-of-the-art (SOTA) performance compared to existing SNN approaches, with over 15\% improvement on the latter two tasks. Additionally, it matches the performance of recent ANN-based methods while offering significant efficiency gains, achieving $\times 16$, $\times 10$ and $\times 5$ improvements on the three tasks. https://github.com/JimmyZou/SpikeVideoFormer
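
The abstract claims linear temporal complexity $\mathcal{O}(T)$ but does not say how it is obtained. One well-known route, shown here purely as an assumption, is that attention without a softmax can be reassociated from $(QK^\top)V$ to $Q(K^\top V)$, which replaces the $N \times N$ score matrix with a $d \times d$ one. With $N = T \cdot S$ tokens ($T$ frames, $S$ spatial tokens per frame), the first ordering grows quadratically in $T$ and the second linearly. The helper names and toy data below are hypothetical.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain integer matrix product of a (n x m) and b (m x p)."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [[sum(row[i] * b[i][j] for i in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def attention_quadratic(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """(Q K^T) V: builds an N x N score matrix first."""
    return matmul(matmul(q, transpose(k)), v)


def attention_linear(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Q (K^T V): builds a d x d context matrix first."""
    return matmul(q, matmul(transpose(k), v))


def multiplies_quadratic(n: int, d: int) -> int:
    return 2 * n * n * d  # N*N*d for the scores, N*N*d for scores @ V


def multiplies_linear(n: int, d: int) -> int:
    return 2 * n * d * d  # d*d*N for K^T V, N*d*d for Q @ context


def toy_spikes(n: int, d: int, seed: int) -> Matrix:
    """Deterministic binary pattern standing in for spike tensors."""
    return [[(seed + 3 * i + 5 * j) % 2 for j in range(d)] for i in range(n)]


if __name__ == "__main__":
    frames, spatial, dim = 4, 2, 2          # assumed toy sizes
    n = frames * spatial
    q, k, v = toy_spikes(n, dim, 0), toy_spikes(n, dim, 1), toy_spikes(n, dim, 2)
    print(attention_quadratic(q, k, v) == attention_linear(q, k, v))
    print(multiplies_quadratic(n, dim), multiplies_linear(n, dim))
```

Both orderings return the same matrix because matrix multiplication is associative; only the cost differs. Under this multiply count, doubling the number of frames quadruples the cost of the first ordering and doubles the cost of the second. Whether SpikeVideoFormer reaches $\mathcal{O}(T)$ this way is not stated in the abstract.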

arXiv information

Authors	Shihao Zou, Qingfeng Li, Wei Ji, Jingjing Li, Yongkui Yang, Guoqi Li, Chao Dong
Published	2025-05-15 14:43:35+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
