Adaptively Bypassing Vision Transformer Blocks for Efficient Visual Tracking

Summary

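Transformer-based models have substantially advanced visual tracking.
However, current trackers are slow, which limits their applicability on devices with constrained computational resources.
To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypasses transformer blocks to achieve efficient visual tracking.
The rationale behind ABTrack rests on the observation that semantic features or relations do not affect the tracking task uniformly across all abstraction levels.
Instead, their impact varies with the characteristics of the target and the scene it occupies.
Consequently, disregarding semantic features or relations that are insignificant at a given abstraction level may not significantly affect tracking accuracy.
We propose a Bypass Decision Module (BDM) that determines whether a transformer block should be bypassed, which adaptively simplifies the ViT architecture and speeds up inference.
To offset the time cost incurred by the BDMs and further improve the efficiency of ViTs, we introduce a novel ViT pruning method that reduces the dimension of the latent token representations in each transformer block.
Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance.
Code is released at https://github.com/xyyang317/ABTrack.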
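
The bypass idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the gate design (a linear layer followed by a sigmoid), the 0.5 threshold, and the names `bypass_probability` and `run_with_bypass` are all assumptions made for illustration, and the "blocks" are toy functions standing in for transformer blocks.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Block = Callable[[Vector], Vector]
Gate = Tuple[Sequence[float], float]  # (weights, bias) of a hypothetical gate


def bypass_probability(token: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical gate: linear layer followed by a sigmoid (an assumption)."""
    score = sum(w * x for w, x in zip(weights, token)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def run_with_bypass(
    token: Vector,
    blocks: Sequence[Block],
    gates: Sequence[Gate],
    threshold: float = 0.5,  # assumed decision threshold
) -> Tuple[Vector, List[int]]:
    """Apply each block unless its gate votes to skip it.

    Returns the final features and the indices of the blocks that ran.
    """
    executed: List[int] = []
    for index, (block, (weights, bias)) in enumerate(zip(blocks, gates)):
        if bypass_probability(token, weights, bias) > threshold:
            continue  # bypass: features pass through unchanged
        token = block(token)
        executed.append(index)
    return token, executed


# Toy stand-ins for transformer blocks.
def add_one(x: Vector) -> Vector:
    return [v + 1.0 for v in x]


def double(x: Vector) -> Vector:
    return [v * 2.0 for v in x]


blocks = [add_one, double, add_one]
# Zero weights make each gate depend only on its bias:
# a large negative bias keeps the block, a large positive bias skips it.
gates = [([0.0, 0.0], -5.0), ([0.0, 0.0], 5.0), ([0.0, 0.0], -5.0)]
features, executed = run_with_bypass([1.0, 2.0], blocks, gates)
```

In this toy run the second gate votes to bypass, so only blocks 0 and 2 execute. In a real tracker the gate weights would be learned and the input would be token features from the preceding block; each gate also adds a small cost of its own, which is the overhead the summary says the pruning method is meant to offset.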

Summary (original)

Empowered by transformer-based models, visual tracking has advanced significantly. However, the slow speed of current trackers limits their applicability on devices with constrained computational resources. To address this challenge, we introduce ABTrack, an adaptive computation framework that adaptively bypassing transformer blocks for efficient visual tracking. The rationale behind ABTrack is rooted in the observation that semantic features or relations do not uniformly impact the tracking task across all abstraction levels. Instead, this impact varies based on the characteristics of the target and the scene it occupies. Consequently, disregarding insignificant semantic features or relations at certain abstraction levels may not significantly affect the tracking accuracy. We propose a Bypass Decision Module (BDM) to determine if a transformer block should be bypassed, which adaptively simplifies the architecture of ViTs and thus speeds up the inference process. To counteract the time cost incurred by the BDMs and further enhance the efficiency of ViTs, we introduce a novel ViT pruning method to reduce the dimension of the latent representation of tokens in each transformer block. Extensive experiments on multiple tracking benchmarks validate the effectiveness and generality of the proposed method and show that it achieves state-of-the-art performance. Code is released at: https://github.com/xyyang317/ABTrack.

arXiv information

Authors	Xiangyang Yang, Dan Zeng, Xucheng Wang, You Wu, Hengzhou Ye, Qijun Zhao, Shuiwang Li
Published	2024-07-01 12:03:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
