PPT: Token Pruning and Pooling for Efficient Vision Transformers

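Abstract

Vision Transformers (ViTs) have emerged as powerful models in computer vision, delivering strong performance across a wide range of vision tasks.
However, their high computational complexity is a significant barrier to practical use in real-world scenarios.
Because not all tokens contribute equally to the final prediction, and fewer tokens mean lower computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers.
We argue, however, that it is suboptimal to reduce only inattentive redundancy through token pruning, or only duplicative redundancy through token merging.
To this end, this paper proposes a new acceleration framework, token Pruning & Pooling Transformers (PPT), that adaptively tackles these two types of redundancy in different layers.
By heuristically integrating both token pruning and token pooling into ViTs without additional trainable parameters, PPT effectively reduces model complexity while maintaining predictive accuracy.
For example, PPT reduces the FLOPs of DeiT-S by over 37% and improves its throughput by over 45%, with no accuracy drop on the ImageNet dataset.
The code is available at https://github.com/xjwu1024/PPT and https://github.com/mindspore-lab/models/.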

Abstract (original)

Vision Transformers (ViTs) have emerged as powerful models in the field of computer vision, delivering superior performance across various vision tasks. However, the high computational complexity poses a significant barrier to their practical applications in real-world scenarios. Motivated by the fact that not all tokens contribute equally to the final predictions and fewer tokens bring less computational cost, reducing redundant tokens has become a prevailing paradigm for accelerating vision transformers. However, we argue that it is not optimal to either only reduce inattentive redundancy by token pruning, or only reduce duplicative redundancy by token merging. To this end, in this paper we propose a novel acceleration framework, namely token Pruning & Pooling Transformers (PPT), to adaptively tackle these two types of redundancy in different layers. By heuristically integrating both token pruning and token pooling techniques in ViTs without additional trainable parameters, PPT effectively reduces the model complexity while maintaining its predictive accuracy. For example, PPT reduces over 37% FLOPs and improves the throughput by over 45% for DeiT-S without any accuracy drop on the ImageNet dataset. The code is available at https://github.com/xjwu1024/PPT and https://github.com/mindspore-lab/models/
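
The two operations named in the abstract can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation (the repositories linked above hold that). The function names, the per-token importance score, the greedy cosine-similarity merge, and the variance threshold used to choose between pruning and pooling are all assumptions made for illustration. The abstract itself only states that the two techniques are combined heuristically and without extra trainable parameters.

```python
import math


def prune_tokens(tokens, scores, keep):
    """Token pruning: keep the `keep` highest-scoring tokens in their original order."""
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore positional order
    return [tokens[i] for i in kept]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pool_tokens(tokens, merges):
    """Token pooling: up to `merges` times, average the most similar pair into one token.

    Simplification (assumption): greedy all-pairs search with an unweighted average.
    """
    tokens = [list(t) for t in tokens]  # copy so the input is not mutated
    for _ in range(merges):
        if len(tokens) < 2:
            break
        _, i, j = max(
            (cosine(tokens[i], tokens[j]), i, j)
            for i in range(len(tokens))
            for j in range(i + 1, len(tokens))
        )
        tokens[i] = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
        del tokens[j]  # j > i, so index i is unaffected
    return tokens


def reduce_tokens(tokens, scores, n_remove, var_threshold=0.01):
    """Hypothetical selection rule: prune when scores are spread out, pool otherwise.

    `var_threshold` is an illustrative value, not one taken from the source.
    """
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    if variance > var_threshold:
        keep = max(len(tokens) - n_remove, 1)
        return prune_tokens(tokens, scores, keep)
    return pool_tokens(tokens, n_remove)


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.5, 0.5]]
    print(reduce_tokens(toy_tokens, [0.4, 0.1, 0.3, 0.2], n_remove=2))
    print(reduce_tokens(toy_tokens, [0.25, 0.25, 0.25, 0.25], n_remove=1))
```

In this sketch, widely spread scores are read as a sign that some tokens are clearly unimportant and can be dropped, while flat scores leave similarity as the only signal, so near-duplicate tokens are merged instead. Both paths use only quantities that could be computed from the tokens themselves, which is consistent with the abstract's claim of no additional trainable parameters. A practical version would operate on batched tensors and would likely replace the cubic-time pair search with a cheaper matching step.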

arXiv information

Authors	Xinjian Wu, Fanhu Zeng, Xiudong Wang, Yunhe Wang, Xinghao Chen
Published	2024-01-17 14:04:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
