SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer

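Summary

High-resolution images let neural networks learn richer visual representations.
That gain comes at the cost of higher computational complexity, which hinders use in latency-sensitive applications.
Because not all pixels are equally important, skipping computation for less important regions is a simple and effective way to reduce that cost.
For CNNs, however, this is hard to turn into actual speedup, because it breaks the regularity of the dense convolution workload.
This paper introduces SparseViT, which revisits activation sparsity for recent window-based vision transformers (ViTs).
Because window attention is naturally batched over blocks, pruning window activations yields real speedup: roughly 50% lower latency at 60% sparsity.
Different layers should be assigned different pruning ratios, since they differ in sensitivity and computational cost.
The authors introduce sparsity-aware adaptation and apply evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space.
SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x over its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.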
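
To make the window pruning idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes that a window's importance is the L2 magnitude of its activations and that a fixed fraction of windows is dropped per layer; the names `window_importance` and `prune_windows` are hypothetical. The point it illustrates is that the kept windows are simply gathered into a smaller batch, so a batched window-attention step would process fewer windows without any irregular sparse kernel.

```python
import math


def window_importance(window):
    """L2 magnitude of one window's activations (assumed importance score)."""
    return math.sqrt(sum(v * v for token in window for v in token))


def prune_windows(windows, sparsity):
    """Drop the least important windows and return (kept_indices, kept_windows).

    windows:  list of windows, each a list of token vectors (lists of floats).
    sparsity: fraction of windows to drop, in [0, 1).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_keep = max(1, round(len(windows) * (1.0 - sparsity)))
    scores = [window_importance(w) for w in windows]
    # Rank windows from most to least important.
    ranked = sorted(range(len(windows)), key=lambda i: scores[i], reverse=True)
    # Restore the original order so kept windows can be scattered back later.
    kept = sorted(ranked[:num_keep])
    return kept, [windows[i] for i in kept]


# Five toy windows with one 2-dimensional token each.
windows = [
    [[3.0, 0.0]],
    [[0.5, 0.0]],
    [[2.0, 0.0]],
    [[0.1, 0.0]],
    [[1.0, 0.0]],
]
kept_indices, kept_windows = prune_windows(windows, sparsity=0.6)
print(kept_indices)  # [0, 2]
```

At 60% sparsity only two of the five toy windows reach the attention step. How closely latency tracks the number of kept windows depends on the hardware and on the layers that are not window-based, which this sketch does not model.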

Abstract (original)

High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to be translated into actual speedup for CNNs since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT that revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., ~50% latency reduction with 60% sparsity. Different layers should be assigned with different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply the evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy.
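
The abstract does not spell out how the evolutionary search is run, so the following is only a generic sketch of searching per-layer sparsity ratios under a latency budget. Everything in it is an assumption made for illustration: the `CHOICES` grid, the linear `latency` cost model, the linear `accuracy_drop` proxy, and all function names and hyperparameters. In the paper, candidate configurations are evaluated with a model trained by sparsity-aware adaptation rather than with a closed-form proxy.

```python
import random

# Allowed per-layer sparsity ratios (hypothetical grid).
CHOICES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]


def latency(config, costs):
    """Toy cost model (assumption): a layer's cost scales with its kept fraction."""
    return sum(c * (1.0 - s) for c, s in zip(costs, config))


def accuracy_drop(config, sensitivities):
    """Toy proxy (assumption): accuracy drop grows linearly with sparsity."""
    return sum(k * s for k, s in zip(sensitivities, config))


def sample_feasible(rng, costs, budget):
    """Rejection-sample a random configuration that meets the latency budget."""
    while True:
        config = tuple(rng.choice(CHOICES) for _ in costs)
        if latency(config, costs) <= budget:
            return config


def mutate(rng, config, prob=0.3):
    """Resample each layer's sparsity with probability `prob`."""
    return tuple(rng.choice(CHOICES) if rng.random() < prob else s for s in config)


def crossover(rng, a, b):
    """Pick each layer's sparsity from one of the two parents."""
    return tuple(rng.choice(pair) for pair in zip(a, b))


def evolutionary_search(costs, sensitivities, budget,
                        population=32, parents=8, generations=20, seed=0):
    """Return the feasible configuration with the lowest proxy accuracy drop."""
    if latency([max(CHOICES)] * len(costs), costs) > budget:
        raise ValueError("budget is unreachable even at maximum sparsity")
    rng = random.Random(seed)

    def fitness(config):
        return accuracy_drop(config, sensitivities)

    pop = [sample_feasible(rng, costs, budget) for _ in range(population)]
    for _ in range(generations):
        elite = sorted(pop, key=fitness)[:parents]
        children = []
        while len(children) < population - parents:
            if rng.random() < 0.5:
                child = mutate(rng, rng.choice(elite))
            else:
                child = crossover(rng, *rng.sample(elite, 2))
            if latency(child, costs) <= budget:  # discard infeasible children
                children.append(child)
        pop = elite + children
    return min(pop, key=fitness)


# Four toy layers: early layers are expensive but tolerate pruning well.
costs = [4.0, 3.0, 2.0, 1.0]
sensitivities = [0.1, 0.5, 1.0, 2.0]
budget = 0.6 * sum(costs)
best = evolutionary_search(costs, sensitivities, budget)
print(best, latency(best, costs), accuracy_drop(best, sensitivities))
```

Under these toy assumptions the search tends to place high sparsity on layers that are expensive and insensitive, which is the intuition behind assigning different pruning ratios to different layers.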

arXiv information

Authors	Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, Song Han
Published	2023-03-30 17:59:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
