CAIT: Triple-Win Compression towards High Accuracy, Fast Inference, and Favorable Transferability For ViTs

Abstract

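Vision Transformers (ViTs) have recently emerged as state-of-the-art models for a wide range of vision tasks.
However, their heavy computational cost remains a major obstacle on resource-limited devices.
Researchers have therefore focused on compressing redundant information in ViTs to accelerate them.
Existing methods, however, generally either drop redundant image tokens sparsely through token pruning or remove channels aggressively through channel pruning, which yields a sub-optimal balance between model performance and inference speed.
They are also at a disadvantage when the compressed model is transferred to downstream vision tasks that require the spatial structure of the image, such as semantic segmentation.
To address these issues, we propose CAIT, a joint compression method for ViTs that delivers both high accuracy and fast inference while maintaining favorable transferability to downstream tasks.
Specifically, we introduce an asymmetric token merging (ATME) strategy that effectively integrates neighboring tokens.
It compresses redundant token information while preserving the spatial structure of the image.
We further employ a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs.
With CDCP, unimportant channels in the multi-head self-attention modules of ViTs can be pruned uniformly, which greatly enhances model compression.
Extensive experiments on benchmark datasets demonstrate that the proposed method achieves state-of-the-art performance across various ViTs.
For example, the pruned DeiT-Tiny and DeiT-Small achieve speedups of 1.7$\times$ and 1.9$\times$, respectively, with no drop in accuracy on ImageNet.
On the ADE20k segmentation dataset, our method achieves speedups of up to 1.31$\times$ with comparable mIoU.
Our code will be made publicly available.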
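
The abstract does not spell out how asymmetric token merging works, so the following is only a rough illustration of the general idea of merging neighboring tokens while keeping a regular grid. It assumes that tokens are merged along one axis only (here, horizontally adjacent pairs) by simple averaging; the function name `merge_horizontal_neighbors`, the averaging rule, and the toy grid are all hypothetical and are not taken from the paper.

```python
def merge_horizontal_neighbors(grid):
    """Average each pair of horizontally adjacent tokens.

    grid: list of rows; each row is a list of tokens; each token is a
    list of floats. Returns a grid with the same number of rows and half
    as many columns, so the result is still a regular 2D layout.
    NOTE: plain averaging is an assumption made for illustration only.
    """
    merged = []
    for row in grid:
        if len(row) % 2 != 0:
            raise ValueError("row width must be even to merge in pairs")
        new_row = []
        for j in range(0, len(row), 2):
            left, right = row[j], row[j + 1]
            # Element-wise mean of the two neighboring token vectors.
            new_row.append([(a + b) / 2.0 for a, b in zip(left, right)])
        merged.append(new_row)
    return merged


# Hypothetical 2x4 grid of 2-dimensional tokens.
grid = [
    [[0.0, 0.0], [2.0, 2.0], [4.0, 0.0], [6.0, 2.0]],
    [[1.0, 1.0], [3.0, 3.0], [5.0, 1.0], [7.0, 3.0]],
]
merged = merge_horizontal_neighbors(grid)  # 2x2 grid of merged tokens
```

Because every row shrinks by the same factor, the merged tokens can still be reshaped into a feature map, which is the property that dense tasks such as semantic segmentation rely on. Sparse token dropping does not, in general, leave a regular grid behind.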

Abstract (original)

Vision Transformers (ViTs) have emerged as state-of-the-art models for various vision tasks recently. However, their heavy computation costs remain daunting for resource-limited devices. Consequently, researchers have dedicated themselves to compressing redundant information in ViTs for acceleration. However, they generally sparsely drop redundant image tokens by token pruning or brutally remove channels by channel pruning, leading to a sub-optimal balance between model performance and inference speed. They are also disadvantageous in transferring compressed models to downstream vision tasks that require the spatial structure of images, such as semantic segmentation. To tackle these issues, we propose a joint compression method for ViTs that offers both high accuracy and fast inference speed, while also maintaining favorable transferability to downstream tasks (CAIT). Specifically, we introduce an asymmetric token merging (ATME) strategy to effectively integrate neighboring tokens. It can successfully compress redundant token information while preserving the spatial structure of images. We further employ a consistent dynamic channel pruning (CDCP) strategy to dynamically prune unimportant channels in ViTs. Thanks to CDCP, insignificant channels in multi-head self-attention modules of ViTs can be pruned uniformly, greatly enhancing the model compression. Extensive experiments on benchmark datasets demonstrate that our proposed method can achieve state-of-the-art performance across various ViTs. For example, our pruned DeiT-Tiny and DeiT-Small achieve speedups of 1.7$\times$ and 1.9$\times$, respectively, without accuracy drops on ImageNet. On the ADE20k segmentation dataset, our method can enjoy up to 1.31$\times$ speedups with comparable mIoU. Our code will be publicly available.
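
As a loose illustration of what pruning channels "uniformly" across attention heads could look like, the sketch below selects one shared set of channel indices and applies it to every head, so all heads keep the same width. The scoring rule (summing per-head importance scores), the function names `consistent_channel_mask` and `prune_heads`, and the toy numbers are assumptions made for this example; the abstract does not describe the actual CDCP criterion.

```python
def consistent_channel_mask(head_scores, keep):
    """Choose the same `keep` channel indices for every head.

    head_scores: one list of per-channel importance scores per head.
    Returns the sorted indices of channels retained in all heads.
    NOTE: summing scores across heads is a hypothetical criterion.
    """
    num_channels = len(head_scores[0])
    if any(len(scores) != num_channels for scores in head_scores):
        raise ValueError("all heads must have the same number of channels")
    if not 0 < keep <= num_channels:
        raise ValueError("keep must be between 1 and the channel count")
    # Aggregate importance per channel index across heads.
    totals = [sum(scores[c] for scores in head_scores)
              for c in range(num_channels)]
    ranked = sorted(range(num_channels), key=lambda c: totals[c], reverse=True)
    return sorted(ranked[:keep])


def prune_heads(head_features, kept):
    """Keep only the selected channel indices in every head."""
    return [[features[c] for c in kept] for features in head_features]


# Hypothetical scores for 2 heads with 4 channels each.
scores = [[0.9, 0.1, 0.5, 0.2],
          [0.8, 0.3, 0.1, 0.6]]
kept = consistent_channel_mask(scores, keep=2)
pruned = prune_heads([[10, 11, 12, 13], [20, 21, 22, 23]], kept)
```

Sharing one index set means every head ends up with identical dimensions, so the pruned attention module can still be computed as a single batched matrix multiplication rather than head by head.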

arXiv information

Authors	Ao Wang, Hui Chen, Zijia Lin, Sicheng Zhao, Jungong Han, Guiguang Ding
Published	2023-09-27 16:12:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
