Compress image to patches for Vision Transformer

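Summary

The Vision Transformer (ViT) has made significant strides in the field of computer vision.
However, as model depth and input image resolution increase, the computational cost of training and running ViT models has risen sharply. This paper proposes CI2P-ViT, a hybrid model based on a CNN and the Vision Transformer.
The model incorporates a module called CI2P, which compresses the image with the CompressAI encoder and then generates a sequence of patches through a series of convolutions.
CI2P can replace the Patch Embedding component of a ViT model, enabling seamless integration into existing ViT models. Compared to ViT-B/16, CI2P-ViT reduces the number of patches fed to the self-attention layer to a quarter of the original.
This design not only significantly reduces the computational cost of the ViT model but also improves its accuracy by introducing the inductive bias of CNNs.
When trained from scratch on the Animals-10 dataset, CI2P-ViT achieved an accuracy of 92.37%, a 3.3% improvement over the ViT-B/16 baseline.
In addition, the model's computational cost, measured in floating-point operations (FLOPs), fell by 63.35%, and training speed doubled on identical hardware.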
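
To make the "quarter of the patches" claim concrete, the short sketch below works through the token arithmetic. The 224x224 input resolution is an assumption (the abstract does not state it), and the helper names `num_patches` and `attention_pairs` are hypothetical. The class token is ignored for simplicity.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    return (image_size // patch_size) ** 2


def attention_pairs(n_tokens: int) -> int:
    """Token pairs scored by one self-attention head (quadratic in tokens)."""
    return n_tokens * n_tokens


# Assumption: 224x224 input, the usual ViT-B/16 setting (not stated in the abstract).
baseline_tokens = num_patches(224, 16)   # 14 * 14 = 196
reduced_tokens = baseline_tokens // 4    # a quarter of the original: 49

# Ratio of pairwise attention scores after the reduction.
pair_ratio = attention_pairs(reduced_tokens) / attention_pairs(baseline_tokens)

print(baseline_tokens, reduced_tokens, pair_ratio)  # 196 49 0.0625
```

Under these assumptions the pairwise attention term shrinks to 1/16 of its original size. That term is only one part of the total cost, so it is not directly comparable to the 63.35% overall FLOPs reduction reported in the abstract.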

Abstract (original)

The Vision Transformer (ViT) has made significant strides in the field of computer vision. However, as the depth of the model and the resolution of the input images increase, the computational cost associated with training and running ViT models has surged dramatically. This paper proposes a hybrid model based on CNN and Vision Transformer, named CI2P-ViT. The model incorporates a module called CI2P, which utilizes the CompressAI encoder to compress images and subsequently generates a sequence of patches through a series of convolutions. CI2P can replace the Patch Embedding component in the ViT model, enabling seamless integration into existing ViT models. Compared to ViT-B/16, CI2P-ViT has the number of patches input to the self-attention layer reduced to a quarter of the original. This design not only significantly reduces the computational cost of the ViT model but also effectively enhances the model’s accuracy by introducing the inductive bias properties of CNN. The ViT model’s precision is markedly enhanced. When trained from the ground up on the Animals-10 dataset, CI2P-ViT achieved an accuracy rate of 92.37%, representing a 3.3% improvement over the ViT-B/16 baseline. Additionally, the model’s computational operations, measured in floating-point operations per second (FLOPs), were diminished by 63.35%, and it exhibited a 2-fold increase in training velocity on identical hardware configurations.
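
The idea of swapping Patch Embedding for a "compress, then convolve" module can be sketched in PyTorch as follows. This is not the authors' architecture: the real CI2P uses the CompressAI encoder, whereas this sketch substitutes a hypothetical stand-in of four stride-2 convolutions. The class name `CompressedPatchEmbed`, the channel widths, the kernel sizes, and the 224x224 input are all assumptions chosen so that the output has 49 tokens, a quarter of the 196 that ViT-B/16 would produce at that resolution.

```python
import torch
from torch import nn


class CompressedPatchEmbed(nn.Module):
    """Hypothetical drop-in for ViT patch embedding (illustrative only)."""

    def __init__(self, embed_dim: int = 768, latent_channels: int = 192):
        super().__init__()
        # Stand-in for a learned image-compression encoder (assumption):
        # four stride-2 convolutions downsample the image by 16x overall.
        channels = [3, 128, 128, 128, latent_channels]
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers.append(nn.Conv2d(c_in, c_out, kernel_size=5, stride=2, padding=2))
            layers.append(nn.GELU())
        self.encoder = nn.Sequential(*layers)
        # One more stride-2 convolution maps the latent to the token dimension,
        # halving each spatial side again (32x total downsampling).
        self.to_patches = nn.Conv2d(
            latent_channels, embed_dim, kernel_size=3, stride=2, padding=1
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        latent = self.encoder(images)             # (B, C, H/16, W/16)
        features = self.to_patches(latent)        # (B, D, H/32, W/32)
        return features.flatten(2).transpose(1, 2)  # (B, N, D) token sequence


# Usage: a batch of two 224x224 RGB images yields 7 * 7 = 49 tokens each.
embed = CompressedPatchEmbed()
with torch.no_grad():
    tokens = embed(torch.zeros(2, 3, 224, 224))
print(tuple(tokens.shape))  # (2, 49, 768)
```

Because the output is a `(batch, tokens, embed_dim)` sequence, a module of this shape could feed a standard Transformer encoder in place of the usual patch projection, which is the kind of integration the abstract describes.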

arXiv information

Authors	Xinfeng Zhao, Yaoru Sun
Published	2025-02-14 12:40:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
