Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Summary

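Masked Image Modeling (MIM) is a recent self-supervised pre-training paradigm for vision that uses the Vision Transformer (ViT).
Previous methods are either pixel-based, using the original pixels, or token-based, using discrete visual tokens from a parametric tokenizer model.
The proposed approach, CCViT, uses k-means clustering to obtain centroids for image modeling, without supervised training of a tokenizer model.
The centroids represent both patch pixels and index tokens, and have the property of local invariance.
The non-parametric centroid tokenizer takes only seconds to create and is faster at token inference.
Specifically, the method adopts patch masking and centroid replacement strategies to construct corrupted inputs, and uses two stacked encoder blocks to predict the corrupted patch tokens and reconstruct the original patch pixels.
Experiments show that a ViT-B model with only 300 epochs achieves 84.3% top-1 accuracy on ImageNet-1K classification and 51.6% on ADE20K semantic segmentation.
The approach achieves results competitive with BEiTv2 without distillation training from other models, and outperforms other methods such as MAE.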
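
To make the idea of a non-parametric centroid tokenizer more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names (`extract_patches`, `assign`, `fit_centroids`), the patch size, the number of centroids, and the random stand-in images are all assumptions chosen for illustration. It runs plain k-means on flattened patch pixels, then tokenizes each patch as the index of its nearest centroid.

```python
import numpy as np

def extract_patches(images, patch):
    """Split (N, H, W, C) images into flattened, non-overlapping patches."""
    n, h, w, c = images.shape
    gh, gw = h // patch, w // patch
    x = images[:, :gh * patch, :gw * patch]
    # (N, gh, patch, gw, patch, C) -> (N, gh, gw, patch, patch, C)
    x = x.reshape(n, gh, patch, gw, patch, c).transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(n, gh * gw, patch * patch * c)

def assign(patches, centroids):
    """Return the index of the nearest centroid for every patch vector."""
    # Squared Euclidean distance expanded as |a|^2 - 2ab + |b|^2.
    dist = ((patches ** 2).sum(-1, keepdims=True)
            - 2.0 * patches @ centroids.T
            + (centroids ** 2).sum(-1))
    return dist.argmin(-1)

def fit_centroids(patches, k, iters=10, seed=0):
    """Plain k-means (Lloyd's algorithm) over patch pixel vectors."""
    rng = np.random.default_rng(seed)
    flat = patches.reshape(-1, patches.shape[-1]).astype(np.float64)
    # Initialise centroids from k distinct, randomly chosen patches.
    centroids = flat[rng.choice(len(flat), size=k, replace=False)].copy()
    for _ in range(iters):
        ids = assign(flat, centroids)
        for j in range(k):
            members = flat[ids == j]
            if len(members):  # leave empty clusters unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

# Hypothetical toy data: 8 random 32x32 RGB "images", 8x8 patches, 16 centroids.
rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))
patches = extract_patches(images, patch=8)   # one row of pixels per patch
centroids = fit_centroids(patches, k=16)     # the centroid "codebook"
tokens = assign(patches, centroids)          # one index token per patch
```

In this sketch each centroid is itself a vector of patch pixels and its row index serves as the token, which is one way to read the statement that centroids represent both patch pixels and index tokens. No tokenizer network is trained.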

Summary (original)

Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using Vision Transformer (ViT). Previous works can be pixel-based or token-based, using original pixels or discrete visual tokens from parametric tokenizer models, respectively. Our proposed approach, \textbf{CCViT}, leverages k-means clustering to obtain centroids for image modeling without supervised training of tokenizer model. The centroids represent patch pixels and index tokens and have the property of local invariance. Non-parametric centroid tokenizer only takes seconds to create and is faster for token inference. Specifically, we adopt patch masking and centroid replacement strategies to construct corrupted inputs, and two stacked encoder blocks to predict corrupted patch tokens and reconstruct original patch pixels. Experiments show that the ViT-B model with only 300 epochs achieves 84.3\% top-1 accuracy on ImageNet-1K classification and 51.6\% on ADE20K semantic segmentation. Our approach achieves competitive results with BEiTv2 without distillation training from other models and outperforms other methods such as MAE.
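
The corruption step mentioned in the abstract (patch masking plus centroid replacement) could look roughly like the sketch below. The abstract does not say how patches are selected or in what proportions, so the function name `corrupt`, the ratios `mask_ratio` and `replace_ratio`, the uniform random selection, and the zero vector standing in for a mask token are all assumptions for illustration, not the paper's settings.

```python
import numpy as np

def corrupt(patches, token_ids, centroids,
            mask_ratio=0.4, replace_ratio=0.1, seed=0):
    """Build a corrupted copy of one image's patches.

    patches:   (P, D) flattened patch pixels
    token_ids: (P,)   index of the nearest centroid for each patch
    centroids: (K, D) centroid pixel vectors
    The ratios are assumed values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    num = patches.shape[0]
    order = rng.permutation(num)
    n_mask = int(round(num * mask_ratio))
    n_rep = int(round(num * replace_ratio))

    masked = np.zeros(num, dtype=bool)
    replaced = np.zeros(num, dtype=bool)
    masked[order[:n_mask]] = True                    # patches to hide
    replaced[order[n_mask:n_mask + n_rep]] = True    # patches to swap

    out = patches.copy()
    out[masked] = 0.0  # placeholder for a mask token (assumption)
    out[replaced] = centroids[token_ids[replaced]]   # swap in centroid pixels
    return out, masked, replaced

# Hypothetical toy data: 20 patches of dimension 12 and 5 centroids.
rng = np.random.default_rng(1)
patches = rng.random((20, 12))
centroids = rng.random((5, 12))
# Nearest-centroid token for each patch (squared Euclidean distance).
token_ids = ((patches[:, None, :] - centroids[None]) ** 2).sum(-1).argmin(-1)

corrupted, masked, replaced = corrupt(patches, token_ids, centroids)
```

A model trained on `corrupted` would then be asked to predict `token_ids` and reconstruct `patches`, in line with the abstract's description of predicting corrupted patch tokens and reconstructing the original patch pixels.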

arXiv information

Authors	Xin Yan, Zuchao Li, Lefei Zhang, Bo Du, Dacheng Tao
Published	2023-03-08 15:34:57+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
