Centroid-centered Modeling for Efficient Vision Transformer Pre-training

Summary

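Masked image modeling (MIM) is a recent self-supervised pre-training paradigm for vision that uses Vision Transformers (ViTs).
Previous work is either pixel-based, using the original pixels, or token-based, using discrete visual tokens from a parametric tokenizer model.
Our proposed centroid-based approach, CCViT, uses k-means clustering to obtain centroids for image modeling without any supervised training of a tokenizer model.
This non-parametric centroid tokenizer takes only seconds to create and is faster at token inference.
The centroids can represent both patch pixels and index tokens, and they have the property of local invariance.
Specifically, we adopt patch masking and centroid replacement strategies to construct corrupted inputs, and use two stacked encoder blocks to predict the tokens of the corrupted patches and to reconstruct the original patch pixels.
Experiments show that CCViT achieves 84.4% top-1 accuracy on ImageNet-1K classification with ViT-B and 86.0% with ViT-L.
We also transfer our pre-trained model to other downstream tasks.
Our approach achieves results competitive with recent baselines without external supervision or distillation training from other models.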
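
To make the centroid tokenizer idea concrete, here is a minimal, dependency-free sketch of k-means over flattened image patches. It is an illustration under stated assumptions, not the authors' implementation: the function names (`fit_centroids`, `tokenize`), the toy 4-value patches, the two-centroid vocabulary, and the iteration count are all hypothetical choices made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(patch, centroids):
    """Index of the centroid closest to a flattened patch (its index token)."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(patch, centroids[k]))


def fit_centroids(patches, num_centroids, num_iters=10, seed=0):
    """Plain Lloyd's k-means; no gradient-based or supervised training."""
    rng = random.Random(seed)
    # Initialise the centroids from randomly chosen patches.
    centroids = [list(p) for p in rng.sample(patches, num_centroids)]
    for _ in range(num_iters):
        # Assignment step: group patches by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in patches:
            clusters[nearest_centroid(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for k, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def tokenize(patches, centroids):
    """Map every patch to the index of its nearest centroid."""
    return [nearest_centroid(p, centroids) for p in patches]


# Toy data (assumption): 20 dark and 20 bright patches of 4 pixel values each.
data_rng = random.Random(1)
dark = [[data_rng.uniform(0.0, 0.2) for _ in range(4)] for _ in range(20)]
bright = [[data_rng.uniform(0.8, 1.0) for _ in range(4)] for _ in range(20)]
patches = dark + bright

centroids = fit_centroids(patches, num_centroids=2)
tokens = tokenize(patches, centroids)

# A slightly perturbed patch still falls in the same cluster.
shifted = [v + 0.01 for v in patches[0]]
shifted_token = tokenize([shifted], centroids)[0]
```

In this sketch a centroid plays both roles mentioned in the summary: `tokens[i]` is the index token of patch `i`, and `centroids[tokens[i]]` is a pixel-space stand-in for that patch. The perturbed patch receiving the same token as the original is a small-scale picture of local invariance.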

Abstract (original)

Masked Image Modeling (MIM) is a new self-supervised vision pre-training paradigm using a Vision Transformer (ViT). Previous works can be pixel-based or token-based, using original pixels or discrete visual tokens from parametric tokenizer models, respectively. Our proposed centroid-based approach, CCViT, leverages k-means clustering to obtain centroids for image modeling without supervised training of the tokenizer model, which only takes seconds to create. This non-parametric centroid tokenizer only takes seconds to create and is faster for token inference. The centroids can represent both patch pixels and index tokens with the property of local invariance. Specifically, we adopt patch masking and centroid replacing strategies to construct corrupted inputs, and two stacked encoder blocks to predict corrupted patch tokens and reconstruct original patch pixels. Experiments show that our CCViT achieves 84.4% top-1 accuracy on ImageNet-1K classification with ViT-B and 86.0% with ViT-L. We also transfer our pre-trained model to other downstream tasks. Our approach achieves competitive results with recent baselines without external supervision and distillation training from other models.
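
The corrupted-input construction mentioned in the abstract (patch masking plus centroid replacement) can be sketched as follows. This is a hedged illustration only: the abstract does not give the ratios or the sampling scheme, so `mask_ratio`, `replace_ratio`, the `MASK` placeholder, the `corrupt` function, and the toy centroids are all assumptions introduced for the example.

```python
import random

MASK = None  # stand-in for a learnable [MASK] embedding (assumption)


def corrupt(patches, centroids, tokens, mask_ratio=0.4, replace_ratio=0.1, seed=0):
    """Return (corrupted_patches, corrupted_positions).

    A mask_ratio share of positions becomes MASK, and a further replace_ratio
    share is swapped for the pixels of the centroid assigned to that patch.
    Both ratios are illustrative and not taken from the paper.
    """
    rng = random.Random(seed)
    n = len(patches)
    order = list(range(n))
    rng.shuffle(order)
    num_mask = int(n * mask_ratio)
    num_replace = int(n * replace_ratio)
    masked = set(order[:num_mask])
    replaced = set(order[num_mask:num_mask + num_replace])

    corrupted = []
    for i, patch in enumerate(patches):
        if i in masked:
            corrupted.append(MASK)  # patch hidden entirely
        elif i in replaced:
            corrupted.append(list(centroids[tokens[i]]))  # centroid pixels
        else:
            corrupted.append(list(patch))  # patch left untouched
    return corrupted, sorted(masked | replaced)


# Toy setup (assumption): two centroids and ten 4-value patches.
centroids = [[0.1, 0.1, 0.1, 0.1], [0.9, 0.9, 0.9, 0.9]]
patches = [[0.0, 0.1, 0.2, 0.1], [1.0, 0.9, 0.8, 0.9]] * 5
tokens = [0, 1] * 5  # index of the nearest centroid for each patch

corrupted, positions = corrupt(patches, centroids, tokens)

# Training targets at the corrupted positions: the centroid index token to
# predict and the original pixels to reconstruct.
targets = [(i, tokens[i], patches[i]) for i in positions]
```

An encoder trained on `corrupted` would then be asked to recover, for each entry in `targets`, both the index token and the original pixels; the two stacked encoder blocks themselves are outside the scope of this sketch.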

arXiv information

Authors	Xin Yan, Zuchao Li, Lefei Zhang
Published	2024-08-01 08:39:23+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
