Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer

Summary

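Vision Transformers (ViTs) made it possible to use the transformer architecture for vision tasks, and they show impressive performance when trained on large datasets.
On relatively small datasets, however, ViTs are less accurate because they lack inductive bias.
To address this, we propose a simple yet effective self-supervised learning (SSL) strategy for training ViTs that significantly improves results without any external annotation or external data.
Specifically, we define a set of SSL tasks based on relations between image patches, which the model must solve either before the supervised task or jointly with it.
Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that correspond to image patches, and so exploits more training signal at each training step.
We evaluated our method on several image benchmarks and found that RelViT improves on state-of-the-art SSL methods by a large margin, especially on small datasets.
The code is available at https://github.com/guglielmocamporese/relvit.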
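
The summary does not spell out the individual SSL tasks, so the following is only an illustrative sketch of what a patch-relation task could look like, not the authors' implementation. It assumes a hypothetical "is patch j a spatial neighbor of patch i?" task on a square patch grid, and a loss averaged over every ordered pair of patch tokens. The function names (`neighbor_targets`, `pairwise_bce`) and the 8-neighborhood definition are assumptions made for this example.

```python
import math


def grid_positions(grid_size):
    """(row, col) of every patch in a grid_size x grid_size grid, row-major."""
    return [(r, c) for r in range(grid_size) for c in range(grid_size)]


def neighbor_targets(grid_size):
    """Hypothetical SSL targets: targets[i][j] is 1 if patch j lies in the
    8-neighborhood of patch i, else 0. No human annotation is needed,
    the labels come from the patch layout itself."""
    pos = grid_positions(grid_size)
    targets = []
    for ri, ci in pos:
        row = []
        for rj, cj in pos:
            adjacent = max(abs(ri - rj), abs(ci - cj)) == 1
            row.append(1 if adjacent else 0)
        targets.append(row)
    return targets


def pairwise_bce(logits, targets):
    """Mean binary cross-entropy over all ordered patch pairs (i != j).
    `logits[i][j]` stands in for a model's score for the pair (i, j)."""
    total, count = 0.0, 0
    n = len(targets)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            p = 1.0 / (1.0 + math.exp(-logits[i][j]))
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
            y = targets[i][j]
            total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
            count += 1
    return total / count


# A 3x3 patch grid: 9 patch tokens give 9 * 8 = 72 supervised pairs per image,
# compared with a single label when only a class token is supervised.
targets = neighbor_targets(3)
num_pair_targets = len(targets) * (len(targets) - 1)

# Placeholder scores: an untrained "model" that outputs 0 for every pair.
zero_logits = [[0.0] * 9 for _ in range(9)]
loss = pairwise_bce(zero_logits, targets)
```

In this sketch every patch token contributes to the loss, which mirrors the summary's point that supervising all patch-related output tokens yields more training signal per step than supervising one token alone.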

Summary (original)

Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks showing impressive performances when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple but still effective Self-Supervised Learning (SSL) strategy to train ViTs, that without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly the supervised task. Differently from ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signals at each training step. We investigated our methods on several image benchmarks finding that RelViT improves the SSL state-of-the-art methods by a large margin, especially on small datasets. Code is available at: https://github.com/guglielmocamporese/relvit.

arXiv information

Authors	Guglielmo Camporese, Elena Izzo, Lamberto Ballan
Published	2022-10-13 14:11:34+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
