DINOv2: Learning Robust Visual Features without Supervision

Summary

Title: DINOv2: Learning Robust Visual Features without Supervision

Summary:

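– Recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision.
– These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning.
– This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources.
– We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size.
– Most of the technical contributions aim at accelerating and stabilizing training at scale.
– In terms of data, we propose an automatic pipeline that builds a dedicated, diverse, and curated image dataset, instead of relying on uncurated data as is typical in the self-supervised literature.
– In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most benchmarks at the image and pixel levels.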
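
The last point mentions distilling a large model into smaller ones. The summary does not describe the exact recipe, so the following is only a generic sketch of a soft-target distillation loss, not the authors' implementation: a student is trained to match the teacher's softened output distribution. The function names and the `temperature` value are assumptions made for illustration.

```python
import math
from typing import List


def log_softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,  # hypothetical value, not taken from the paper
) -> float:
    """Cross-entropy between the teacher's soft targets and the student's prediction."""
    teacher_probs = [math.exp(v) for v in log_softmax(teacher_logits, temperature)]
    student_log_probs = log_softmax(student_logits, temperature)
    return -sum(t * s for t, s in zip(teacher_probs, student_log_probs))


teacher = [2.0, 0.5, -1.0]
matching_student = [2.0, 0.5, -1.0]
mismatched_student = [-1.0, 0.5, 2.0]

loss_match = distillation_loss(matching_student, teacher)
loss_mismatch = distillation_loss(mismatched_student, teacher)
print(loss_match < loss_mismatch)
```

The loss is smallest when the student reproduces the teacher's distribution, which is the signal a smaller model would be optimized against in this kind of setup.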

Abstract (original)

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021) on most of the benchmarks at image and pixel levels.
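
To make "features that work across image distributions and tasks without finetuning" concrete, here is a minimal sketch of using a frozen encoder: the encoder is never updated, and only a lightweight classifier (here, nearest class centroid) is fitted on its outputs. `frozen_encoder` is a hypothetical stand-in that computes two trivial statistics; it is not a DINOv2 model, and all names are assumptions for illustration.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def frozen_encoder(image: Sequence[float]) -> Vector:
    """Hypothetical stand-in for a pretrained backbone; its behavior is fixed."""
    return [sum(image) / len(image), max(image) - min(image)]


def fit_centroids(
    encoder: Callable[[Sequence[float]], Vector],
    images: List[Sequence[float]],
    labels: List[str],
) -> Dict[str, Vector]:
    """Average the frozen features per class; the encoder itself is not trained."""
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for image, label in zip(images, labels):
        feature = encoder(image)
        acc = sums.setdefault(label, [0.0] * len(feature))
        for i, value in enumerate(feature):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(
    encoder: Callable[[Sequence[float]], Vector],
    centroids: Dict[str, Vector],
    image: Sequence[float],
) -> str:
    """Return the label whose centroid is closest to the image's frozen feature."""
    feature = encoder(image)

    def squared_distance(centroid: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(feature, centroid))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


train_images = [[0.1, 0.2, 0.1, 0.2], [0.0, 0.1, 0.2, 0.1],
                [0.9, 0.8, 0.9, 1.0], [0.8, 0.9, 0.9, 0.8]]
train_labels = ["dark", "dark", "bright", "bright"]

centroids = fit_centroids(frozen_encoder, train_images, train_labels)
print(predict(frozen_encoder, centroids, [0.15, 0.1, 0.2, 0.05]))
```

Swapping in a different downstream task only changes the labels and the small classifier; the encoder stays untouched, which is the property the abstract refers to.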

arXiv information

Authors	Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski
Published	2023-04-14 15:12:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
