Accessing Vision Foundation Models via ImageNet-1K

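Summary

Vision foundation models are known for their generalization ability, which stems from massive training data.
However, they demand enormous training resources, and their training data is often inaccessible (e.g., CLIP, DINOv2), which makes it hard to develop derivative models that could advance research.
This work presents Proteus, a very simple and general solution that distills foundation models into smaller equivalents on ImageNet-1K without access to the original training data.
Specifically, it removes the designs in conventional knowledge distillation settings that introduce dataset bias, and it uses training objectives at three levels (token, patch, and feature) to maximize the efficacy of knowledge transfer.
As a result, Proteus is trained at ImageNet-level cost yet shows surprising ability, making the training of foundation models more accessible to the broader research community.
With DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the oracle method DINOv2-L/14 (142M training data) across 19 benchmarks.
It also outperforms other vision foundation models, including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B), and SynCLR-L/14 (600M), while using a much smaller training set of 1.2M images.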

Summary (original)

Vision foundation models are renowned for the generalization ability due to massive training data. Nevertheless, they demand tremendous training resources, and the training data is often inaccessible, e.g., CLIP, DINOv2, posing great challenges to developing derivatives that could facilitate the research. In this work, we offer a very simple and general solution, named \textit{Proteus}, to distill foundation models into smaller equivalents on ImageNet-1K without access to the original training data. Specifically, we remove the designs from conventional knowledge distillation settings that result in dataset bias and present three levels of training objectives, i.e., token, patch, and feature, to maximize the efficacy of knowledge transfer. In this manner, Proteus is trained at ImageNet-level costs with surprising ability, facilitating the accessibility of training foundation models for the broader research community. When leveraging DINOv2-g/14 as the teacher, Proteus-L/14 matches the performance of the Oracle method DINOv2-L/14 (142M training data) across 19 benchmarks and outperforms other vision foundation models including CLIP-L/14 (400M), OpenCLIP-L/14 (400M/2B) and SynCLR-L/14 (600M) with a significantly smaller training set of 1.2M images.
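
The abstract names three levels of training objectives (token, patch, and feature) but does not define them. The following is a minimal sketch of how such a combined distillation loss could look, not the authors' implementation. It assumes that each model outputs a list of token vectors with the class token at index 0, that the student's outputs have already been projected to the teacher's dimension, and that each level uses a plain mean squared error. The names `mse`, `three_level_loss`, and the weight parameters are hypothetical.

```python
from typing import Sequence

Vector = Sequence[float]


def mse(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Mean squared error over two equally shaped lists of vectors."""
    total, count = 0.0, 0
    for va, vb in zip(a, b):
        for x, y in zip(va, vb):
            total += (x - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("inputs must contain at least one element")
    return total / count


def three_level_loss(
    student: Sequence[Vector],
    teacher: Sequence[Vector],
    w_token: float = 1.0,
    w_patch: float = 1.0,
    w_feature: float = 1.0,
) -> float:
    """Weighted sum of token-, patch-, and feature-level MSE terms.

    Assumption (not from the source): index 0 holds the class token,
    the remaining entries are patch tokens, and the feature level
    compares the full output sequence.
    """
    token_loss = mse(student[:1], teacher[:1])    # class token only
    patch_loss = mse(student[1:], teacher[1:])    # patch tokens only
    feature_loss = mse(student, teacher)          # all outputs together
    return w_token * token_loss + w_patch * patch_loss + w_feature * feature_loss


# Toy example: one class token followed by two patch tokens, dimension 2.
student_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
teacher_out = [[0.0, 0.0], [0.0, 1.0], [1.0, 3.0]]
loss = three_level_loss(student_out, teacher_out)
```

In this sketch the teacher's outputs act as the only training signal, so no ImageNet-1K labels are needed. That is one way to read the abstract's point about removing designs that introduce dataset bias, though the source does not spell out which designs are removed.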

arXiv information

Authors	Yitian Zhang, Xu Ma, Yue Bai, Huan Wang, Yun Fu
Published	2025-02-11 18:44:46+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
