Addressing Sample Inefficiency in Multi-View Representation Learning

Abstract

Non-contrastive self-supervised learning (NC-SSL) methods like BarlowTwins and VICReg have shown great promise for label-free representation learning in computer vision. Despite the apparent simplicity of these techniques, researchers must rely on several empirical heuristics to achieve competitive performance, most notably using high-dimensional projector heads and two augmentations of the same image. In this work, we provide theoretical insights on the implicit bias of the BarlowTwins and VICReg loss that can explain these heuristics and guide the development of more principled recommendations. Our first insight is that the orthogonality of the features is more critical than projector dimensionality for learning good representations. Based on this, we empirically demonstrate that low-dimensional projector heads are sufficient with appropriate regularization, contrary to the existing heuristic. Our second theoretical insight suggests that using multiple data augmentations better represents the desiderata of the SSL objective. Based on this, we demonstrate that leveraging more augmentations per sample improves representation quality and trainability. In particular, it improves optimization convergence, leading to better features emerging earlier in the training. Remarkably, we demonstrate that we can reduce the pretraining dataset size by up to 4x while maintaining accuracy and improving convergence simply by using more data augmentations. Combining these insights, we present practical pretraining recommendations that improve wall-clock time by 2x and improve performance on CIFAR-10/STL-10 datasets using a ResNet-50 backbone. Thus, this work provides a theoretical insight into NC-SSL and produces practical recommendations for enhancing its sample and compute efficiency.
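
To make the objective discussed in the abstract concrete, here is a minimal sketch of a BarlowTwins-style loss and one possible way to extend it to more than two augmentations. The two-view loss follows the commonly published BarlowTwins form (drive the cross-correlation matrix of the two views' embeddings toward the identity, which encourages decorrelated, roughly orthogonal features). The multi-view extension, which simply averages that loss over all pairs of views, is an assumption made for illustration and is not taken from the abstract; the function names, the `lam` value, and the use of plain Python lists instead of a tensor library are likewise illustrative choices.

```python
import math
from itertools import combinations


def standardize(z):
    """Return the columns of z, each scaled to zero mean and unit variance.

    z is a batch of embeddings: a list of n rows, each a list of d floats.
    """
    n, d = len(z), len(z[0])
    cols = []
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        # Fall back to 1.0 for a constant column to avoid dividing by zero.
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n) or 1.0
        cols.append([(v - mean) / std for v in col])
    return cols


def barlow_twins_loss(z_a, z_b, lam=5e-3):
    """BarlowTwins-style loss between embeddings of two augmented views.

    Diagonal terms pull matching dimensions toward correlation 1 (invariance);
    off-diagonal terms push different dimensions toward correlation 0
    (decorrelation), weighted by lam (an illustrative value).
    """
    a, b = standardize(z_a), standardize(z_b)
    n = len(z_a)
    loss = 0.0
    for i, col_a in enumerate(a):
        for j, col_b in enumerate(b):
            c_ij = sum(x * y for x, y in zip(col_a, col_b)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else lam * c_ij ** 2
    return loss


def multi_view_loss(views, lam=5e-3):
    """Assumed multi-augmentation variant: average the loss over all view pairs."""
    pairs = list(combinations(views, 2))
    return sum(barlow_twins_loss(a, b, lam) for a, b in pairs) / len(pairs)
```

Calling `multi_view_loss([z1, z2, z3, z4])` with four embedding batches of the same images would average six pairwise terms. In practice one would compute this with a tensor library on GPU, and the cost grows quadratically with the number of views, so the pairing scheme is a design choice rather than a given.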

arXiv information

Authors	Kumar Krishna Agrawal, Arna Ghosh, Adam Oberman, Blake Richards
Published	2023-12-17 14:14:31+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
