Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation

Abstract

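We tackle weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision.
Our proposed method takes token representations from the self-supervised ViT and exploits their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads.
Our model can effectively learn to perform classification and segmentation using only image-level labels, even when no pixel-level labels are available during training.
To do this, it uses attention maps, built from the tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels.
We also explore a practical setup with "mixed" supervision, in which a small number of training images have ground-truth pixel-level labels and the remaining images have only image-level labels.
For this mixed setup, we propose improving the pseudo-labels with a pseudo-label enhancer trained on the available ground-truth pixel-level labels.
Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains across a variety of supervision settings, particularly when few or no pixel-level labels are available.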

Abstract (original)

We address the task of weakly-supervised few-shot image classification and segmentation by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this, it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with “mixed” supervision, where a small number of training images contain ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, in particular when few or no pixel-level labels are available.
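
The abstract does not spell out how attention maps are turned into pixel-level pseudo-labels, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes that each image patch is represented by a token vector, that a single query vector (for example a class or support token) is compared against every patch token by cosine similarity, and that the resulting map is min-max normalized and thresholded. The function names `cosine` and `pseudo_mask`, the threshold of 0.5, and the toy 2x2 grid are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def pseudo_mask(query_token, patch_tokens, grid_hw, threshold=0.5):
    """Build a binary patch-level pseudo-label from token correlations.

    Assumed recipe (not taken from the paper): cosine similarity of each
    patch token to the query token, min-max normalization, then a threshold.
    """
    height, width = grid_hw
    if len(patch_tokens) != height * width:
        raise ValueError("number of patch tokens must equal height * width")

    sims = [cosine(query_token, p) for p in patch_tokens]
    lo, hi = min(sims), max(sims)
    span = hi - lo
    # A constant similarity map carries no localization signal: mark nothing.
    normalized = [(s - lo) / span if span > 0 else 0.0 for s in sims]
    flat = [1 if a >= threshold else 0 for a in normalized]
    return [flat[r * width:(r + 1) * width] for r in range(height)]


# Toy example: a 2x2 patch grid with 2-dimensional tokens.
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
mask = pseudo_mask([1.0, 0.0], patches, (2, 2))
print(mask)
```

In this toy grid the top row of patches points in nearly the same direction as the query, so only those patches are marked as foreground. A real pipeline would take the tokens from a pretrained ViT and upsample the patch-level mask to image resolution.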

arXiv information

Authors	Dahyun Kang, Piotr Koniusz, Minsu Cho, Naila Murray
Published	2023-07-07 06:16:43+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
