Do text-free diffusion models learn discriminative visual representations?

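Summary

Most unsupervised learning models focus on a single family of tasks, either generative or discriminative. We explore the possibility of a unified representation learner: a model that addresses both families of tasks simultaneously.
We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate.
Such models train a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images.
We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations.
We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer that fuses features from different diffusion U-Net blocks and noise steps.
We also develop DifFeed, a novel feedback mechanism tailored to diffusion.
We find that diffusion models outperform GANs and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods on discriminative tasks: image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation.
Our project website (https://mgwillia.github.io/diffssl/) and code (https://github.com/soumik-kanad/diffssl) are publicly available.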

Summary (original)

While many unsupervised learning models focus on one family of tasks, either generative or discriminative, we explore the possibility of a unified representation learner: a model which addresses both families of tasks simultaneously. We identify diffusion models, a state-of-the-art method for generative tasks, as a prime candidate. Such models involve training a U-Net to iteratively predict and remove noise, and the resulting model can synthesize high-fidelity, diverse, novel images. We find that the intermediate feature maps of the U-Net are diverse, discriminative feature representations. We propose a novel attention mechanism for pooling feature maps and further leverage this mechanism as DifFormer, a transformer feature fusion of features from different diffusion U-Net blocks and noise steps. We also develop DifFeed, a novel feedback mechanism tailored to diffusion. We find that diffusion models are better than GANs, and, with our fusion and feedback mechanisms, can compete with state-of-the-art unsupervised image representation learning methods for discriminative tasks – image classification with full and semi-supervision, transfer for fine-grained classification, object detection and segmentation, and semantic segmentation. Our project website (https://mgwillia.github.io/diffssl/) and code (https://github.com/soumik-kanad/diffssl) are available publicly.
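
The abstract mentions an attention mechanism for pooling feature maps but gives no details of it. As a rough illustration only, the Python sketch below shows generic dot-product attention pooling: each spatial position of a feature map is scored against a query vector, the scores are turned into weights with a softmax, and the feature vectors are averaged under those weights. The function names `attention_pool` and `flatten_feature_map`, the use of a single query vector, and the toy feature map are all assumptions made for this example; this is not the authors' DifFormer implementation.

```python
import math


def flatten_feature_map(feature_map):
    """Flatten an H x W x C nested list into a list of H*W feature vectors."""
    return [vec for row in feature_map for vec in row]


def attention_pool(features, query):
    """Pool feature vectors into one vector with dot-product attention.

    `features` is a list of equal-length vectors (one per spatial position).
    `query` is a hypothetical learned vector of the same length.
    Returns the pooled vector and the attention weights.
    """
    dim = len(query)
    scale = math.sqrt(dim)

    # Score each position by its scaled dot product with the query.
    scores = [sum(q * f for q, f in zip(query, vec)) / scale for vec in features]

    # Numerically stable softmax over positions.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of the feature vectors, one channel at a time.
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, features)) for i in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Toy 1 x 2 feature map with 2 channels (made-up values).
    toy_map = [[[1.0, 0.0], [0.0, 1.0]]]
    pooled, weights = attention_pool(flatten_feature_map(toy_map), [2.0, 0.0])
    print(pooled, weights)
```

With an all-zero query every position receives the same weight, so the result reduces to plain average pooling; a non-zero query shifts weight toward the positions whose features align with it.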

arXiv information

Authors	Soumik Mukhopadhyay, Matthew Gwilliam, Yosuke Yamaguchi, Vatsal Agarwal, Namitha Padmanabhan, Archana Swaminathan, Tianyi Zhou, Abhinav Shrivastava
Published	2023-11-30 03:02:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
