CLIP with Generative Latent Replay: a Strong Baseline for Incremental Learning

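Summary

With the emergence of Transformers and vision-language models (VLMs) such as CLIP, fine-tuning large pre-trained models has recently become a common strategy in continual learning (CL).
This has led to numerous prompting strategies for adapting Transformer-based models without incurring catastrophic forgetting.
However, these strategies often compromise the original zero-shot capabilities of the pre-trained CLIP model and struggle to adapt to domains that deviate significantly from the pre-training data.
In this work, we propose Continual Generative training for Incremental prompt-Learning, a simple and novel approach that mitigates forgetting while adapting CLIP.
Briefly, we use Variational Autoencoders (VAEs) to learn class-conditioned distributions within the embedding space of the visual encoder.
We then exploit these distributions to sample new synthetic visual embeddings and train the corresponding class-specific textual prompts during subsequent tasks.
Through extensive experiments on different domains, we show that this generative replay approach can adapt to new tasks while improving zero-shot capabilities, which we evaluate with a novel metric tailored to CL scenarios.
Notably, further analysis reveals that our approach can bridge the gap with joint prompt tuning.
The codebase is available at https://github.com/aimagelab/mammoth.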
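
The replay loop described above can be sketched in a few lines. This is a toy illustration under heavy assumptions, not the authors' implementation: a per-dimension Gaussian per class stands in for the VAE, random vectors stand in for CLIP visual embeddings, and each class's "prompt" is reduced to a single learnable vector scored by dot product rather than a textual prompt passed through the CLIP text encoder. All function names (`fit_generator`, `sample_synthetic`, `train_prompts`, `predict`) are hypothetical.

```python
import math
import random

def fit_generator(embeddings):
    """Fit a per-dimension Gaussian to one class's embeddings (stand-in for a VAE)."""
    n, dim = len(embeddings), len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / n for d in range(dim)]
    std = [math.sqrt(sum((e[d] - mean[d]) ** 2 for e in embeddings) / n) + 1e-6
           for d in range(dim)]
    return mean, std

def sample_synthetic(generator, count, rng):
    """Draw synthetic embeddings from a fitted class generator."""
    mean, std = generator
    return [[rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(count)]

def scores(prompts, x):
    """Dot-product logits between an embedding and every class prompt vector."""
    return {c: sum(w * v for w, v in zip(p, x)) for c, p in prompts.items()}

def predict(prompts, x):
    """Return the class whose prompt vector scores highest for embedding x."""
    logits = scores(prompts, x)
    return max(logits, key=logits.get)

def train_prompts(prompts, generators, rng, steps=50, per_class=8, lr=0.1):
    """Update all class prompts with softmax cross-entropy on replayed samples only."""
    for _ in range(steps):
        for label, generator in generators.items():
            for x in sample_synthetic(generator, per_class, rng):
                logits = scores(prompts, x)
                top = max(logits.values())
                exps = {c: math.exp(v - top) for c, v in logits.items()}
                total = sum(exps.values())
                for c in prompts:
                    # Gradient of cross-entropy w.r.t. the logit of class c.
                    grad = exps[c] / total - (1.0 if c == label else 0.0)
                    prompts[c] = [w - lr * grad * v for w, v in zip(prompts[c], x)]

# Toy stream (assumed data): two tasks with two classes each;
# class c is centered on 3 * e_c in a 4-dimensional embedding space.
rng = random.Random(0)
dim = 4
tasks = [[0, 1], [2, 3]]
generators, prompts = {}, {}
for task_classes in tasks:
    for c in task_classes:
        # "Real" embeddings are visible only while their task is current.
        real = [[rng.gauss(3.0 if d == c else 0.0, 0.3) for d in range(dim)]
                for _ in range(50)]
        generators[c] = fit_generator(real)
        prompts[c] = [0.0] * dim
    # Old classes are rehearsed from their generators, not from stored data.
    train_prompts(prompts, generators, rng)
```

After both tasks, `predict(prompts, x)` returns the class whose vector scores highest for an embedding `x`. The point of the sketch is that classes from the first task are rehearsed only through samples drawn from their generators, never through stored real embeddings.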

Abstract (original)

With the emergence of Transformers and Vision-Language Models (VLMs) such as CLIP, fine-tuning large pre-trained models has recently become a prevalent strategy in Continual Learning. This has led to the development of numerous prompting strategies to adapt transformer-based models without incurring catastrophic forgetting. However, these strategies often compromise the original zero-shot capabilities of the pre-trained CLIP model and struggle to adapt to domains that significantly deviate from the pre-training data. In this work, we propose Continual Generative training for Incremental prompt-Learning, a simple and novel approach to mitigate forgetting while adapting CLIP. Briefly, we employ Variational Autoencoders (VAEs) to learn class-conditioned distributions within the embedding space of the visual encoder. We then exploit these distributions to sample new synthetic visual embeddings and train the corresponding class-specific textual prompts during subsequent tasks. Through extensive experiments on different domains, we show that such a generative replay approach can adapt to new tasks while improving zero-shot capabilities, evaluated using a novel metric tailored for CL scenarios. Notably, further analysis reveals that our approach can bridge the gap with joint prompt tuning. The codebase is available at https://github.com/aimagelab/mammoth.

arXiv information

Authors	Emanuele Frascaroli, Aniello Panariello, Pietro Buzzega, Lorenzo Bonicelli, Angelo Porrello, Simone Calderara
Published	2024-10-28 12:41:35+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
