Learning without Forgetting for Vision-Language Models

Summary

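Class-incremental learning (CIL), or continual learning, is a desirable capability in real-world settings: it requires a learning system to adapt to new tasks without forgetting earlier ones.
Traditional CIL methods rely on visual information to capture core features, whereas recent advances in vision-language models (VLMs) show promise for learning generalizable representations with the help of textual information.
However, when training continues on new classes, VLMs often suffer catastrophic forgetting of previously acquired knowledge.
Applying VLMs to CIL raises two major challenges: 1) how to adapt the model without forgetting, and 2) how to make full use of the multimodal information.
To this end, we propose PROjectiOn Fusion (PROOF), which enables VLMs to learn without forgetting.
To address the first challenge, we propose training task-specific projections on top of the frozen image and text encoders.
When a new task arrives, new projections are added and the earlier projections are fixed, which alleviates the forgetting of old concepts.
For the second challenge, we propose a fusion module that makes better use of cross-modal information.
By jointly adjusting visual and textual features, the model can capture semantic information with stronger representational ability.
Extensive experiments on nine benchmark datasets validate that PROOF achieves state-of-the-art performance.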

Summary (original)

Class-Incremental Learning (CIL) or continual learning is a desired capability in the real world, which requires a learning system to adapt to new tasks without forgetting former ones. While traditional CIL methods focus on visual information to grasp core features, recent advances in Vision-Language Models (VLM) have shown promising capabilities in learning generalizable representations with the aid of textual information. However, when continually trained with new classes, VLMs often suffer from catastrophic forgetting of former knowledge. Applying VLMs to CIL poses two major challenges: 1) how to adapt the model without forgetting; and 2) how to make full use of the multi-modal information. To this end, we propose PROjectiOn Fusion (PROOF) that enables VLMs to learn without forgetting. To handle the first challenge, we propose training task-specific projections based on the frozen image/text encoders. When facing new tasks, new projections are expanded and former projections are fixed, alleviating the forgetting of old concepts. For the second challenge, we propose the fusion module to better utilize the cross-modality information. By jointly adjusting visual and textual features, the model can capture semantic information with stronger representation ability. Extensive experiments on nine benchmark datasets validate PROOF achieves state-of-the-art performance.
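
The expand-and-freeze idea behind the task-specific projections can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `ProjectionBank`, `expand`, `apply_gradients`, and `predict` are hypothetical, the projections are plain linear maps, their outputs are combined by summation, new projections start at zero (the first at identity), and classification uses cosine similarity between projected image and text features. The abstract specifies none of these details, and the fusion module is not modeled here.

```python
import math


def matvec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


class ProjectionBank:
    """Hypothetical container: one linear projection per task, applied to
    features that come from a frozen encoder (the encoder is not modeled)."""

    def __init__(self, dim):
        self.dim = dim
        self.projections = []  # one dim x dim matrix per task
        self.trainable = []    # parallel flags; False means frozen

    def expand(self):
        """Freeze all existing projections, then append a new trainable one."""
        self.trainable = [False] * len(self.projections)
        first = not self.projections
        # Assumption: the first projection starts as identity and later ones
        # start at zero, so adding a task does not change the output at first.
        new = [[1.0 if (first and i == j) else 0.0 for j in range(self.dim)]
               for i in range(self.dim)]
        self.projections.append(new)
        self.trainable.append(True)

    def project(self, feature):
        """Assumption: outputs of all task projections are summed."""
        out = [0.0] * self.dim
        for matrix in self.projections:
            out = [o + p for o, p in zip(out, matvec(matrix, feature))]
        return out

    def apply_gradients(self, grads, lr=0.1):
        """Plain gradient step that skips every frozen projection."""
        for matrix, grad, flag in zip(self.projections, grads, self.trainable):
            if not flag:
                continue
            for i in range(self.dim):
                for j in range(self.dim):
                    matrix[i][j] -= lr * grad[i][j]


def predict(image_bank, text_bank, image_feature, class_text_features):
    """Return the index of the class whose projected text feature is most
    similar to the projected image feature."""
    z = image_bank.project(image_feature)
    scores = [cosine(z, text_bank.project(t)) for t in class_text_features]
    return max(range(len(scores)), key=scores.__getitem__)
```

In this sketch, calling `expand()` once per incoming task keeps earlier projections untouched by later updates, which is the mechanism the abstract credits with alleviating forgetting. One bank would be kept for the image encoder and one for the text encoder.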

arXiv information

Authors	Da-Wei Zhou, Yuanhan Zhang, Jingyi Ning, Han-Jia Ye, De-Chuan Zhan, Ziwei Liu
Published	2023-05-30 17:59:32+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
