CLIP-Adapter: Better Vision-Language Models with Feature Adapters

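Summary

Large-scale contrastive vision-language pre-training has driven significant progress in visual representation learning.
Unlike traditional visual systems trained on a fixed set of discrete labels, the paradigm introduced in \cite{radford2021learning} learns directly to align images with raw text in an open-vocabulary setting.
On downstream tasks, a carefully chosen text prompt is used to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization \cite{zhou2021coop} has been proposed to learn continuous vectors as task-specific prompts from few-shot training examples.
This paper shows that prompt tuning is not the only path to better vision-language models. While prompt tuning acts on the textual inputs, the authors propose CLIP-Adapter, which fine-tunes with feature adapters on either the visual or the language branch.
Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and blends them, residual-style, with the original pre-trained features. As a result, CLIP-Adapter outperforms context optimization while keeping a simple design.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of the approach.
Code is released at https://github.com/gaopengcuhk/clip-adapter.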

Abstract (original)

Large-scale contrastive vision-language pre-training has shown significant progress in visual representation learning. Unlike traditional visual systems trained by a fixed set of discrete labels, a new paradigm was introduced in \cite{radford2021learning} to directly learn to align images with raw texts in an open-vocabulary setting. On downstream tasks, a carefully chosen text prompt is employed to make zero-shot predictions. To avoid non-trivial prompt engineering, context optimization \cite{zhou2021coop} has been proposed to learn continuous vectors as task-specific prompts with few-shot training examples. In this paper, we show that there is an alternative path to achieve better vision-language models other than prompt tuning. While prompt tuning is for the textual inputs, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either visual or language branch. Specifically, CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. As a consequence, CLIP-Adapter is able to outperform context optimization while maintaining a simple design. Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach. Code is released at https://github.com/gaopengcuhk/CLIP-Adapter.
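
The bottleneck adapter and residual-style blending described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The names `BottleneckAdapter`, `blend`, `reduction`, and `alpha` are hypothetical, the ReLU placement and the blending formula `alpha * adapted + (1 - alpha) * original` are assumptions about what "residual-style" means here, the weights are random and untrained, and the input vector is a toy stand-in for a frozen CLIP feature. Plain Python lists are used only to keep the example free of dependencies.

```python
import random


def relu(xs):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, x) for x in xs]


def linear(weights, xs):
    """Bias-free linear layer: each row of `weights` produces one output."""
    return [sum(w * x for w, x in zip(row, xs)) for row in weights]


def init_weights(out_dim, in_dim, rng):
    """Small random weights (illustrative initialisation, not from the paper)."""
    scale = in_dim ** -0.5
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


class BottleneckAdapter:
    """Two-layer MLP that squeezes the feature to dim // reduction and back."""

    def __init__(self, dim, reduction=4, seed=0):
        rng = random.Random(seed)
        hidden = max(1, dim // reduction)
        self.down = init_weights(hidden, dim, rng)  # dim -> hidden
        self.up = init_weights(dim, hidden, rng)    # hidden -> dim

    def __call__(self, feature):
        hidden = relu(linear(self.down, feature))
        return relu(linear(self.up, hidden))


def blend(feature, adapted, alpha):
    """Residual-style mix: alpha weights the new features, 1 - alpha the old."""
    return [alpha * a + (1.0 - alpha) * f for a, f in zip(adapted, feature)]


# Toy stand-in for a frozen, pre-trained image or text feature.
feature = [0.5, -1.0, 0.25, 2.0, 0.0, 1.5, -0.5, 1.0]
adapter = BottleneckAdapter(dim=len(feature), reduction=4)
blended = blend(feature, adapter(feature), alpha=0.2)
```

In this sketch, setting `alpha=0.0` returns the original feature unchanged, while `alpha=1.0` uses only the adapter output. In a few-shot setup one would train only the adapter weights and leave the pre-trained encoder frozen.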

arXiv information

Authors	Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, Yu Qiao
Published	2025-03-25 14:34:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
