Rethinking Few-Shot Adaptation of Vision-Language Models in Two Stages

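Abstract

An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop.
When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical.
This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-Tuning (PEFT) and FSA.
In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the "base" classes.
We show that such dynamics naturally split into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts.
To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently.
Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top.
We call this scheme Two-Stage Few-Shot Adaptation (2SFS).
Unlike established methods, our scheme enables a novel form of selective inference at the category level: at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier.
Results with fixed hyperparameters across two settings, three backbones, and eleven datasets show that 2SFS matches or surpasses the state of the art, while established methods degrade significantly across settings.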

Abstract (original)

An old-school recipe for training a classifier is to (i) learn a good feature extractor and (ii) optimize a linear layer atop. When only a handful of samples are available per category, as in Few-Shot Adaptation (FSA), data are insufficient to fit a large number of parameters, rendering the above impractical. This is especially true with large pre-trained Vision-Language Models (VLMs), which motivated successful research at the intersection of Parameter-Efficient Fine-tuning (PEFT) and FSA. In this work, we start by analyzing the learning dynamics of PEFT techniques when trained on few-shot data from only a subset of categories, referred to as the “base” classes. We show that such dynamics naturally splits into two distinct phases: (i) task-level feature extraction and (ii) specialization to the available concepts. To accommodate this dynamic, we then depart from prompt- or adapter-based methods and tackle FSA differently. Specifically, given a fixed computational budget, we split it to (i) learn a task-specific feature extractor via PEFT and (ii) train a linear classifier on top. We call this scheme Two-Stage Few-Shot Adaptation (2SFS). Differently from established methods, our scheme enables a novel form of selective inference at a category level, i.e., at test time, only novel categories are embedded by the adapted text encoder, while embeddings of base categories are available within the classifier. Results with fixed hyperparameters across two settings, three backbones, and eleven datasets, show that 2SFS matches or surpasses the state-of-the-art, while established methods degrade significantly across settings.
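
The budget split and the category-level selective inference described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `alpha` fraction, the function names, the cosine-similarity logits, and the `embed_text` callable (a stand-in for the adapted text encoder) are all assumptions made for illustration, and the two training stages themselves are omitted.

```python
import math


def split_budget(total_iters, alpha):
    """Split a fixed iteration budget between the two stages.

    `alpha` is a hypothetical knob: the fraction of the budget spent on
    stage 1 (PEFT feature learning); the rest goes to stage 2 (linear classifier).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    stage1 = int(round(alpha * total_iters))
    return stage1, total_iters - stage1


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_logits(image_feature, class_vectors):
    """Cosine similarity between one image feature and each class vector (assumed scoring rule)."""
    image = normalize(image_feature)
    return [sum(a * b for a, b in zip(image, normalize(w))) for w in class_vectors]


def selective_inference(image_feature, base_classifier, novel_names, embed_text):
    """Classify over base and novel categories.

    base_classifier: dict mapping base class name -> weight vector already
        stored in the stage-2 linear classifier (no text encoding needed).
    novel_names: class names unseen during adaptation.
    embed_text: hypothetical callable standing in for the adapted text
        encoder; it is invoked only for the novel names.
    """
    names = list(base_classifier) + list(novel_names)
    vectors = [base_classifier[n] for n in base_classifier]
    vectors += [embed_text(n) for n in novel_names]
    logits = cosine_logits(image_feature, vectors)
    best = max(range(len(names)), key=logits.__getitem__)
    return names[best], logits
```

In this sketch, base-class rows are read directly from the stored classifier, so only the novel class names pass through the text encoder at test time.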

arXiv information

Authors	Matteo Farina, Massimiliano Mancini, Giovanni Iacca, Elisa Ricci
Published	2025-03-14 17:24:01+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
