Zero-Shot Embeddings Inform Learning and Forgetting with Vision-Language Encoders

Summary

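Large vision-language foundation models are proliferating, yet how to estimate what such a model will learn and forget after fine-tuning remains largely unexplored.
Inspired by work highlighting the importance of the modality gap in contrastive dual encoders, we propose the Inter-Intra Modal Measure (IIMM).
The IIMM combines two terms, one quantifying the similarity between image embeddings and one quantifying the similarity between incorrectly paired image and label embeddings, and it serves as a strong predictor of how performance changes with fine-tuning.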
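An extensive empirical analysis across four state-of-the-art vision-language models (CLIP, SigLIP, CoCa, EVA-02-CLIP) and five fine-tuning techniques (full fine-tuning, BitFit, attention-weight tuning, LoRA, CLIP-Adapter) shows a strong, statistically significant linear relationship: fine-tuning on tasks with higher IIMM scores yields larger in-domain performance gains but also more severe out-of-domain performance degradation, and some parameter-efficient fine-tuning (PEFT) methods exhibit extreme forgetting.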
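We compare our measure with transfer scores from state-of-the-art model selection methods and show that the IIMM is significantly more predictive of accuracy gains.
With only a single forward pass over the target data, practitioners can use this insight to heuristically gauge how much a model can be expected to improve after fine-tuning.
Given additional knowledge of the model's performance on a few diverse tasks, the heuristic becomes a strong predictor of the expected performance change when training on a new task.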
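
The summary describes the IIMM only in words: one term for similarity between image embeddings and one for similarity between incorrect image and label embedding pairs. The sketch below is one possible reading of that description, not the authors' definition. It assumes cosine similarity for both terms and an unweighted sum to combine them; the function name `iimm_sketch` and its arguments are hypothetical, and the embeddings are assumed to come from a single zero-shot forward pass of some dual encoder.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so dot products equal cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def iimm_sketch(image_emb, label_emb, labels):
    """Hypothetical IIMM-style score from zero-shot embeddings.

    image_emb: (n_images, d) image embeddings
    label_emb: (n_classes, d) text embeddings of the class labels
    labels:    (n_images,) integer index of the correct class per image
    """
    img = l2_normalize(np.asarray(image_emb, dtype=float))
    txt = l2_normalize(np.asarray(label_emb, dtype=float))
    labels = np.asarray(labels)
    n = img.shape[0]

    # Intra-modal term: mean cosine similarity over distinct image pairs.
    img_sim = img @ img.T
    intra = (img_sim.sum() - np.trace(img_sim)) / (n * (n - 1))

    # Inter-modal term: mean cosine similarity between each image and
    # every label embedding except its correct one.
    cross = img @ txt.T
    wrong = np.ones_like(cross, dtype=bool)
    wrong[np.arange(n), labels] = False
    inter = cross[wrong].mean()

    # Assumption: the two terms are combined by an unweighted sum.
    return intra + inter
```

Under these assumptions the score needs only the embeddings and the ground-truth labels, which is consistent with the claim that a single forward pass of the target data suffices. The paper itself should be consulted for the exact terms and how they are combined.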

Abstract (original)

Despite the proliferation of large vision-language foundation models, estimation of the learning and forgetting outcomes following fine-tuning of these models remains largely unexplored. Inspired by work highlighting the significance of the modality gap in contrastive dual-encoders, we propose the Inter-Intra Modal Measure (IIMM). Combining terms quantifying the similarity between image embeddings and the similarity between incorrect image and label embedding pairs, the IIMM functions as a strong predictor of performance changes with fine-tuning. Our extensive empirical analysis across four state-of-the-art vision-language models (CLIP, SigLIP, CoCa, EVA-02-CLIP) and five fine-tuning techniques (full fine-tuning, BitFit, attention-weight tuning, LoRA, CLIP-Adapter) demonstrates a strong, statistically significant linear relationship: fine-tuning on tasks with higher IIMM scores produces greater in-domain performance gains but also induces more severe out-of-domain performance degradation, with some parameter-efficient fine-tuning (PEFT) methods showing extreme forgetting. We compare our measure against transfer scores from state-of-the-art model selection methods and show that the IIMM is significantly more predictive of accuracy gains. With only a single forward pass of the target data, practitioners can leverage this key insight to heuristically evaluate the degree to which a model can be expected to improve following fine-tuning. Given additional knowledge about the model’s performance on a few diverse tasks, this heuristic further evolves into a strong predictor of expected performance changes when training for new tasks.
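
The abstract reports a linear relationship between IIMM scores and performance change, and says that known results on a few diverse tasks turn the heuristic into a predictor. A minimal sketch of that idea, assuming an ordinary least-squares line is an acceptable stand-in for whatever fitting procedure the authors use, might look like the following. The helper names are hypothetical, and the numbers are placeholders chosen for illustration, not results from the paper.

```python
import numpy as np

def fit_gain_predictor(iimm_scores, accuracy_gains):
    # Ordinary least-squares line: gain ~ slope * iimm + intercept.
    slope, intercept = np.polyfit(iimm_scores, accuracy_gains, deg=1)
    return slope, intercept

def predict_gain(slope, intercept, iimm_score):
    # Apply the fitted line to the IIMM score of a new task.
    return slope * iimm_score + intercept

# Placeholder values for illustration only; NOT results from the paper.
known_iimm = np.array([0.2, 0.4, 0.6, 0.8])
known_gain = np.array([1.0, 2.0, 3.0, 4.0])

slope, intercept = fit_gain_predictor(known_iimm, known_gain)
print(predict_gain(slope, intercept, 0.5))
```

In practice `known_iimm` and `known_gain` would hold measured IIMM scores and measured accuracy changes for tasks on which the model has already been fine-tuned. The same fit could be repeated with out-of-domain accuracy changes to estimate forgetting.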

arXiv information

Authors	Laura Niss, Kevin Vogt-Lowell, Theodoros Tsiligkaridis
Published	2024-07-22 15:35:09+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
