Contrast-Aware Calibration for Fine-Tuned CLIP: Leveraging Image-Text Alignment

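Summary

Vision-language models (VLMs) such as CLIP generalize remarkably well and adapt quickly to downstream tasks through prompt fine-tuning. However, in classification tasks that include classes not seen during training, known as the open-vocabulary setting, fine-tuned VLMs often overfit to the training classes. The result is a mismatch between confidence scores and actual accuracy on unseen classes, which seriously undermines reliability in real-world deployment. Existing confidence calibration methods usually either require trainable parameters or analyze features from the training dataset, which limits their ability to generalize to unseen classes that have no corresponding training data. Moreover, VLM-specific calibration methods rely solely on the text features of the training classes as calibration indicators, which inherently limits their ability to calibrate the training classes themselves. To address these challenges, we propose Contrast-Aware Calibration (CAC), an effective multimodal calibration method. Building on the zero-shot adaptability of the original CLIP and on the conclusion, drawn from empirical analysis, that poor intra-class and inter-class discriminative ability on unseen classes is the root cause, we compute calibration weights from the contrastive difference between the original and the fine-tuned CLIP. The method not only calibrates unseen classes but also overcomes the limitation of earlier VLM calibration methods, which could not calibrate training classes. In experiments on 11 datasets with 5 fine-tuning methods, CAC consistently achieved the best calibration on both training and unseen classes without sacrificing accuracy or inference speed.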
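
The mismatch between confidence and accuracy described above is commonly quantified with the expected calibration error (ECE). The abstract does not say which metric or binning the paper uses, so the following is only a minimal sketch of the standard binned ECE, assuming equal-width confidence bins; the function name `expected_calibration_error` and the `n_bins` default are illustrative choices, not taken from the paper.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    Hypothetical helper for illustration; not the paper's evaluation code.
    confidences: predicted top-1 probabilities in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Equal-width bin edges over [0, 1]; interior edges define the bins.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)
    bin_ids = np.clip(bin_ids, 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0]))
```

A well-calibrated model yields a value near zero; a fine-tuned model that is overconfident on unseen classes would show a larger gap between mean confidence and accuracy in the high-confidence bins.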

Abstract (original)

Vision-language models (VLMs), such as CLIP, have demonstrated exceptional generalization capabilities and can quickly adapt to downstream tasks through prompt fine-tuning. Unfortunately, in classification tasks involving non-training classes, known as open-vocabulary setting, fine-tuned VLMs often overfit to train classes, resulting in a misalignment between confidence scores and actual accuracy on unseen classes, which significantly undermines their reliability in real-world deployments. Existing confidence calibration methods typically require training parameters or analyzing features from the training dataset, restricting their ability to generalize unseen classes without corresponding train data. Moreover, VLM-specific calibration methods rely solely on text features from train classes as calibration indicators, which inherently limits their ability to calibrate train classes. To address these challenges, we propose an effective multimodal calibration method Contrast-Aware Calibration (CAC). Building on the original CLIP’s zero-shot adaptability and the conclusion from empirical analysis that poor intra-class and inter-class discriminative ability on unseen classes is the root cause, we calculate calibration weights based on the contrastive difference between the original and fine-tuned CLIP. This method not only adapts to calibrating unseen classes but also overcomes the limitations of previous VLM calibration methods that could not calibrate train classes. In experiments involving 11 datasets with 5 fine-tuning methods, CAC consistently achieved the best calibration effect on both train and unseen classes without sacrificing accuracy and inference speed.

arXiv information

Authors	Song-Lin Lv, Yu-Yang Chen, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo
Published	2025-02-03 12:12:30+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
