Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model

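Abstract

Vision-language foundation models have achieved remarkable success across many downstream tasks, owing to how well they scale on large image-text paired datasets.
However, when applied to downstream tasks such as fine-grained image classification, these models also show significant limitations, caused by "decision shortcuts" that hinder their generalization.
In this work, the authors find that the CLIP model possesses a rich set of features that includes both desired invariant causal features and undesired decision shortcuts.
Moreover, CLIP's underperformance on downstream tasks stems from its inability to use its pre-trained features effectively in accordance with the requirements of the specific task.
To address this, they propose Spurious Feature Eraser (SEraser), a simple yet effective method that alleviates decision shortcuts by erasing spurious features.
Specifically, they introduce a test-time prompt tuning paradigm that optimizes a learnable prompt, compelling the model to exploit invariant features while disregarding decision shortcuts during inference.
The proposed method effectively reduces excessive dependence on potentially misleading spurious information.
A comparative analysis against various approaches validates its significant superiority.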

Abstract (original)

Vision-language foundation models have exhibited remarkable success across a multitude of downstream tasks due to their scalability on extensive image-text paired data. However, these models also display significant limitations when applied to downstream tasks, such as fine-grained image classification, as a result of “decision shortcuts” that hinder their generalization capabilities. In this work, we find that the CLIP model possesses a rich set of features, encompassing both \textit{desired invariant causal features} and \textit{undesired decision shortcuts}. Moreover, the underperformance of CLIP on downstream tasks originates from its inability to effectively utilize pre-trained features in accordance with specific task requirements. To address this challenge, we propose a simple yet effective method, Spurious Feature Eraser (SEraser), to alleviate the decision shortcuts by erasing the spurious features. Specifically, we introduce a test-time prompt tuning paradigm that optimizes a learnable prompt, thereby compelling the model to exploit invariant features while disregarding decision shortcuts during the inference phase. The proposed method effectively alleviates excessive dependence on potentially misleading spurious information. We conduct comparative analysis of the proposed method against various approaches which validates the significant superiority.
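
The abstract does not state the objective that SEraser optimizes, so the following is only a toy sketch of the general idea of test-time prompt tuning against a spurious feature, not the authors' implementation. It assumes (hypothetically) that a spurious feature vector such as a background-only embedding is available, that each class's text embedding is a frozen class embedding plus a learnable per-class prompt offset, and that the prompt is tuned by gradient ascent on the entropy of the prediction for the spurious feature, so that this feature alone no longer favors any class. All names, dimensions, and values are illustrative.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def logits_for(feature, class_embeddings, prompts):
    # Assumed toy model: text embedding = frozen class embedding + learnable prompt offset,
    # and the logit is the dot product with the image feature.
    return [
        sum((e + p) * f for e, p, f in zip(emb, prm, feature))
        for emb, prm in zip(class_embeddings, prompts)
    ]

def tune_prompts(spurious_feature, class_embeddings, steps=50, lr=1.0):
    # Gradient ascent on the entropy H of the prediction for the spurious feature.
    prompts = [[0.0] * len(spurious_feature) for _ in class_embeddings]
    for _ in range(steps):
        probs = softmax(logits_for(spurious_feature, class_embeddings, prompts))
        h = entropy(probs)
        for k, p_k in enumerate(probs):
            # dH/dz_k = -p_k * (log p_k + H), and dz_k/dprompt_k = spurious_feature.
            grad_z = -p_k * (math.log(p_k) + h)
            for j, s_j in enumerate(spurious_feature):
                prompts[k][j] += lr * grad_z * s_j
    return prompts

def predict(feature, class_embeddings, prompts):
    z = logits_for(feature, class_embeddings, prompts)
    return max(range(len(z)), key=z.__getitem__)

# Hypothetical 3-d features: dims 0 and 1 carry the object, dim 2 carries the background.
CLASS_EMBEDDINGS = [[1.0, 0.0, 0.8], [0.0, 1.0, -0.8]]  # both classes lean on the background
SPURIOUS_FEATURE = [0.0, 0.0, 1.0]                      # background-only feature
TEST_FEATURE = [0.2, 0.6, 0.9]                          # class-1 object on a class-0 background

zero_prompts = [[0.0] * 3 for _ in CLASS_EMBEDDINGS]
tuned_prompts = tune_prompts(SPURIOUS_FEATURE, CLASS_EMBEDDINGS)
pred_before = predict(TEST_FEATURE, CLASS_EMBEDDINGS, zero_prompts)
pred_after = predict(TEST_FEATURE, CLASS_EMBEDDINGS, tuned_prompts)
```

In this toy setup the untuned classifier follows the background and predicts class 0, while the tuned prompts cancel the background dimension and the prediction becomes class 1. A real CLIP setup would pass the learnable prompt through the text encoder and use automatic differentiation instead of the hand-derived gradient used here.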

arXiv information

Authors	Huan Ma, Yan Zhu, Changqing Zhang, Peilin Zhao, Baoyuan Wu, Long-Kai Huang, Qinghua Hu, Bingzhe Wu
Published	2024-06-03 07:09:39+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
