RaVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models

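Abstract

Fine-tuned vision-language models (VLMs) often learn spurious correlations between image features and textual attributes, which degrades zero-shot performance at test time.
Existing approaches to spurious correlations (i) operate mainly at the global image level rather than intervening directly on fine-grained image features, and (ii) are designed mainly for unimodal settings.
This work introduces RaVL, which takes a fine-grained view of VLM robustness: it discovers and mitigates spurious correlations using local image features instead of operating at the global image level.
Given a fine-tuned VLM, RaVL first discovers spurious correlations with a region-level clustering approach that identifies the precise image features contributing to zero-shot classification errors.
RaVL then mitigates the identified spurious correlation with a novel region-aware loss function that lets the VLM focus on relevant regions and ignore spurious relationships during fine-tuning.
RaVL is evaluated on 654 VLMs spanning various model architectures, data domains, and learned spurious correlations.
The results show that RaVL accurately discovers spurious correlations (a 191% improvement over the closest baseline) and mitigates them (an 8.2% improvement in worst-group image classification accuracy).
Qualitative evaluations on general-domain and medical-domain VLMs confirm these findings.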

Abstract (original)

Fine-tuned vision-language models (VLMs) often capture spurious correlations between image features and textual attributes, resulting in degraded zero-shot performance at test time. Existing approaches for addressing spurious correlations (i) primarily operate at the global image-level rather than intervening directly on fine-grained image features and (ii) are predominantly designed for unimodal settings. In this work, we present RaVL, which takes a fine-grained perspective on VLM robustness by discovering and mitigating spurious correlations using local image features rather than operating at the global image level. Given a fine-tuned VLM, RaVL first discovers spurious correlations by leveraging a region-level clustering approach to identify precise image features contributing to zero-shot classification errors. Then, RaVL mitigates the identified spurious correlation with a novel region-aware loss function that enables the VLM to focus on relevant regions and ignore spurious relationships during fine-tuning. We evaluate RaVL on 654 VLMs with various model architectures, data domains, and learned spurious correlations. Our results show that RaVL accurately discovers (191% improvement over the closest baseline) and mitigates (8.2% improvement on worst-group image classification accuracy) spurious correlations. Qualitative evaluations on general-domain and medical-domain VLMs confirm our findings.
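
The abstract reports mitigation results as worst-group image classification accuracy. As a rough illustration of that metric, here is a minimal Python sketch. It assumes the convention common in robustness work that a group is a (class label, spurious attribute) pair; the abstract itself does not define the groups, and the function name, toy labels, and attribute names below are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst accuracy, per-group accuracy dict).

    `groups` holds one hashable group id per example. Here a group is
    assumed to be a (class, spurious attribute) pair; this is an
    illustrative convention, not the paper's stated definition.
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("inputs must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must be non-empty")

    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)

    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy data (hypothetical): 1 = dog, 0 = cat; background is the spurious attribute.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
groups = [
    ("dog", "grass"), ("dog", "grass"), ("dog", "sand"), ("dog", "sand"),
    ("cat", "sand"), ("cat", "sand"), ("cat", "grass"), ("cat", "grass"),
]

worst, per_group = worst_group_accuracy(y_true, y_pred, groups)
overall = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the overall accuracy is 0.625 while the worst group (cats on grass) scores 0.0, which is why the metric is useful for exposing a model that relies on a spurious background cue.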

arXiv information

Authors	Maya Varma, Jean-Benoit Delbrouck, Zhihong Chen, Akshay Chaudhari, Curtis Langlotz
Published	2024-11-06 18:25:00+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
