Captured by Captions: On Memorization and its Mitigation in CLIP Models

Summary

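Multimodal models such as CLIP have demonstrated strong performance in aligning visual and textual representations, excelling at tasks such as image retrieval and zero-shot classification.
Despite this success, the mechanisms by which these models use their training data, and in particular the role of memorization, remain unclear.
In unimodal models, both supervised and self-supervised, memorization has been shown to be essential for generalization.
However, it is not well understood how these findings apply to CLIP, which combines elements of supervised learning, through captions that provide a supervisory signal similar to labels, with elements of self-supervised learning, through its contrastive objective.
To bridge this gap, we propose a formal definition of memorization in CLIP and use it to quantify memorization in CLIP models.
Our results show that CLIP's memorization behavior falls between the supervised and self-supervised paradigms.
Furthermore, we find that the text encoder contributes more to memorization than the image encoder, which suggests that mitigation strategies should focus on the text domain.
Building on these insights, we propose multiple strategies that reduce memorization while also improving utility, something that had not been shown for traditional learning paradigms, where reducing memorization typically reduces utility.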

Summary (original)

Multi-modal models, such as CLIP, have demonstrated strong performance in aligning visual and textual representations, excelling in tasks like image retrieval and zero-shot classification. Despite this success, the mechanisms by which these models utilize training data, particularly the role of memorization, remain unclear. In uni-modal models, both supervised and self-supervised, memorization has been shown to be essential for generalization. However, it is not well understood how these findings would apply to CLIP, which incorporates elements from both supervised learning via captions that provide a supervisory signal similar to labels, and from self-supervised learning via the contrastive objective. To bridge this gap in understanding, we propose a formal definition of memorization in CLIP (CLIPMem) and use it to quantify memorization in CLIP models. Our results indicate that CLIP’s memorization behavior falls between the supervised and self-supervised paradigms, with ‘mis-captioned’ samples exhibiting highest levels of memorization. Additionally, we find that the text encoder contributes more to memorization than the image encoder, suggesting that mitigation strategies should focus on the text domain. Building on these insights, we propose multiple strategies to reduce memorization while at the same time improving utility–something that had not been shown before for traditional learning paradigms where reducing memorization typically results in utility decrease.
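
For readers unfamiliar with the contrastive objective mentioned above, the following is a minimal sketch of a symmetric image-text contrastive loss of the kind CLIP-style models are trained with. It is not the paper's CLIPMem definition, which the abstract does not spell out. The function names, the toy embeddings, and the temperature value of 0.07 are assumptions chosen for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def log_softmax_at(scores, target):
    """Log-probability of index `target` under a softmax over `scores`."""
    top = max(scores)  # subtract the max for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[target] - log_z

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss: the k-th image and k-th caption form the
    positive pair; every other pairing in the batch acts as a negative.
    (Illustrative helper, not taken from the paper.)"""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = scaled cosine similarity between image i and caption j
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in txts] for img in imgs]
    # image-to-text direction: each row should peak on its own caption
    i2t = -sum(log_softmax_at(logits[k], k) for k in range(n)) / n
    # text-to-image direction: each column should peak on its own image
    columns = [[logits[r][k] for r in range(n)] for k in range(n)]
    t2i = -sum(log_softmax_at(columns[k], k) for k in range(n)) / n
    return 0.5 * (i2t + t2i)

# Toy batch: three orthogonal "embeddings", paired correctly and then mispaired.
embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = contrastive_loss(embs, embs)
mispaired = contrastive_loss(embs, embs[1:] + embs[:1])
print(aligned < mispaired)
```

In this sketch the caption plays a role similar to a label: it identifies which entry in the batch is the correct match for each image, while the contrast against the other pairs supplies the self-supervised part of the signal.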

arXiv information

Authors	Wenhao Wang, Adam Dziedzic, Grace C. Kim, Michael Backes, Franziska Boenisch
Published	2025-05-19 15:22:54+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
