Parrot Captions Teach CLIP to Spot Text

Summary

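Although CLIP is the foundation model for many vision-language applications, it suffers from a severe text spotting bias.
This bias causes CLIP models to "parrot" the visual text embedded in images while ignoring the genuine visual semantics.
In LAION-2B, the most popular image-text dataset, we find that the captions also densely parrot (spell out) the text embedded in the images.
Our analysis shows that around 50% of images contain embedded visual text, and around 30% of caption words appear in that embedded text.
Based on these observations, we thoroughly inspect the released versions of CLIP models and verify that visual text is the dominant factor when these models measure LAION-style image-text similarity.
To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria.
We show that training with parrot captions easily shapes such a bias but harms the vision-language representation learning expected of CLIP models.
This suggests an urgent need to revisit either the design of CLIP-like models or the existing image-text dataset curation pipelines built on CLIP score filtering.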
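
The "around 30% of caption words" figure is a word-overlap measurement between a caption and the text found in its image. As a minimal sketch of that idea, and not the authors' exact metric, the snippet below computes the fraction of caption words that also occur in the spotted text. The names `tokenize` and `parrot_ratio` are hypothetical, and the spotted text is assumed to come from some OCR or text spotting model that is not shown here.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def parrot_ratio(caption, spotted_text):
    """Fraction of caption words that also appear in the text spotted in the image.

    `spotted_text` is assumed to be the output of an external text spotter (OCR).
    Returns 0.0 when the caption has no words.
    """
    caption_words = tokenize(caption)
    if not caption_words:
        return 0.0
    spotted_words = set(tokenize(spotted_text))
    hits = sum(1 for word in caption_words if word in spotted_words)
    return hits / len(caption_words)


if __name__ == "__main__":
    # Five of the six caption words are spelled out in the image text.
    print(parrot_ratio("Keep Calm and Carry On poster", "KEEP CALM AND CARRY ON"))
```

A caption with a high ratio mostly spells out what is written in the image, which is the kind of sample a parrot-caption-oriented filter would separate from the rest of the data.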

Abstract (original)

Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to 'parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and around 30% of caption words are in this embedded visual content. Based on such observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.

arXiv information

Authors	Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
Published	2024-02-01 13:06:51+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
