FLAIR: VLM with Fine-grained Language-informed Image Representations

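Abstract

CLIP has shown impressive results in aligning images and text at scale. However, because CLIP matches images and text at a global level, its ability to capture detailed visual features is limited. To address this problem, we propose FLAIR (Fine-grained Language-informed Image Representations), an approach that uses long, detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details of an image, we train a vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens, producing fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance both on existing multimodal retrieval benchmarks and on a newly introduced fine-grained retrieval task that evaluates a vision-language model's ability to retrieve partial image content. Furthermore, our experiments demonstrate that FLAIR, trained on 30M image-text pairs, is effective at capturing fine-grained visual information, including in zero-shot semantic segmentation, and outperforms models trained on billions of pairs. Code is available at https://github.com/ExplainableML/flair.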
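
The idea of text-conditioned attention pooling can be illustrated with a minimal sketch. It is not the FLAIR implementation: it assumes a single attention head with no learned projections, uses the image tokens themselves as keys and values, and runs on made-up toy vectors. The names `text_conditioned_pool`, `image_tokens`, and `text_emb` are hypothetical, and plain Python lists are used so the example runs without dependencies.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def text_conditioned_pool(text_emb, image_tokens):
    """Pool local image tokens with the text embedding as the attention query.

    Assumption: single head, no learned projections; the image tokens serve
    as both keys and values.
    """
    scale = 1.0 / math.sqrt(len(text_emb))
    weights = softmax([dot(text_emb, tok) * scale for tok in image_tokens])
    dim = len(image_tokens[0])
    pooled = [
        sum(w * tok[d] for w, tok in zip(weights, image_tokens))
        for d in range(dim)
    ]
    return pooled, weights


# Toy data (made up): three local image tokens of dimension 4.
image_tokens = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Hypothetical caption embedding that points toward the second token.
text_emb = [0.0, 4.0, 0.0, 0.0]

# Text-specific image representation and its similarity to the caption.
pooled, weights = text_conditioned_pool(text_emb, image_tokens)
conditioned_score = cosine(pooled, text_emb)

# Baseline: a global embedding from uniform mean pooling of the same tokens.
mean_pooled = [sum(tok[d] for tok in image_tokens) / len(image_tokens) for d in range(4)]
global_score = cosine(mean_pooled, text_emb)

print(weights, conditioned_score, global_score)
```

In this toy setting the attention weights concentrate on the token that matches the caption, so the text-specific representation scores higher against the caption than the uniformly pooled global embedding does. A real model would add learned query, key, and value projections and multiple heads.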

Abstract (original)

CLIP has shown impressive results in aligning images and texts at scale. However, its ability to capture detailed visual features remains limited because CLIP matches images and texts at a global level. To address this issue, we propose FLAIR, Fine-grained Language-informed Image Representations, an approach that utilizes long and detailed image descriptions to learn localized image embeddings. By sampling diverse sub-captions that describe fine-grained details about an image, we train our vision-language model to produce not only global embeddings but also text-specific image representations. Our model introduces text-conditioned attention pooling on top of local image tokens to produce fine-grained image representations that excel at retrieving detailed image content. We achieve state-of-the-art performance on both, existing multimodal retrieval benchmarks, as well as, our newly introduced fine-grained retrieval task which evaluates vision-language models’ ability to retrieve partial image content. Furthermore, our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information, including zero-shot semantic segmentation, outperforming models trained on billions of pairs. Code is available at https://github.com/ExplainableML/flair .

arXiv information

Authors	Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, Stephan Alaniz
Published	2024-12-04 18:56:04+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
