Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features

Summary

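Given a query consisting of a reference image and a relative caption, the goal of composed image retrieval is to retrieve images that are visually similar to the reference image while incorporating the modifications described by the caption.
Because recent work has shown that large-scale vision and language pre-trained (VLP) models are effective across a variety of tasks, the authors build on features from the OpenAI CLIP model to address this task.
In the first stage, both CLIP encoders are fine-tuned in a task-oriented way, using the element-wise sum of the visual and textual features as the query representation.
In the second stage, a Combiner network is trained to merge the image and text features, integrating the bimodal information into a single combined feature that is used for retrieval.
Contrastive learning is used in both training stages.
Starting from the unmodified CLIP features as a baseline, the experiments show that the task-oriented fine-tuning and the carefully designed Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval.
Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir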

Abstract (original)

Given a query composed of a reference image and a relative caption, the Composed Image Retrieval goal is to retrieve images visually similar to the reference one that integrates the modifications expressed by the caption. Given that recent research has demonstrated the efficacy of large-scale vision and language pre-trained (VLP) models in various tasks, we rely on features from the OpenAI CLIP model to tackle the considered task. We initially perform a task-oriented fine-tuning of both CLIP encoders using the element-wise sum of visual and textual features. Then, in the second stage, we train a Combiner network that learns to combine the image-text features integrating the bimodal information and providing combined features used to perform the retrieval. We use contrastive learning in both stages of training. Starting from the bare CLIP features as a baseline, experimental results show that the task-oriented fine-tuning and the carefully crafted Combiner network are highly effective and outperform more complex state-of-the-art approaches on FashionIQ and CIRR, two popular and challenging datasets for composed image retrieval. Code and pre-trained models are available at https://github.com/ABaldrati/CLIP4Cir
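
The abstract does not give implementation details, so the following is only a minimal sketch of the two ideas it names for the first stage: forming a query feature as the element-wise sum of image and text features, and training with a contrastive objective. It assumes L2-normalized features, an InfoNCE-style loss with in-batch negatives, and a temperature of 0.07; the function names, the temperature value, and the toy 3-dimensional vectors are all hypothetical and are not taken from the authors' code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def combine_by_sum(image_feat, text_feat):
    """Query feature = element-wise sum of image and text features, L2-normalized."""
    if len(image_feat) != len(text_feat):
        raise ValueError("image and text features must have the same dimension")
    return l2_normalize([i + t for i, t in zip(image_feat, text_feat)])


def contrastive_loss(queries, targets, temperature=0.07):
    """InfoNCE-style loss (an assumption): target i is the positive for query i,
    and every other target in the batch acts as a negative."""
    losses = []
    for i, query in enumerate(queries):
        logits = [
            sum(a * b for a, b in zip(query, target)) / temperature
            for target in targets
        ]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy stand-ins for CLIP features (hypothetical values, not real embeddings).
reference_images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
relative_captions = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
target_images = [l2_normalize([1.0, 0.0, 1.0]), l2_normalize([1.0, 1.0, 0.0])]

queries = [
    combine_by_sum(image, caption)
    for image, caption in zip(reference_images, relative_captions)
]
loss = contrastive_loss(queries, target_images)
print(f"contrastive loss on the toy batch: {loss:.6f}")
```

In this toy batch each summed query points in the same direction as its own target, so the loss is small; pairing the queries with the wrong targets makes it larger. The second-stage Combiner network would replace `combine_by_sum` with a learned function, whose architecture the abstract does not describe.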

arXiv information

Authors	Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto del Bimbo
Published	2023-08-22 15:03:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
