Jina CLIP: Your CLIP Model Is Also Your Text Retriever

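Summary

Contrastive Language-Image Pretraining (CLIP) is widely used to train models that align images and text in a common embedding space by mapping them to fixed-size vectors.
These models are key to multimodal information retrieval and related tasks.
However, CLIP models generally perform worse on text-only tasks than specialized text models.
This makes information retrieval systems less efficient, because they keep separate embeddings and models for text-only and multimodal tasks.
We propose a new multi-task contrastive training method to address this problem and use it to train the jina-clip-v1 model, which achieves state-of-the-art performance on both text-image and text-text retrieval tasks.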

Abstract (original)

Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
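The abstract names a multi-task contrastive training method but gives no formula. As a rough illustration only, the sketch below assumes a symmetric InfoNCE loss per task, a temperature of 0.07, and an unweighted sum of a text-text term and a text-image term. The function names (`info_nce`, `multi_task_loss`) and all of these choices are hypothetical and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE loss; row i of `a` is the positive match for row i of `b`.

    Assumption: this is a generic contrastive loss, not the paper's exact objective.
    """
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # pairwise similarities, shape (n, n)

    def cross_entropy_diag(l):
        # Numerically stable log-softmax over each row, scored on the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))


def multi_task_loss(query_emb, passage_emb, caption_emb, image_emb, temperature=0.07):
    # Hypothetical combination: one text-text term plus one text-image term,
    # equally weighted. The real weighting and schedule are not given in the abstract.
    text_text = info_nce(query_emb, passage_emb, temperature)
    text_image = info_nce(caption_emb, image_emb, temperature)
    return text_text + text_image
```

In this sketch, the text embeddings for both terms would come from the same text encoder, so a single model is pushed to serve text-text and text-image retrieval at once.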

arXiv information

Authors	Andreas Koukounas, Georgios Mastrapas, Michael Günther, Bo Wang, Scott Martens, Isabelle Mohr, Saba Sturua, Mohammad Kalim Akram, Joan Fontanals Martínez, Saahil Ognawala, Susana Guzman, Maximilian Werk, Nan Wang, Han Xiao
Published	2024-06-26 12:31:48+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
