jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

Summary

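Contrastive Language-Image Pretraining (CLIP) is widely used for crossmodal information retrieval and multimodal understanding tasks.
However, CLIP models are optimized mainly for crossmodal vision-language tasks and underperform on text-only tasks.
They are also often trained on English datasets and therefore lack multilingual understanding.
On the visual side, earlier CLIP-based models show insufficient understanding of visually rich documents.
This work proposes jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets, and image-text pairs through a multi-task, multi-stage contrastive learning paradigm so that it supports both text-only and crossmodal tasks.
The authors use a multilingual text encoder and expand the training dataset to include multilingual text from 29 non-English languages, including Hindi, Chinese, German, and French, as well as images of visually rich documents.
Their evaluation shows that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval, in both English and multilingual settings.
jina-clip-v2 also offers flexible embedding dimensionality, letting users choose the granularity of the representations.
jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.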

Abstract (original)

Contrastive Language-Image Pretraining (CLIP) has been widely used for crossmodal information retrieval and multimodal understanding tasks. However, CLIP models are mainly optimized for crossmodal vision-language tasks and underperform in single-mode text tasks. Moreover, these models are often trained on English datasets and therefore lack multilingual understanding. Additionally, from a visual understanding perspective, previous CLIP-based models exhibit insufficient understanding of visually rich documents. In this work, we propose jina-clip-v2, a contrastive vision-language model trained on text pairs, triplets and image-text pairs via a multi-task and multi-stage contrastive learning paradigm in order to support both text-only and crossmodal tasks. We employ a multilingual text encoder and expand the training dataset to include multilingual texts from 29 non-English languages, including Hindi, Chinese, German, French, and others, as well as images of visually rich documents. We evaluate the model’s performance and show that jina-clip-v2 achieves notable improvements over state-of-the-art CLIP-based models in zero-shot text-only retrieval, semantic textual similarity, and crossmodal retrieval tasks in both English and multilingual settings. jina-clip-v2 also provides for flexibility in embedding dimensionality, enabling users to select the granularity of the representations. jina-clip-v2 is publicly available at https://huggingface.co/jinaai/jina-clip-v2.
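The abstract says the embedding dimensionality is flexible but does not say how a user would shrink an embedding. The sketch below illustrates one common approach, assumed here rather than taken from the abstract: keep the leading components of a vector and rescale it to unit length before comparing embeddings by cosine similarity. The helper names and the toy 8-dimensional vectors are hypothetical and are not outputs of jina-clip-v2.

```python
import math


def truncate_and_normalize(vec, dim):
    """Keep the first `dim` components and rescale to unit L2 norm.

    Assumption: leading components carry the coarse information, as in
    Matryoshka-style embeddings. The abstract does not state this.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for a text embedding and an image embedding.
text_emb = [0.9, 0.1, 0.3, -0.2, 0.05, 0.4, -0.1, 0.2]
image_emb = [0.8, 0.2, 0.25, -0.1, -0.3, 0.1, 0.3, -0.2]

# Compare the same pair at several granularities.
for dim in (8, 4, 2):
    t = truncate_and_normalize(text_emb, dim)
    i = truncate_and_normalize(image_emb, dim)
    print(dim, round(cosine_similarity(t, i), 3))
```

Smaller dimensions reduce storage and search cost, usually at some cost in retrieval quality; how large that cost is for jina-clip-v2 is a question for the paper's evaluation, not for this toy example.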

arXiv information

Authors	Andreas Koukounas, Georgios Mastrapas, Sedigheh Eslami, Bo Wang, Mohammad Kalim Akram, Michael Günther, Isabelle Mohr, Saba Sturua, Nan Wang, Han Xiao
Published	2025-04-24 16:22:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
