LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation

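Summary

CLIP is a foundational multimodal model that aligns image and text features in a shared representation space through contrastive learning on large-scale image-text pairs.
Its effectiveness stems mainly from the use of natural language as rich supervision.
Motivated by the remarkable progress of large language models (LLMs), this work explores how the superior text understanding and broad open-world knowledge of LLMs can strengthen CLIP, particularly for handling longer and more complex image captions.
We propose an efficient post-training strategy that integrates an LLM into pretrained CLIP.
To address the challenge posed by the autoregressive nature of LLMs, we introduce a caption-to-caption contrastive fine-tuning framework that significantly improves the discriminative quality of LLM outputs.
Extensive experiments show that our approach outperforms LoRA-based methods, training nearly four times faster while achieving better performance.
Furthermore, we validate substantial improvements over state-of-the-art models such as CLIP, EVA02, and SigLip2 across a range of zero-shot multimodal retrieval tasks, cross-lingual retrieval tasks, and multimodal language model pretraining.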

Abstract (original)

CLIP is a foundational multimodal model that aligns image and text features into a shared representation space via contrastive learning on large-scale image-text pairs. Its effectiveness primarily stems from the use of natural language as rich supervision. Motivated by the remarkable advancements in large language models (LLMs), this work explores how LLMs’ superior text understanding and extensive open-world knowledge can enhance CLIP’s capability, especially for processing longer and more complex image captions. We propose an efficient post-training strategy that integrates LLMs into pretrained CLIP. To address the challenge posed by the autoregressive nature of LLMs, we introduce a caption-to-caption contrastive fine-tuning framework, significantly enhancing the discriminative quality of LLM outputs. Extensive experiments demonstrate that our approach outperforms LoRA-based methods, achieving nearly fourfold faster training with superior performance. Furthermore, we validate substantial improvements over state-of-the-art models such as CLIP, EVA02, and SigLip2 across various zero-shot multimodal retrieval tasks, cross-lingual retrieval tasks, and multimodal language model pretraining.
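The abstract names contrastive learning and a caption-to-caption contrastive fine-tuning framework but does not spell out the objective. As a rough illustration only, the sketch below shows a generic symmetric contrastive (InfoNCE-style) loss over paired embeddings, where row i of one batch should match row i of the other. The function names, the temperature of 0.07, and the toy embeddings are assumptions made for this example and are not taken from the paper or its code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric contrastive loss between two batches of paired embeddings.

    emb_a[i] and emb_b[i] are treated as a positive pair (for example two
    captions of the same image); all other pairings in the batch are negatives.
    The temperature value is an assumed, commonly used default.
    """
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    # Similarity matrix: logits[i][j] = cos(a_i, b_j) / temperature
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy example: matched pairs give a small loss, swapped pairs a large one.
matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(matched < swapped)
```

In this reading, the same loss shape applies whether the two batches are image and text embeddings or two sets of caption embeddings; which pairs count as positives is the part that differs.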

arXiv information

Authors	Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Chunyu Wang, Xiyang Dai, Dongdong Chen, Chong Luo, Lili Qiu
Published	2025-05-07 16:51:33+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
