EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning

Abstract

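Image-text matching has advanced notably in recent years, yet prevailing models mainly serve broad queries and struggle to accommodate fine-grained query intent.
In this paper, we work towards Entity-centric Image-Text Matching (EITM), a task in which both the text and the image involve specific entity-related information.
The main challenge of this task is that entity association modeling involves a larger semantic gap than general image-text matching. To narrow this gap between entity-centric text and images, we take CLIP as the backbone and devise a multimodal attentive contrastive learning framework that adapts CLIP to the EITM problem, yielding a model named EntityCLIP.
The key to multimodal attentive contrastive learning is to use Large Language Models (LLMs) to generate interpretive explanation text that serves as a bridging clue.
Specifically, we first extract explanation text from off-the-shelf LLMs.
This explanation text, together with the image and the text, is fed into a purpose-built Multimodal Attentive Experts (MMAE) module, which integrates the explanation text to narrow the gap between entity-related text and images in a shared semantic space.
Building on the enriched features derived from MMAE, we further design a Gated Integrative Image-Text Matching (GI-ITM) strategy.
GI-ITM uses an adaptive gating mechanism to aggregate the MMAE features and then applies image-text matching constraints to steer the alignment between text and image.
Extensive experiments are conducted on three social media news benchmarks, N24News, VisualNews, and GoodNews. The results show that the method outperforms competing methods by a clear margin.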
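
The abstract does not give the exact form of the gate or of the matching objective, so the following is only a rough sketch of the general idea: several expert feature vectors are blended with gate weights, and a contrastive loss pulls matched image-text pairs together. The softmax gate, the symmetric InfoNCE loss, the temperature of 0.07, and every function name here are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gated_aggregate(expert_feats, gate_logits):
    """Blend K expert feature vectors using softmax gate weights.

    Assumption: the gate is a softmax over one logit per expert; the
    paper's actual gating function may differ.
    """
    weights = softmax(gate_logits)
    dim = len(expert_feats[0])
    return [sum(w * f[d] for w, f in zip(weights, expert_feats))
            for d in range(dim)]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch; pair i is the positive match.

    Assumption: a CLIP-style objective stands in for the matching
    constraints mentioned in the abstract.
    """
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    n = len(imgs)
    # Cosine similarity between every image and every text, scaled by 1/T.
    sims = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]
    loss = 0.0
    for k in range(n):
        row = softmax(sims[k])                          # image k vs. all texts
        col = softmax([sims[j][k] for j in range(n)])   # text k vs. all images
        loss -= (math.log(row[k]) + math.log(col[k])) / 2
    return loss / n
```

With equal gate logits, `gated_aggregate` reduces to a plain average of the expert features. The loss is small when each image is most similar to its own text and grows when the pairs are mismatched.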

Abstract (original)

Recent advancements in image-text matching have been notable, yet prevailing models predominantly cater to broad queries and struggle with accommodating fine-grained query intention. In this paper, we work towards Entity-centric Image-Text Matching (EITM), a task in which the text and image involve specific entity-related information. The challenge of this task mainly lies in the larger semantic gap in entity association modeling, compared with the general image-text matching problem. To narrow the huge semantic gap between the entity-centric text and the images, we take the fundamental CLIP as the backbone and devise a multimodal attentive contrastive learning framework to tame CLIP to adapt to the EITM problem, developing a model named EntityCLIP. The key of our multimodal attentive contrastive learning is to generate interpretive explanation text using Large Language Models (LLMs) as the bridge clues. Specifically, we proceed by extracting explanatory text from off-the-shelf LLMs. This explanation text, coupled with the image and text, is then input into our specially crafted Multimodal Attentive Experts (MMAE) module, which effectively integrates explanation texts to narrow the gap of the entity-related text and image in a shared semantic space. Building on the enriched features derived from MMAE, we further design an effective Gated Integrative Image-text Matching (GI-ITM) strategy. The GI-ITM employs an adaptive gating mechanism to aggregate MMAE’s features, subsequently applying image-text matching constraints to steer the alignment between the text and the image. Extensive experiments are conducted on three social media news benchmarks including N24News, VisualNews, and GoodNews; the results show that our method surpasses the competing methods by a clear margin.

arXiv information

Authors	Yaxiong Wang, Yujiao Wu, Lianwei Wu, Lechao Cheng, Zhun Zhong, Meng Wang
Published	2025-04-10 14:23:37+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
