Multimodal Reasoning Agent for Zero-Shot Composed Image Retrieval

Summary

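Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, made up of a reference image and a modification text, without relying on annotated training data.
Existing approaches often use large language models (LLMs) to generate a synthetic target text that serves as an intermediate anchor between the compositional query and the target image.
Models are then trained to align the compositional query with the generated text and, separately, to align images with their corresponding texts using contrastive learning.
However, this reliance on intermediate text introduces error propagation: inaccuracies in the query-to-text and text-to-image mappings accumulate and ultimately degrade retrieval performance.
To address these problems, we propose a novel framework that employs a Multimodal Reasoning Agent (MRA) for ZS-CIR.
MRA eliminates the dependence on textual intermediaries by directly constructing triplets, <reference image, modification text, target image>, using only unlabeled image data.
By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly.
Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of the approach.
On the FashionIQ dataset, the method improves Average R@10 by at least 7.5% over existing baselines.
On CIRR, it boosts R@1 by 9.6%.
On CIRCO, it increases mAP@5 by 9.5%.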

Summary (original)

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve target images given a compositional query, consisting of a reference image and a modifying text, without relying on annotated training data. Existing approaches often generate a synthetic target text using large language models (LLMs) to serve as an intermediate anchor between the compositional query and the target image. Models are then trained to align the compositional query with the generated text, and separately align images with their corresponding texts using contrastive learning. However, this reliance on intermediate text introduces error propagation, as inaccuracies in query-to-text and text-to-image mappings accumulate, ultimately degrading retrieval performance. To address these problems, we propose a novel framework by employing a Multimodal Reasoning Agent (MRA) for ZS-CIR. MRA eliminates the dependence on textual intermediaries by directly constructing triplets, <reference image, modification text, target image>, using only unlabeled image data. By training on these synthetic triplets, our model learns to capture the relationships between compositional queries and candidate images directly. Extensive experiments on three standard CIR benchmarks demonstrate the effectiveness of our approach. On the FashionIQ dataset, our method improves Average R@10 by at least 7.5% over existing baselines; on CIRR, it boosts R@1 by 9.6%; and on CIRCO, it increases mAP@5 by 9.5%.
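
The reported numbers use two retrieval metrics, Recall@K (R@K) and mean average precision at K (mAP@K). As a rough illustration of what those metrics measure, here is a minimal Python sketch. It is not the authors' evaluation code: the function names and the toy image ids are hypothetical, and the normalisation of average precision by min(K, number of relevant images) is an assumed convention that should be checked against each benchmark's official evaluation script.

```python
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant image id is among the top-k results, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average precision over the top-k results.

    Assumed convention: the sum of precision values at each hit is divided
    by min(k, number of relevant images).
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, len(relevant))


def mean_over_queries(scores: Sequence[float]) -> float:
    """Average a per-query metric over all queries (0.0 for an empty list)."""
    return sum(scores) / len(scores) if scores else 0.0


# Toy example with hypothetical image ids: one query, two relevant targets.
ranking = ["img_a", "img_b", "img_c", "img_d", "img_e"]
targets = {"img_b", "img_d"}
r_at_1 = recall_at_k(ranking, targets, k=1)
r_at_10 = recall_at_k(ranking, targets, k=10)
ap_at_5 = average_precision_at_k(ranking, targets, k=5)
```

Averaging `recall_at_k` or `average_precision_at_k` over all queries with `mean_over_queries` gives the dataset-level R@K or mAP@K under these assumptions.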

arXiv information

Authors	Rong-Cheng Tu, Wenhao Sun, Hanzhe You, Yingjie Wang, Jiaxing Huang, Li Shen, Dacheng Tao
Published	2025-05-26 13:17:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
