CoVR: Learning Composed Video Retrieval from Web Video Captions

Summary

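Composed Image Retrieval (CoIR) has recently gained popularity as a task that takes a text query and an image query together to search for relevant images in a database.
Most CoIR approaches require manually annotated datasets of image-text-image triplets, in which the text describes the modification from the query image to the target image.
However, curating CoIR triplets by hand is expensive and limits scalability.
This work instead proposes a scalable, automatic dataset creation methodology that generates triplets from video-caption pairs, and at the same time extends the scope of the task to composed video retrieval (CoVR).
To this end, the authors mine pairs of videos with similar captions from a large database and use a large language model to generate the corresponding modification text.
Applying this methodology to the extensive WebVid2M collection automatically yields the WebVid-CoVR dataset, which contains 1.6 million triplets.
The authors also introduce a new benchmark for CoVR with a manually annotated evaluation set, together with baseline results.
Their experiments further show that training a CoVR model on this dataset transfers effectively to CoIR, improving on the state of the art in the zero-shot setup on both the CIRR and FashionIQ benchmarks.
The code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.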

Abstract (original)

Composed Image Retrieval (CoIR) has recently gained popularity as a task that considers both text and image queries together, to search for relevant images in a database. Most CoIR approaches require manually annotated datasets, comprising image-text-image triplets, where the text describes a modification from the query image to the target image. However, manual curation of CoIR triplets is expensive and prevents scalability. In this work, we instead propose a scalable automatic dataset creation methodology that generates triplets given video-caption pairs, while also expanding the scope of the task to include composed video retrieval (CoVR). To this end, we mine paired videos with a similar caption from a large database, and leverage a large language model to generate the corresponding modification text. Applying this methodology to the extensive WebVid2M collection, we automatically construct our WebVid-CoVR dataset, resulting in 1.6 million triplets. Moreover, we introduce a new benchmark for CoVR with a manually annotated evaluation set, along with baseline results. Our experiments further demonstrate that training a CoVR model on our dataset effectively transfers to CoIR, leading to improved state-of-the-art performance in the zero-shot setup on both the CIRR and FashionIQ benchmarks. Our code, datasets, and models are publicly available at https://imagine.enpc.fr/~ventural/covr.
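
The triplet-generation idea described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' implementation: the abstract only says that paired videos have "a similar caption", so the sketch uses "same length, exactly one differing word" as a hypothetical similarity criterion, and a fixed text template stands in for the large language model that generates the modification text. The function names (`mine_caption_pairs`, `modification_text`, `build_triplets`) and the toy captions are likewise hypothetical.

```python
from collections import defaultdict


def mine_caption_pairs(captions):
    """Find caption pairs that differ in exactly one word position.

    captions: dict mapping a video id to its caption string.
    Returns sorted tuples (query_id, target_id, removed_word, added_word).
    The one-word-difference rule is an assumed similarity criterion.
    """
    buckets = defaultdict(list)
    for video_id, caption in captions.items():
        words = caption.lower().split()
        # Mask each position in turn. Captions that share a masked key
        # are identical everywhere except (possibly) at that position.
        for i in range(len(words)):
            key = (len(words), i, tuple(words[:i] + words[i + 1:]))
            buckets[key].append((video_id, words[i]))

    pairs = []
    for members in buckets.values():
        for a_id, a_word in members:
            for b_id, b_word in members:
                # Skip self-pairs and videos with identical captions.
                if a_id != b_id and a_word != b_word:
                    pairs.append((a_id, b_id, a_word, b_word))
    return sorted(pairs)


def modification_text(removed_word, added_word):
    # Template stand-in (assumption) for the language model that the
    # abstract says generates the modification text.
    return f"Replace {removed_word} with {added_word}"


def build_triplets(captions):
    """Return (query_video, modification_text, target_video) triplets."""
    return [
        (query_id, modification_text(removed, added), target_id)
        for query_id, target_id, removed, added in mine_caption_pairs(captions)
    ]


if __name__ == "__main__":
    # Toy captions, invented for this example.
    toy_captions = {
        "v1": "a dog running on the beach",
        "v2": "a cat running on the beach",
        "v3": "a man cooking dinner",
    }
    for triplet in build_triplets(toy_captions):
        print(triplet)
```

Masking one word position at a time and bucketing by the remaining words avoids comparing every caption against every other caption, which matters when the caption collection is large. Each mined pair produces two triplets, one per direction, since either video can serve as the query.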

arXiv information

Authors	Lucas Ventura, Antoine Yang, Cordelia Schmid, Gül Varol
Published	2023-08-28 17:55:33+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
