Distilling Vision-Language Pretraining for Efficient Cross-Modal Retrieval

Abstract

"Learning to hash" is a practical solution for efficient retrieval, offering fast search speed and low storage cost. It is widely used in applications such as image-text cross-modal search. In this paper, we explore whether the proliferation of powerful large pre-trained models, such as Vision-Language Pre-training (VLP) models, can enhance the performance of learning to hash. We introduce a novel method named Distillation for Cross-Modal Quantization (DCMQ), which leverages the rich semantic knowledge of VLP models to improve hash representation learning. Specifically, we use the VLP model as a "teacher" to distill knowledge into a "student" hashing model equipped with codebooks. This process replaces the supervised labels, which are multi-hot vectors and lack semantics, with the rich semantics of the VLP model. Finally, we apply a transformation termed Normalization with Paired Consistency (NPC) to obtain a discriminative target for distillation. Further, we introduce a new quantization method, Product Quantization with Gumbel (PQG), which promotes balanced codebook learning and thereby improves retrieval performance. Extensive benchmark testing demonstrates that DCMQ consistently outperforms existing supervised cross-modal hashing approaches.
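
The abstract does not describe how PQG works internally. As a rough illustration of the two ingredients its name points to, the sketch below combines standard product quantization with Gumbel-softmax codeword sampling in NumPy. Everything specific here is an assumption made for illustration and not the authors' implementation: the function names, the shapes (`M` codebooks of `K` codewords each), the temperature `tau`, and the use of negative squared distance as logits.

```python
import numpy as np


def gumbel_softmax(logits, tau, rng):
    """Relaxed one-hot sample along the last axis (Gumbel-softmax trick)."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))              # Gumbel(0, 1) noise
    y = (logits + gumbel) / tau
    y = y - y.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(y)
    return e / e.sum(axis=-1, keepdims=True)


def product_quantize(x, codebooks, tau=1.0, rng=None):
    """
    Hypothetical helper, not taken from the paper.
    x:         (N, D) feature vectors
    codebooks: (M, K, D // M) codewords, one codebook per sub-space
    Returns a soft reconstruction (N, D) and hard codes (N, M).
    """
    rng = np.random.default_rng() if rng is None else rng
    M, K, d = codebooks.shape
    assert x.shape[1] == M * d, "feature dim must equal M * sub-vector dim"
    sub = x.reshape(x.shape[0], M, d)                      # (N, M, d)
    diff = sub[:, :, None, :] - codebooks[None, :, :, :]   # (N, M, K, d)
    logits = -np.sum(diff ** 2, axis=-1)                   # (N, M, K)
    weights = gumbel_softmax(logits, tau, rng)             # noisy soft assignment
    soft = np.einsum("nmk,mkd->nmd", weights, codebooks)
    codes = logits.argmax(axis=-1)                         # nearest codeword index
    return soft.reshape(x.shape[0], M * d), codes


rng = np.random.default_rng(0)
M, K, d = 4, 16, 8
codebooks = rng.normal(size=(M, K, d))
x = rng.normal(size=(5, M * d))
soft, codes = product_quantize(x, codebooks, tau=0.5, rng=rng)
bits_per_item = M * int(np.log2(K))   # each item is stored as M indices of log2(K) bits
```

With these assumed sizes, each item is stored as 4 codeword indices of 4 bits each, which is where the low storage cost of quantization-based hashing comes from. In a sketch like this, the Gumbel noise perturbs the assignment during training, so codewords other than the current nearest one can also be selected; how the paper actually uses this to balance codebooks is not stated in the abstract.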

arXiv information

Authors	Young Kyun Jang, Donghyun Kim, Ser-nam Lim
Published	2024-05-23 15:54:59+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
