LSEH: Semantically Enhanced Hard Negatives for Cross-modal Information Retrieval

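Abstract

Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions and embed them into the same latent space for cross-modal information retrieval.
Most existing VSE networks are trained with a hard negatives loss function, which learns an objective margin between the similarities of relevant and irrelevant image-description embedding pairs.
However, the objective margin in the hard negatives loss function is set as a fixed hyperparameter, which ignores the semantic differences between irrelevant image-description pairs.
To address the challenge of measuring the optimal similarities between image-description pairs before a trained VSE network is available, the paper presents a novel approach that comprises two main parts:
(1) it finds the underlying semantics of image descriptions; and
(2) it proposes a novel semantically enhanced hard negatives loss function, in which the learning objective is determined dynamically from the optimal similarity scores between irrelevant image-description pairs.
Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks applied to three benchmark datasets for cross-modal information retrieval tasks.
The results showed that the proposed methods achieved the best performance and can also be adopted by existing and future VSE networks.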
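
The contrast between a fixed margin and a dynamically determined one can be illustrated with a small sketch. The abstract does not give the loss formula, so everything here is an assumption made for illustration: the loss is written in the commonly used max-of-hinges form, and the names `hard_negative_loss`, `fixed_margin`, `semantic_margin`, `alpha`, and `desc_sim` are hypothetical. In particular, the linear rule inside `semantic_margin` is not the paper's method; it only shows the idea of letting the margin depend on how semantically close an irrelevant pair is.

```python
def hard_negative_loss(sim, margin):
    """Max-of-hinges loss over a square similarity matrix.

    sim[i][j]    similarity of image i and description j
                 (the diagonal holds the relevant pairs).
    margin[i][j] margin applied to the irrelevant pair (i, j).
    """
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        # Hardest irrelevant description for image i.
        hardest_desc = max(
            (max(0.0, margin[i][j] + sim[i][j] - pos)
             for j in range(n) if j != i),
            default=0.0,
        )
        # Hardest irrelevant image for description i.
        hardest_img = max(
            (max(0.0, margin[j][i] + sim[j][i] - pos)
             for j in range(n) if j != i),
            default=0.0,
        )
        loss += hardest_desc + hardest_img
    return loss


def fixed_margin(n, alpha=0.2):
    """Baseline: the same margin for every irrelevant pair."""
    return [[alpha] * n for _ in range(n)]


def semantic_margin(desc_sim, alpha=0.2):
    """ASSUMPTION (not the paper's formula): shrink the margin linearly
    as the two descriptions become more semantically similar.

    desc_sim[i][j] is a similarity in [0, 1] between the description
    paired with image i and description j.
    """
    return [[alpha * (1.0 - s) for s in row] for row in desc_sim]


# Toy batch of two image-description pairs.
sim = [[0.9, 0.8],
       [0.1, 0.7]]
desc_sim = [[1.0, 0.5],
            [0.5, 1.0]]

loss_fixed = hard_negative_loss(sim, fixed_margin(2))
loss_semantic = hard_negative_loss(sim, semantic_margin(desc_sim))
```

In this toy batch the irrelevant pairs have fairly similar descriptions, so the semantic margin is smaller than the fixed one and those pairs are penalized less. A fixed hyperparameter cannot express that distinction, which is the limitation the abstract points to.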

Abstract (original)

Visual Semantic Embedding (VSE) aims to extract the semantics of images and their descriptions, and embed them into the same latent space for cross-modal information retrieval. Most existing VSE networks are trained by adopting a hard negatives loss function which learns an objective margin between the similarity of relevant and irrelevant image-description embedding pairs. However, the objective margin in the hard negatives loss function is set as a fixed hyperparameter that ignores the semantic differences of the irrelevant image-description pairs. To address the challenge of measuring the optimal similarities between image-description pairs before obtaining the trained VSE networks, this paper presents a novel approach that comprises two main parts: (1) finds the underlying semantics of image descriptions; and (2) proposes a novel semantically enhanced hard negatives loss function, where the learning objective is dynamically determined based on the optimal similarity scores between irrelevant image-description pairs. Extensive experiments were carried out by integrating the proposed methods into five state-of-the-art VSE networks that were applied to three benchmark datasets for cross-modal information retrieval tasks. The results revealed that the proposed methods achieved the best performance and can also be adopted by existing and future VSE networks.

arXiv information

Authors	Yan Gong, Georgina Cosma
Published	2022-10-10 15:09:39+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
