GMMFormer: Gaussian-Mixture-Model Based Transformer for Efficient Partially Relevant Video Retrieval

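Abstract

Given a text query, partially relevant video retrieval (PRVR) aims to find untrimmed videos in a database that contain relevant moments. In PRVR, clip modeling is essential for capturing the partial relationship between text and video. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, but this is information-redundant and incurs a large storage overhead. To address this efficiency problem, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer that models clip representations implicitly. During frame interactions, it incorporates Gaussian-Mixture-Model constraints so that each frame focuses on its adjacent frames rather than on the whole video. The generated representations therefore contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore the semantic differences between text queries relevant to the same video, which leads to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and richer in semantic information. Extensive experiments on three large-scale video datasets (TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. Code is available at https://github.com/huangmozhi9527/GMMFormer.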

Abstract (original)

Given a text query, partially relevant video retrieval (PRVR) seeks to find untrimmed videos containing pertinent moments in a database. For PRVR, clip modeling is essential to capture the partial relationship between texts and videos. Current PRVR methods adopt scanning-based clip construction to achieve explicit clip modeling, which is information-redundant and requires a large storage overhead. To solve the efficiency problem of PRVR methods, this paper proposes GMMFormer, a Gaussian-Mixture-Model based Transformer which models clip representations implicitly. During frame interactions, we incorporate Gaussian-Mixture-Model constraints to focus each frame on its adjacent frames instead of the whole video. Then generated representations will contain multi-scale clip information, achieving implicit clip modeling. In addition, PRVR methods ignore semantic differences between text queries relevant to the same video, leading to a sparse embedding space. We propose a query diverse loss to distinguish these text queries, making the embedding space more intensive and contain more semantic information. Extensive experiments on three large-scale video datasets (i.e., TVR, ActivityNet Captions, and Charades-STA) demonstrate the superiority and efficiency of GMMFormer. Code is available at https://github.com/huangmozhi9527/GMMFormer.
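The abstract describes constraining frame interactions so that each frame attends mainly to its neighbours, at several scales. The sketch below is one possible reading of that idea, not the authors' implementation: it multiplies ordinary self-attention weights by a Gaussian window centred on each frame, renormalises, and averages a few branches with different widths. The function names, the identity query/key/value projections, the sigma values, and the averaging step are all assumptions made for illustration; the linked repository is the reference for the actual method.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def gaussian_attention(frames, sigma):
    """Self-attention whose weights are damped by a Gaussian window.

    frames: list of equal-length feature vectors, one per frame.
    sigma:  window width (hypothetical parameter); small values keep
            each frame focused on its immediate neighbours.
    Returns (outputs, weights). No learned projections are used here,
    which is a simplification for the sketch.
    """
    n, d = len(frames), len(frames[0])
    outputs, weights = [], []
    for i in range(n):
        # Scaled dot-product scores between frame i and every frame j.
        scores = [
            sum(a * b for a, b in zip(frames[i], frames[j])) / math.sqrt(d)
            for j in range(n)
        ]
        attn = softmax(scores)
        # Gaussian window centred on frame i, then renormalise to sum to 1.
        attn = [
            a * math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
            for j, a in enumerate(attn)
        ]
        total = sum(attn)
        attn = [a / total for a in attn]
        outputs.append(
            [sum(attn[j] * frames[j][k] for j in range(n)) for k in range(d)]
        )
        weights.append(attn)
    return outputs, weights


def multi_scale_frames(frames, sigmas=(0.5, 2.0, 8.0)):
    """Average several Gaussian-windowed branches (assumed aggregation)."""
    branches = [gaussian_attention(frames, s)[0] for s in sigmas]
    n, d = len(frames), len(frames[0])
    return [
        [sum(b[i][k] for b in branches) / len(branches) for k in range(d)]
        for i in range(n)
    ]
```

A narrow window yields frame features that summarise short clips, while a wide window approaches plain self-attention over the whole video, so combining several widths gives each frame a mix of clip scales without constructing clips explicitly.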

arXiv information

Authors	Yuting Wang, Jinpeng Wang, Bin Chen, Ziyun Zeng, Shu-Tao Xia
Published	2024-01-03 07:40:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
