Prototypes are Balanced Units for Efficient and Effective Partially Relevant Video Retrieval

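Summary

In a retrieval system, achieving search accuracy and efficiency at the same time is inherently difficult.
This challenge is especially pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video improves accuracy but increases computational and memory costs.
To address this trade-off, we propose a prototypical PRVR framework that encodes the diverse contexts within a video into a fixed number of prototypes.
We then introduce several strategies to strengthen text association and video understanding within the prototypes, along with an orthogonal objective that ensures the prototypes capture diverse content.
To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross-modal and uni-modal reconstruction tasks.
The cross-modal reconstruction task aligns the prototypes with textual features in a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding.
In addition, we employ a video mixing technique that provides weak guidance to further align the prototypes with their associated textual representations.
Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of the approach without sacrificing efficiency.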

Abstract (original)

In a retrieval system, simultaneously achieving search accuracy and efficiency is inherently challenging. This challenge is particularly pronounced in partially relevant video retrieval (PRVR), where incorporating more diverse context representations at varying temporal scales for each video enhances accuracy but increases computational and memory costs. To address this dichotomy, we propose a prototypical PRVR framework that encodes diverse contexts within a video into a fixed number of prototypes. We then introduce several strategies to enhance text association and video understanding within the prototypes, along with an orthogonal objective to ensure that the prototypes capture a diverse range of content. To keep the prototypes searchable via text queries while accurately encoding video contexts, we implement cross- and uni-modal reconstruction tasks. The cross-modal reconstruction task aligns the prototypes with textual features within a shared space, while the uni-modal reconstruction task preserves all video contexts during encoding. Additionally, we employ a video mixing technique to provide weak guidance to further align prototypes and associated textual representations. Extensive evaluations on TVR, ActivityNet-Captions, and QVHighlights validate the effectiveness of our approach without sacrificing efficiency.
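To make the efficiency argument concrete, the following is a minimal NumPy sketch of the general idea of compressing a variable-length video into a fixed number of prototypes. It is not the authors' implementation: the abstract does not specify the architecture, so the attention-style pooling with a set of query vectors, the particular form of the orthogonality penalty, the max-over-prototypes scoring, and all function names (`encode_prototypes`, `orthogonality_penalty`, `video_score`) are assumptions made for illustration only. The reconstruction tasks and video mixing are not shown.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x):
    """Scale vectors along the last axis to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)


def encode_prototypes(frames, queries):
    """Pool (T, D) frame features into (K, D) prototypes.

    Assumption: scaled dot-product attention with K query vectors
    (learned in a real model, random here) stands in for the encoder.
    """
    d = frames.shape[1]
    attn = softmax(queries @ frames.T / np.sqrt(d), axis=-1)  # (K, T)
    return attn @ frames  # (K, D), independent of T


def orthogonality_penalty(prototypes):
    """Mean squared off-diagonal cosine similarity between prototypes.

    Assumption: one simple way to encourage diverse prototypes;
    it is zero when the prototypes are mutually orthogonal.
    """
    p = l2_normalize(prototypes)
    gram = p @ p.T
    off_diagonal = gram[~np.eye(gram.shape[0], dtype=bool)]
    return float((off_diagonal ** 2).mean())


def video_score(query_vec, prototypes):
    """Score a text query against a video by its best-matching prototype."""
    sims = l2_normalize(prototypes) @ l2_normalize(query_vec)
    return float(sims.max())


# Hypothetical sizes: K prototypes of dimension D.
K, D = 4, 16
rng = np.random.default_rng(0)
queries = rng.normal(size=(K, D))

short_video = rng.normal(size=(20, D))   # 20 frame features
long_video = rng.normal(size=(500, D))   # 500 frame features

# Both videos are stored as a (K, D) array regardless of their length.
short_protos = encode_prototypes(short_video, queries)
long_protos = encode_prototypes(long_video, queries)

text_query = rng.normal(size=D)
score = video_score(text_query, long_protos)
```

Under these assumptions, the per-video storage and the per-query comparison cost depend only on K and D, not on the video length, which is the kind of accuracy/efficiency balance the abstract describes.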

arXiv information

Authors	WonJun Moon, Cheol-Ho Cho, Woojin Jun, Minho Shim, Taeoh Kim, Inwoong Lee, Dongyoon Wee, Jae-Pil Heo
Published	2025-04-17 15:43:29+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
