Is it Really Negative? Evaluating Natural Language Video Localization Performance on Multiple Reliable Videos Pool

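Summary

With the recent surge in multimedia content, Video Corpus Moment Retrieval (VCMR), which aims to detect the video moment matching a given natural language query across multiple videos, has become a critical problem.
However, existing VCMR studies have a significant limitation: they treat every video not paired with a given query as negative, ignoring the possibility that false negatives are included when the negative video set is constructed.
This paper proposes the MVMR (Massive Videos Moment Retrieval) task, which aims to localize video frames within a massive video set while mitigating the risk of incorrectly distinguishing positive from negative videos.
For this task, we propose an automatic dataset construction framework that applies textual and visual semantic matching evaluation methods to existing video moment search datasets, and we introduce three MVMR datasets.
To solve the MVMR task, we further propose CroCs, a strong method that employs cross-directional contrastive learning to selectively identify reliable and informative negatives, strengthening model robustness on the MVMR task.
Experimental results on the introduced datasets show that existing video moment search models are easily distracted by negative video frames, whereas our model achieves notably strong performance.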

Abstract (original)

With the explosion of multimedia content in recent years, Video Corpus Moment Retrieval (VCMR), which aims to detect a video moment that matches a given natural language query from multiple videos, has become a critical problem. However, existing VCMR studies have a significant limitation since they have regarded all videos not paired with a specific query as negative, neglecting the possibility of including false negatives when constructing the negative video set. In this paper, we propose an MVMR (Massive Videos Moment Retrieval) task that aims to localize video frames within a massive video set, mitigating the possibility of falsely distinguishing positive and negative videos. For this task, we suggest an automatic dataset construction framework by employing textual and visual semantic matching evaluation methods on the existing video moment search datasets and introduce three MVMR datasets. To solve MVMR task, we further propose a strong method, CroCs, which employs cross-directional contrastive learning that selectively identifies the reliable and informative negatives, enhancing the robustness of a model on MVMR task. Experimental results on the introduced datasets reveal that existing video moment search models are easily distracted by negative video frames, whereas our model shows significant performance.

arXiv information

Authors	Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, Kyomin Jung
Published	2024-03-18 08:55:36+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
