Language-free Training for Zero-shot Video Grounding

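Summary

Given an untrimmed video and a language query describing a specific temporal moment in that video, video grounding aims to localize the time interval by understanding the text and the video simultaneously.
One of the most challenging issues is collecting annotations, which include video captions in natural language form and their corresponding temporal regions, and which is extremely time-consuming and costly.
This paper presents a simple yet novel training framework for video grounding in the zero-shot setting, which trains a network using only video data, without any annotation.
Inspired by the recent language-free paradigm, i.e., training without language data, the network is trained without forcing the generated fake (pseudo) text queries into natural language form.
Specifically, the authors propose learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and treating the visual feature selected in that interval as a language feature, with the help of CLIP's well-aligned visual-language space.
Extensive experiments demonstrate the strength of the language-free training framework, which outperforms the existing zero-shot video grounding method, and even several weakly-supervised approaches, by large margins on two standard datasets.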

Summary (original)

Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously. One of the most challenging issues is an extremely time- and cost-consuming annotation collection, including video captions in a natural language form and their corresponding temporal regions. In this paper, we present a simple yet novel training framework for video grounding in the zero-shot setting, which learns a network with only video data without any annotation. Inspired by the recent language-free paradigm, i.e. training without language data, we train the network without compelling the generation of fake (pseudo) text queries into a natural language form. Specifically, we propose a method for learning a video grounding model by selecting a temporal interval as a hypothetical correct answer and considering the visual feature selected by our method in the interval as a language feature, with the help of the well-aligned visual-language space of CLIP. Extensive experiments demonstrate the prominence of our language-free training framework, outperforming the existing zero-shot video grounding method and even several weakly-supervised approaches with large margins on two standard datasets.
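The pseudo-supervision idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the function names, the uniform random interval sampling, and the mean pooling used here are all assumptions made for illustration, and the random array stands in for per-frame CLIP image embeddings. The abstract only states that an interval is chosen as a hypothetical answer and that a visual feature selected from it is used in place of a language feature.

```python
import numpy as np


def sample_pseudo_interval(num_frames, rng, min_len=2):
    """Pick a random [start, end) frame interval to act as pseudo ground truth.

    Uniform sampling is an assumption; the paper may choose intervals differently.
    """
    length = int(rng.integers(min_len, num_frames + 1))
    start = int(rng.integers(0, num_frames - length + 1))
    return start, start + length


def pseudo_language_feature(frame_feats, start, end):
    """Mean-pool the frame features in the interval and L2-normalize them.

    Mean pooling is a stand-in for the paper's feature selection step.
    """
    pooled = frame_feats[start:end].mean(axis=0)
    return pooled / np.linalg.norm(pooled)


def temporal_iou(a, b):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0


rng = np.random.default_rng(0)
# Hypothetical stand-in for CLIP image embeddings of 32 video frames.
frame_feats = rng.normal(size=(32, 512))

start, end = sample_pseudo_interval(len(frame_feats), rng)
query = pseudo_language_feature(frame_feats, start, end)
```

In a training loop under these assumptions, `query` would be fed to the grounding model in place of a text embedding, and the model's predicted interval would be scored against `(start, end)`, for example with `temporal_iou`.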

arXiv information

Authors	Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn
Published	2022-10-24 06:55:29+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
