Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos

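Abstract

Video temporal grounding aims to pinpoint the video segment that matches a query description.
Despite recent advances on short-form videos (\textit{e.g.}, minutes long), temporal grounding in long videos (\textit{e.g.}, hours long) is still at an early stage.
A common practice for long videos is to employ a sliding window, but this can be inefficient and inflexible because of the limited number of frames within each window.
In this work, we propose an end-to-end framework for fast temporal grounding that can model an hours-long video with \textbf{one-time} network execution.
The pipeline is formulated in a coarse-to-fine manner: it first extracts context knowledge from non-overlapping video clips (\textit{i.e.}, anchors), and then supplements the anchors that respond strongly to the query with detailed content knowledge.
Besides the high pipeline efficiency, another advantage of the approach is that modeling the entire video as a whole captures long-range temporal correlation, which facilitates more accurate grounding.
Experimental results suggest that, on the long-form video datasets MAD and Ego4d, the method significantly outperforms state-of-the-art methods and achieves \textbf{14.6$\times$} / \textbf{102.8$\times$} higher efficiency, respectively.
The project can be found at \url{https://github.com/afcedf/SOONet.git}.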
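
The coarse-to-fine idea in the abstract can be illustrated with a toy sketch. This is not the SOONet implementation: the mean pooling, the dot-product scoring, the `top_k` parameter, and every function name below are assumptions made for illustration, standing in for the learned networks the paper describes. The sketch only shows the control flow: score non-overlapping anchors once, then look at frame-level detail only inside the best-scoring anchors.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product, used here as a stand-in similarity score (assumption)."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(frames: Sequence[Sequence[float]]) -> List[float]:
    """Average the frame features of one anchor into a single context vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def coarse_to_fine_ground(
    frame_feats: Sequence[Sequence[float]],
    query: Sequence[float],
    anchor_len: int,
    top_k: int = 1,
) -> Tuple[Tuple[int, int], int]:
    """Return the (start, end) range of the selected anchor and its best frame index."""
    if not frame_feats or anchor_len < 1 or top_k < 1:
        raise ValueError("need at least one frame, anchor_len >= 1 and top_k >= 1")

    # Coarse stage: split the video into non-overlapping anchors and score
    # each anchor once against the query using its pooled feature.
    anchors = [
        (start, min(start + anchor_len, len(frame_feats)))
        for start in range(0, len(frame_feats), anchor_len)
    ]
    anchor_scores = [dot(mean_pool(frame_feats[s:e]), query) for s, e in anchors]

    # Keep only the anchors that respond most strongly to the query.
    ranked = sorted(
        range(len(anchors)), key=lambda i: anchor_scores[i], reverse=True
    )[:top_k]

    # Fine stage: inspect frame-level content only inside the kept anchors.
    best_anchor, best_frame, best_score = ranked[0], anchors[ranked[0]][0], float("-inf")
    for i in ranked:
        start, end = anchors[i]
        for t in range(start, end):
            score = dot(frame_feats[t], query)
            if score > best_score:
                best_anchor, best_frame, best_score = i, t, score

    return anchors[best_anchor], best_frame
```

With eight toy frames, `anchor_len=4`, and a query aligned with the second half of the video, the function selects the anchor `(4, 8)` and then refines to a single frame inside it. Every frame is pooled exactly once in the coarse stage, which is the contrast with a sliding window that revisits overlapping frames.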

Abstract (original)

Video temporal grounding aims to pinpoint a video segment that matches the query description. Despite the recent advance in short-form videos (\textit{e.g.}, in minutes), temporal grounding in long videos (\textit{e.g.}, in hours) is still at its early stage. To address this challenge, a common practice is to employ a sliding window, yet can be inefficient and inflexible due to the limited number of frames within the window. In this work, we propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with \textbf{one-time} network execution. Our pipeline is formulated in a coarse-to-fine manner, where we first extract context knowledge from non-overlapped video clips (\textit{i.e.}, anchors), and then supplement the anchors that highly response to the query with detailed content knowledge. Besides the remarkably high pipeline efficiency, another advantage of our approach is the capability of capturing long-range temporal correlation, thanks to modeling the entire video as a whole, and hence facilitates more accurate grounding. Experimental results suggest that, on the long-form video datasets MAD and Ego4d, our method significantly outperforms state-of-the-arts, and achieves \textbf{14.6$\times$} / \textbf{102.8$\times$} higher efficiency respectively. Project can be found at \url{https://github.com/afcedf/SOONet.git}.

arXiv information

Authors	Yulin Pan, Xiangteng He, Biao Gong, Yiliang Lv, Yujun Shen, Yuxin Peng, Deli Zhao
Published	2023-03-22 12:41:03+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
