SnAG: Scalable and Accurate Video Grounding

Summary

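Temporally grounding text descriptions in videos is a central problem in vision-language learning and video understanding.
Existing methods often prioritize accuracy over scalability: they are optimized to ground only a handful of text queries in short videos and do not scale to long videos with hundreds of queries.
In this paper, we study how cross-modal fusion affects the scalability of video grounding models.
Our analysis establishes late fusion as the more cost-effective fusion scheme for long-form videos with many text queries.
It also leads us to a novel, video-centric sampling scheme for efficient training.
Building on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding.
Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state-of-the-art method for long-form video grounding on the challenging MAD dataset, while also achieving highly competitive results on short videos.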

Summary (original)

Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability — they have been optimized for grounding only a few text queries within short videos, and fail to scale up to long videos with hundreds of queries. In this paper, we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover, it leads us to a novel, video-centric sampling scheme for efficient training. Based on these findings, we present SnAG, a simple baseline for scalable and accurate video grounding. Without bells and whistles, SnAG is 43% more accurate and 1.5x faster than CONE, a state of the art for long-form video grounding on the challenging MAD dataset, while achieving highly competitive results on short videos.
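
The cost argument for late fusion can be illustrated with a toy model. The sketch below is not the paper's analysis: it assumes, purely for illustration, that an early-fusion model must rerun its video backbone for every query, while a late-fusion model encodes the video once and runs only a lightweight fusion step per query. The function names and cost units are hypothetical.

```python
def early_fusion_cost(num_queries: int, video_cost: float, fusion_cost: float) -> float:
    """Toy cost when the video backbone depends on the query.

    Assumption: the video must be re-encoded once per query.
    """
    return num_queries * (video_cost + fusion_cost)


def late_fusion_cost(num_queries: int, video_cost: float, fusion_cost: float) -> float:
    """Toy cost when the video is encoded independently of the query.

    Assumption: the video is encoded once and shared by all queries;
    only the fusion step is repeated per query.
    """
    return video_cost + num_queries * fusion_cost


if __name__ == "__main__":
    # Hypothetical numbers: one long video, 200 queries,
    # video encoding 100x more expensive than a single fusion step.
    queries, video, fusion = 200, 100.0, 1.0
    print("early fusion:", early_fusion_cost(queries, video, fusion))
    print("late fusion: ", late_fusion_cost(queries, video, fusion))
```

Under these assumptions the two schemes cost the same for a single query, and the gap widens linearly as the number of queries per video grows, which is the regime the abstract describes.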

arXiv information

Authors	Fangzhou Mu, Sicheng Mo, Yin Li
Published	2024-04-05 17:02:31+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
