Contextual Self-paced Learning for Weakly Supervised Spatio-Temporal Video Grounding

Summary

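In this work, we focus on weakly supervised spatio-temporal video grounding (WSTVG).
This is a multimodal task that aims to localize specific subjects spatio-temporally based on textual queries, without bounding-box supervision.
Motivated by recent advances in multimodal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG.
Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and difficulty adapting to challenging scenarios.
We propose CoSPaL (Contextual Self-Paced Learning), a novel approach designed to overcome these limitations.
CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets;
(2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and
(3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.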

Abstract (original)

In this work, we focus on Weakly Supervised Spatio-Temporal Video Grounding (WSTVG). It is a multimodal task aimed at localizing specific subjects spatio-temporally based on textual queries without bounding box supervision. Motivated by recent advancements in multi-modal foundation models for grounding tasks, we first explore the potential of state-of-the-art object detection models for WSTVG. Despite their robust zero-shot capabilities, our adaptation reveals significant limitations, including inconsistent temporal predictions, inadequate understanding of complex queries, and challenges in adapting to difficult scenarios. We propose CoSPaL (Contextual Self-Paced Learning), a novel approach which is designed to overcome these limitations. CoSPaL integrates three core components: (1) Tubelet Phrase Grounding (TPG), which introduces spatio-temporal prediction by linking textual queries to tubelets; (2) Contextual Referral Grounding (CRG), which improves comprehension of complex queries by extracting contextual information to refine object identification over time; and (3) Self-Paced Scene Understanding (SPS), a training paradigm that progressively increases task difficulty, enabling the model to adapt to complex scenarios by transitioning from coarse to fine-grained understanding.

arXiv information

Authors	Akash Kumar, Zsolt Kira, Yogesh Singh Rawat
Published	2025-01-28 16:25:10+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
