Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

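Summary

Video Semantic Role Labeling (VidSRL) aims to detect the salient events in a given video by recognizing predicate-argument event structures and the interrelationships between events.
Methods for VidSRL have been proposed recently, but most share two key drawbacks: a lack of fine-grained spatial scene perception and insufficient modeling of video temporality.
To address this, this work explores a novel holistic spatio-temporal scene graph (HostSG) representation, built on existing dynamic scene graph structures, that models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL.
Building on HostSG, we present a niche-targeting VidSRL framework.
A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, yielding an overall hierarchical scene-event graph structure (termed ICE).
We then perform iterative structure refinement to optimize the ICE graph, so that the overall structure representation best matches the demands of the end task.
Finally, the three VidSRL subtask predictions are decoded jointly, and this end-to-end paradigm effectively avoids error propagation.
On the benchmark dataset, our framework improves significantly over the current best-performing model.
Further analyses are presented to give a better understanding of the advances of our method.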
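
As a rough illustration of the general idea of a spatio-temporal scene graph (per-frame object relations plus links across frames), here is a minimal Python sketch. It is not the authors' HostSG implementation: the class names `Node` and `SpatioTemporalSceneGraph`, the `"next"` edge label, and the rule of linking same-label objects in consecutive frames are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """Hypothetical object node: one detected object in one frame."""
    frame: int  # index of the video frame the object appears in
    label: str  # object category, e.g. "person"


@dataclass
class SpatioTemporalSceneGraph:
    """Hypothetical container; each edge is (source, relation, target)."""
    spatial_edges: list = field(default_factory=list)
    temporal_edges: list = field(default_factory=list)

    def add_frame(self, frame, triples):
        """Add one frame's scene graph from (subject, relation, object) labels."""
        for subj, rel, obj in triples:
            self.spatial_edges.append((Node(frame, subj), rel, Node(frame, obj)))

    def link_frames(self):
        """Link nodes sharing a label in consecutive frames.

        This is a naive stand-in for object tracking (an assumption).
        """
        nodes = {n for s, _, o in self.spatial_edges for n in (s, o)}
        self.temporal_edges = []
        for node in sorted(nodes, key=lambda n: (n.frame, n.label)):
            nxt = Node(node.frame + 1, node.label)
            if nxt in nodes:
                self.temporal_edges.append((node, "next", nxt))


def build_example():
    """Build a two-frame toy graph: a person holds a cup, then drinks from it."""
    graph = SpatioTemporalSceneGraph()
    graph.add_frame(0, [("person", "holds", "cup")])
    graph.add_frame(1, [("person", "drinks_from", "cup")])
    graph.link_frames()
    return graph
```

In this toy graph the spatial edges carry the within-frame relations, while the temporal edges record that the same person and cup persist from frame 0 to frame 1.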

Abstract (original)

Video Semantic Role Labeling (VidSRL) aims to detect the salient events from given videos, by recognizing the predicate-argument event structures and the interrelationships between events. While recent endeavors have put forth methods for VidSRL, most are subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. Towards this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation based on the existing dynamic scene graph structures, which models well both the fine-grained spatial semantics and temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, such that the overall structure representation can best coincide with end task demand. Finally, three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework boosts significantly over the current best-performing model. Further analyses are shown for a better understanding of the advances of our methods.

arXiv information

Authors	Yu Zhao, Hao Fei, Yixin Cao, Bobo Li, Meishan Zhang, Jianguo Wei, Min Zhang, Tat-Seng Chua
Published	2023-08-09 17:20:14+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
