Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

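Summary

Temporal sentence grounding (TSG) aims to identify the temporal boundaries of a specific segment in an untrimmed video, given a sentence query.
All existing works first use a sparse sampling strategy to extract a fixed number of video frames, and then perform multi-modal interaction with the query sentence for reasoning.
However, we argue that these methods overlook two essential issues.
1) Boundary bias: the annotated target segment generally refers to two specific frames as its start and end timestamps. The video downsampling process may lose these two frames and take adjacent, irrelevant frames as the new boundaries.
2) Reasoning bias: such incorrect new boundary frames also introduce bias into the reasoning during frame-query interaction, reducing the model's generalization ability.
To alleviate these limitations, this paper proposes a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism that generates additional contextual frames to enrich and refine the new boundaries.
Specifically, a reasoning strategy is developed to learn the inter-relationships among these frames and to generate soft labels on the boundaries for more accurate frame-query reasoning.
This mechanism can also supplement the sampled sparse frames with the consecutive visual semantics they lack, supporting fine-grained activity understanding.
Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.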
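
The boundary bias described above can be illustrated with a small sketch. This is not the paper's code: it assumes uniform sparse sampling and assumes that each annotated boundary is replaced by its nearest sampled frame, and the function names `uniform_sample_indices` and `nearest_sampled_boundaries` as well as the frame counts are hypothetical choices made only for illustration.

```python
def uniform_sample_indices(num_frames, num_samples):
    """Pick num_samples frame indices spread evenly over [0, num_frames - 1].

    Assumption: plain uniform sampling; the sampler used in the paper may differ.
    """
    if num_samples == 1:
        return [0]
    step = (num_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def nearest_sampled_boundaries(sampled, start, end):
    """Map annotated start/end frames to the nearest sampled frames.

    Assumption: after downsampling, the closest surviving frame stands in
    for each original boundary frame.
    """
    new_start = min(sampled, key=lambda f: abs(f - start))
    new_end = min(sampled, key=lambda f: abs(f - end))
    return new_start, new_end


# Hypothetical video: 1000 frames, 8 sampled frames, target segment [200, 500].
sampled = uniform_sample_indices(1000, 8)
new_bounds = nearest_sampled_boundaries(sampled, start=200, end=500)
print(sampled)     # [0, 143, 285, 428, 571, 714, 856, 999]
print(new_bounds)  # (143, 571)
```

In this toy case neither annotated boundary frame survives sampling, and both substitute boundaries (143 and 571) lie outside the annotated segment, which is the kind of shift the abstract calls boundary bias.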

Summary (original)

Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.

arXiv information

Authors	Jiahao Zhu, Daizong Liu, Pan Zhou, Xing Di, Yu Cheng, Song Yang, Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, Zeyu Xiong
Published	2023-01-02 03:38:22+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
