Commonsense for Zero-Shot Natural Language Video Localization

Summary

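Zero-shot natural language video localization (NLVL) methods have shown promising results in training NLVL models on raw video data alone by dynamically generating video segments and pseudo-query annotations.
However, existing pseudo-queries are often not grounded in the source video, which results in unstructured and disjointed content.
This paper investigates the effectiveness of commonsense reasoning in zero-shot NLVL.
Specifically, it presents CORONET, a zero-shot NLVL framework that uses commonsense to bridge the gap between videos and generated pseudo-queries through a commonsense enhancement module.
CORONET uses Graph Convolution Networks (GCN) to encode commonsense information that is extracted from a knowledge graph and conditioned on the video, and it uses cross-attention mechanisms to enhance the encoded video and pseudo-query representations before localization.
In empirical evaluations on two benchmark datasets, CORONET surpasses both zero-shot and weakly supervised baselines, with improvements of up to 32.13% across various recall thresholds and up to 6.33% in mIoU.
These results underscore the importance of leveraging commonsense reasoning for zero-shot NLVL.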
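
The commonsense enhancement step described above can be pictured with a small, dependency-free sketch. It is not the authors' implementation: the abstract does not give the GCN normalization, the number of layers, or the attention details, so the symmetric normalization, the single layer, the residual connection, and the helper names (`matmul`, `gcn_layer`, `cross_attention`) are all assumptions chosen for illustration.

```python
import math

def matmul(a, b):
    """Plain matrix product of two row-major nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W).

    The symmetric normalization with self-loops is an assumption; the
    abstract only says that a GCN encodes the commonsense graph.
    """
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Scaled dot-product attention from `queries` over `keys_values`.

    The residual addition at the end is an illustrative assumption.
    """
    d = len(queries[0])
    scores = matmul(queries, [list(col) for col in zip(*keys_values)])
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, keys_values)
    return [[q + a for q, a in zip(q_row, a_row)] for q_row, a_row in zip(queries, attended)]

# Hypothetical toy inputs: a 3-node commonsense graph and 2 video clip features.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
node_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity_w = [[1.0, 0.0], [0.0, 1.0]]
video_feats = [[0.2, 0.4], [0.6, 0.1]]

commonsense = gcn_layer(adjacency, node_feats, identity_w)
enhanced_video = cross_attention(video_feats, commonsense)
```

In this sketch the video features act as attention queries over the encoded graph nodes; the same function could be applied to pseudo-query features, which matches the abstract's statement that both representations are enhanced before localization.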

Summary (original)

Zero-shot Natural Language-Video Localization (NLVL) methods have exhibited promising results in training NLVL models exclusively with raw video data by dynamically generating video segments and pseudo-query annotations. However, existing pseudo-queries often lack grounding in the source video, resulting in unstructured and disjointed content. In this paper, we investigate the effectiveness of commonsense reasoning in zero-shot NLVL. Specifically, we present CORONET, a zero-shot NLVL framework that leverages commonsense to bridge the gap between videos and generated pseudo-queries via a commonsense enhancement module. CORONET employs Graph Convolution Networks (GCN) to encode commonsense information extracted from a knowledge graph, conditioned on the video, and cross-attention mechanisms to enhance the encoded video and pseudo-query representations prior to localization. Through empirical evaluations on two benchmark datasets, we demonstrate that CORONET surpasses both zero-shot and weakly supervised baselines, achieving improvements up to 32.13% across various recall thresholds and up to 6.33% in mIoU. These results underscore the significance of leveraging commonsense reasoning for zero-shot NLVL.
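
The abstract reports gains in recall at several thresholds and in mIoU without defining them. A minimal sketch of how these metrics are commonly computed for temporal localization follows, assuming each prediction and ground truth is a `(start, end)` segment in seconds and that a prediction counts as a hit when its temporal IoU is at least the threshold. The function names and the `>=` convention are assumptions, not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, threshold):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(iou >= threshold for iou in ious) / len(ious)

def mean_iou(preds, gts):
    """Average temporal IoU over all prediction/ground-truth pairs (mIoU)."""
    ious = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(ious) / len(ious)

# Hypothetical example: one perfect prediction and one partial overlap.
predictions = [(0.0, 10.0), (5.0, 15.0)]
ground_truth = [(0.0, 10.0), (10.0, 20.0)]

print(recall_at_iou(predictions, ground_truth, 0.5))
print(mean_iou(predictions, ground_truth))
```

Here the second pair overlaps for 5 seconds out of a 15-second union, so its IoU is 1/3 and it fails a 0.5 threshold while still contributing to the mIoU.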

arXiv information

Authors	Meghana Holla, Ismini Lourentzou
Published	2023-12-29 01:42:43+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
