Video Joint Modelling Based on Hierarchical Transformer for Co-summarization

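Summary

Video summarization aims to automatically generate a summary of a video, either a storyboard or a video skim, which makes large-scale video retrieval and browsing easier.
Most existing methods summarize each video in isolation and therefore ignore the correlations among similar videos.
Such correlations, however, are also informative for video understanding and video summarization.
To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes the semantic dependencies across videos into account.
Specifically, VJMHT consists of two Transformer layers: the first extracts semantic representations from the individual shots of similar videos, and the second performs shot-level video joint modelling to aggregate cross-video semantic information.
In this way, complete cross-video high-level patterns are explicitly modelled and learned for summarizing individual videos.
In addition, a Transformer-based video representation reconstruction is introduced to maximize the high-level similarity between the summary and the original video.
Extensive experiments verify the effectiveness of the proposed modules and the superiority of VJMHT in terms of F-measure and rank-based evaluation.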
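
The two-level data flow described in the summary can be illustrated with a small sketch. This is not the authors' implementation: it assumes single-head scaled dot-product self-attention with identity projections (no learned weights), mean pooling of frames into a shot vector, and hypothetical helper names (`encode_shot`, `joint_model`). It only shows how shot representations from several similar videos could be attended over jointly and then handed back per video.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    Assumption: a real Transformer layer would add learned projections,
    multiple heads, residual connections and a feed-forward block.
    """
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in vectors
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        )
    return outputs


def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector."""
    count = len(vectors)
    return [sum(v[i] for v in vectors) / count for i in range(len(vectors[0]))]


def encode_shot(frames):
    """Level 1 (hypothetical): attend over the frames of one shot, then pool."""
    return mean_pool(self_attention(frames))


def joint_model(videos):
    """Level 2 (hypothetical): attend over the shots of all similar videos.

    videos: list of videos, each a list of shots, each a list of frame vectors.
    Returns one list of context-aware shot vectors per input video.
    """
    shot_vectors, owners = [], []
    for video_index, shots in enumerate(videos):
        for frames in shots:
            shot_vectors.append(encode_shot(frames))
            owners.append(video_index)

    contextual = self_attention(shot_vectors)  # cross-video aggregation

    per_video = [[] for _ in videos]
    for owner, vector in zip(owners, contextual):
        per_video[owner].append(vector)
    return per_video


# Toy input: two similar videos with 2-dimensional frame features.
video_a = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0]]]  # two shots
video_b = [[[0.9, 0.1], [1.0, 0.0]]]                # one shot
joint_shots = joint_model([video_a, video_b])
```

In this sketch each shot vector of `video_a` is influenced by the shot of `video_b` as well as by its own neighbours, which is the cross-video aggregation idea; a per-shot importance score would then be predicted from these vectors by a component not shown here.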

Summary (original)

Video summarization aims to automatically generate a summary (storyboard or video skim) of a video, which can facilitate large-scale video retrieval and browsing. Most of the existing methods perform video summarization on individual videos, which neglects the correlations among similar videos. Such correlations, however, are also informative for video understanding and video summarization. To address this limitation, we propose Video Joint Modelling based on Hierarchical Transformer (VJMHT) for co-summarization, which takes into consideration the semantic dependencies across videos. Specifically, VJMHT consists of two layers of Transformer: the first layer extracts semantic representation from individual shots of similar videos, while the second layer performs shot-level video joint modelling to aggregate cross-video semantic information. By this means, complete cross-video high-level patterns are explicitly modelled and learned for the summarization of individual videos. Moreover, Transformer-based video representation reconstruction is introduced to maximize the high-level similarity between the summary and the original video. Extensive experiments are conducted to verify the effectiveness of the proposed modules and the superiority of VJMHT in terms of F-measure and rank-based evaluation.
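
The abstract reports results in terms of F-measure without spelling out the formula. A minimal sketch follows, assuming the overlap-based definition commonly used for video summarization (precision and recall of the selected frames against a reference summary); the function name `f_measure` and the binary-mask input format are illustrative choices, not taken from the paper.

```python
def f_measure(predicted, reference):
    """Overlap-based F-measure between two binary frame-selection masks.

    predicted, reference: equal-length sequences of 0/1 flags, where 1 means
    the frame is included in the summary.
    """
    if len(predicted) != len(reference):
        raise ValueError("masks must cover the same number of frames")

    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    if overlap == 0:
        return 0.0  # also covers empty summaries, avoiding division by zero

    precision = overlap / sum(1 for p in predicted if p)
    recall = overlap / sum(1 for r in reference if r)
    return 2 * precision * recall / (precision + recall)


# Toy example: 3 frames selected in each summary, 2 of them shared.
score = f_measure([1, 1, 0, 0, 1, 0], [0, 1, 1, 0, 1, 0])
```

Here precision and recall are both 2/3, so `score` is 2/3 as well. Rank-based evaluation, also mentioned in the abstract, compares predicted importance scores with human annotations through rank correlation and is not covered by this sketch.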

arXiv information

Authors	Li Haopeng, Ke Qiuhong, Gong Mingming, Zhang Rui
Published	2022-06-29 06:42:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
