Multi-modal Video Chapter Generation

Summary

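Chapter generation has become a practical technique for online videos.
Chapter breakpoints let users quickly find the parts they want and provide summarizing annotations.
However, there is no public method or dataset for this task.
To facilitate research in this direction, we introduce a new dataset called Chapter-Gen, which consists of approximately 10,000 user-generated videos with annotated chapter information.
Our data collection procedure is fast and scalable, and requires no additional manual annotation.
On top of this dataset, we design an effective baseline specifically for the video chapter generation task.
It captures two aspects of a video: visual dynamics and narration text.
It disentangles local and global video features for localization and title generation, respectively.
To parse long videos efficiently, a skip sliding window mechanism is designed to localize potential chapters.
In addition, a cross-attention multi-modal fusion module is developed to aggregate local features for title generation.
Our experiments demonstrate that the proposed framework achieves superior results over existing methods, and show that method designs for similar tasks cannot be transferred directly, even after fine-tuning.
Code and dataset are available at https://github.com/czt117/MVCG.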
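
The summary above names a skip sliding window mechanism for localizing chapters but gives no details. The sketch below is one plausible reading, not the authors' implementation: a fixed-length window steps along the timeline, and when a boundary is detected the window jumps ahead by a larger amount. The names `skip_sliding_window`, `score_fn`, `toy_score`, and all parameter values are hypothetical and chosen only for illustration.

```python
from typing import Callable, List


def skip_sliding_window(
    duration: float,
    window: float,
    stride: float,
    skip: float,
    score_fn: Callable[[float, float], float],
    threshold: float = 0.5,
) -> List[float]:
    """Return candidate chapter boundary times (seconds) for one video.

    Assumed behaviour, not taken from the paper: the window advances by
    `stride` until `score_fn` flags a boundary, then jumps ahead by `skip`.
    """
    if min(window, stride, skip) <= 0:
        raise ValueError("window, stride and skip must be positive")
    boundaries: List[float] = []
    start = 0.0
    while start + window <= duration:
        if score_fn(start, start + window) >= threshold:
            boundaries.append(start + window / 2)  # report the window centre
            start += skip      # jump past the region just assigned a boundary
        else:
            start += stride    # ordinary sliding step
    return boundaries


def toy_score(lo: float, hi: float) -> float:
    """Hypothetical scorer: 1.0 if a known cut lies in the window's middle half."""
    cuts = (60.0, 150.0)
    quarter = (hi - lo) / 4
    return 1.0 if any(lo + quarter <= c <= hi - quarter for c in cuts) else 0.0


candidates = skip_sliding_window(
    duration=240.0, window=20.0, stride=5.0, skip=30.0, score_fn=toy_score
)
print(candidates)
```

In a real system `score_fn` would be a learned model over the visual and narration features of the window; the skip step is what keeps the number of scored windows small on long videos.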

Summary (original)

Chapter generation has become a practical technique for online videos. Chapter breakpoints enable users to quickly find the parts they want and get summative annotations. However, there is no public method or dataset for this task. To facilitate research in this direction, we introduce a new dataset called Chapter-Gen, which consists of approximately 10k user-generated videos with annotated chapter information. Our data collection procedure is fast and scalable, and does not require any additional manual annotation. On top of this dataset, we design an effective baseline specifically for the video chapter generation task, which captures two aspects of a video: visual dynamics and narration text. It disentangles local and global video features for localization and title generation, respectively. To parse long videos efficiently, a skip sliding window mechanism is designed to localize potential chapters, and a cross-attention multi-modal fusion module is developed to aggregate local features for title generation. Our experiments demonstrate that the proposed framework achieves superior results over existing methods, which illustrates that method designs for similar tasks cannot be transferred directly, even after fine-tuning. Code and dataset are available at https://github.com/czt117/MVCG.
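
The abstract says only that a cross-attention multi-modal fusion module aggregates local features for title generation. As a rough illustration of the general technique, the sketch below runs plain scaled dot-product cross-attention in which narration-text features act as queries over visual features. The choice of which modality queries which, the absence of learned projections, and the names `cross_attention`, `text_feats`, and `visual_feats` are all assumptions and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(
    queries: List[Vector], keys: List[Vector], values: List[Vector]
) -> List[Vector]:
    """Scaled dot-product cross-attention without learned projections.

    Each query (assumed here: a text feature) is replaced by a weighted
    average of the values (assumed here: visual features).
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    fused: List[Vector] = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return fused


# Toy features: 2 narration-text tokens attend over 3 visual frames.
text_feats = [[1.0, 0.0], [0.0, 1.0]]
visual_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(text_feats, visual_feats, visual_feats)
print(fused)
```

Each fused row has the same width as the visual features and is a convex combination of them, so a downstream title decoder would see visual content re-weighted by its relevance to each text token.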

arXiv information

Authors	Xiao Cao, Zitan Chen, Canyu Le, Lei Meng
Published	2022-09-26 13:44:48+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
