VideoXum: Cross-modal Visual and Textural Summarization of Videos

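Summary

Title: VideoXum: Cross-modal Visual and Textual Summarization of Videos

Summary:

– Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative.
– Earlier work proposed different methods depending on whether the output is video or text, and did not exploit the correlation between the two semantically related tasks of visual summarization and textual summarization.
– The authors propose a new joint video and text summarization task. The goal is to generate, from a long video, both a shortened video clip and the corresponding textual summary, collectively referred to as a cross-modal summary.
– To this end, they first build VideoXum, a large-scale human-annotated dataset.
– They design a new end-to-end model, VTSUM-BILP, to address the challenges of the proposed task, and propose a new metric, VT-CLIPScore, to help evaluate the semantic consistency of cross-modal summaries.
– The proposed model achieves promising performance on the new task and establishes a benchmark for future research.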

Abstract (original)

Video summarization aims to distill the most important information from a source video to produce either an abridged clip or a textual narrative. Traditionally, different methods have been proposed depending on whether the output is a video or text, thus ignoring the correlation between the two semantically related tasks of visual summarization and textual summarization. We propose a new joint video and text summarization task. The goal is to generate both a shortened video clip along with the corresponding textual summary from a long video, collectively referred to as a cross-modal summary. The generated shortened video clip and text narratives should be semantically well aligned. To this end, we first build a large-scale human-annotated dataset — VideoXum (X refers to different modalities). The dataset is reannotated based on ActivityNet. After we filter out the videos that do not meet the length requirements, 14,001 long videos remain in our new dataset. Each video in our reannotated dataset has human-annotated video summaries and the corresponding narrative summaries. We then design a novel end-to-end model — VTSUM-BILP to address the challenges of our proposed task. Moreover, we propose a new metric called VT-CLIPScore to help evaluate the semantic consistency of cross-modality summary. The proposed model achieves promising performance on this new task and establishes a benchmark for future research.
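The abstract names VT-CLIPScore but does not define it. As a rough illustration only, and not the authors' metric, the sketch below scores how well a video summary and a text summary agree by averaging the cosine similarity between each selected frame's embedding and the text embedding. The function names `cosine` and `consistency_score`, and the assumption that embeddings have already been computed by some shared image-text encoder, are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consistency_score(frame_embeddings, text_embedding):
    """Mean frame-to-text cosine similarity.

    Assumption: this is a generic CLIPScore-style measure, not the
    VT-CLIPScore definition, which the abstract does not give.
    """
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    total = sum(cosine(frame, text_embedding) for frame in frame_embeddings)
    return total / len(frame_embeddings)


# Toy example with hand-made 2-D vectors standing in for real embeddings.
frames = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
score = consistency_score(frames, text)
```

In this toy case one frame points in the same direction as the text vector and the other is orthogonal to it, so the score lands halfway between perfect agreement and none. A real evaluation would replace the hand-made vectors with embeddings from a pretrained vision-language model.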

arXiv information

Authors	Jingyang Lin, Hang Hua, Ming Chen, Yikang Li, Jenhao Hsiao, Chiuman Ho, Jiebo Luo
Published	2023-04-06 18:48:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI
