A Reinforcement Learning-Based Automatic Video Editing Method Using Pre-trained Vision-Language Model

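Abstract

In this era of video, automatic video editing techniques are attracting growing attention from industry and academia because they can reduce workloads and lower the requirements for human editors.
Existing automatic editing systems are mainly scene- or event-specific, such as soccer game broadcasting, whereas automatic systems for general editing, such as movie or vlog editing that covers various scenes and events, have rarely been studied.
Converting an event-driven editing method to general scenes is nontrivial.
This paper proposes a two-stage scheme for general editing.
First, unlike previous work that extracts scene-specific features, the scheme uses a pre-trained Vision-Language Model (VLM) to extract editing-relevant representations as the editing context.
Second, to close the gap between professional-looking videos and automatic productions generated with simple guidelines, it proposes a Reinforcement Learning (RL)-based editing framework that formulates the editing problem and trains a virtual editor to make better sequential editing decisions.
Finally, the proposed method is evaluated on a more general editing task with a real movie dataset.
The experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of the RL-based editing framework.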

Abstract (original)

In this era of videos, automatic video editing techniques attract more and more attention from industry and academia since they can reduce workloads and lower the requirements for human editors. Existing automatic editing systems are mainly scene- or event-specific, e.g., soccer game broadcasting, yet the automatic systems for general editing, e.g., movie or vlog editing which covers various scenes and events, were rarely studied before, and converting the event-driven editing method to a general scene is nontrivial. In this paper, we propose a two-stage scheme for general editing. Firstly, unlike previous works that extract scene-specific features, we leverage the pre-trained Vision-Language Model (VLM) to extract the editing-relevant representations as editing context. Moreover, to close the gap between the professional-looking videos and the automatic productions generated with simple guidelines, we propose a Reinforcement Learning (RL)-based editing framework to formulate the editing problem and train the virtual editor to make better sequential editing decisions. Finally, we evaluate the proposed method on a more general editing task with a real movie dataset. Experimental results demonstrate the effectiveness and benefits of the proposed context representation and the learning ability of our RL-based editing framework.

arXiv information

Authors	Panwen Hu, Nan Xiao, Feifei Li, Yongquan Chen, Rui Huang
Published	2024-11-07 18:20:28+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
