ShotAdapter: Text-to-Multi-Shot Video Generation with Diffusion Models

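Abstract

Current diffusion-based text-to-video methods are limited to producing short, single-shot video clips and cannot generate multi-shot videos with discrete transitions in which the same character performs different activities against the same or different backgrounds.
To address this limitation, we propose a framework that comprises a dataset collection pipeline and architectural extensions to video diffusion models, enabling text-to-multi-shot video generation.
Our approach generates a multi-shot video as a single video with full attention across all frames of all shots, which ensures character and background consistency, and lets users control the number, duration, and content of shots through shot-specific conditioning.
This is achieved by incorporating a transition token into the text-to-video model to control the frames at which a new shot begins, together with a local attention masking strategy that controls the transition token's effect and enables shot-specific prompting.
To obtain training data, we propose a new data collection pipeline that constructs a multi-shot video dataset from existing single-shot video datasets.
Extensive experiments show that fine-tuning a pre-trained text-to-video model for only a few thousand iterations is enough for the model to generate multi-shot videos with shot-specific control and to outperform the baselines.
For more details, see https://shotadapter.github.io/.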

Abstract (original)

Current diffusion-based text-to-video methods are limited to producing short video clips of a single shot and lack the capability to generate multi-shot videos with discrete transitions where the same character performs distinct activities across the same or different backgrounds. To address this limitation we propose a framework that includes a dataset collection pipeline and architectural extensions to video diffusion models to enable text-to-multi-shot video generation. Our approach enables generation of multi-shot videos as a single video with full attention across all frames of all shots, ensuring character and background consistency, and allows users to control the number, duration, and content of shots through shot-specific conditioning. This is achieved by incorporating a transition token into the text-to-video model to control at which frames a new shot begins and a local attention masking strategy which controls the transition token’s effect and allows shot-specific prompting. To obtain training data we propose a novel data collection pipeline to construct a multi-shot video dataset from existing single-shot video datasets. Extensive experiments demonstrate that fine-tuning a pre-trained text-to-video model for a few thousand iterations is enough for the model to subsequently be able to generate multi-shot videos with shot-specific control, outperforming the baselines. You can find more details in https://shotadapter.github.io/
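The abstract names two mechanisms, a transition token and a local attention mask, without giving their exact form. The sketch below is only an illustration under simplifying assumptions that are not taken from the paper: one token per frame, a transition flag set on the first frame of every shot after the first, and a mask that lets each frame attend only to the text tokens of its own shot's prompt. All function names are hypothetical.

```python
from itertools import accumulate


def shot_starts(shot_lengths):
    """Frame index at which each shot begins (assumed one token per frame)."""
    return [0] + list(accumulate(shot_lengths))[:-1]


def transition_flags(shot_lengths):
    """1 on the first frame of every shot after the first, 0 elsewhere.

    Assumption: this is one plausible way to mark where a new shot begins;
    the paper's actual transition-token design may differ.
    """
    starts = set(shot_starts(shot_lengths)[1:])
    return [1 if f in starts else 0 for f in range(sum(shot_lengths))]


def frame_to_prompt_mask(shot_lengths, prompt_lengths):
    """mask[f][t] is True when frame f may attend to text token t.

    Each frame sees only the prompt tokens of its own shot (shot-specific
    prompting). Frame-to-frame attention is left unmasked here, matching
    the abstract's "full attention across all frames of all shots".
    """
    if len(shot_lengths) != len(prompt_lengths):
        raise ValueError("need exactly one prompt per shot")
    frame_shot = [s for s, n in enumerate(shot_lengths) for _ in range(n)]
    token_shot = [s for s, n in enumerate(prompt_lengths) for _ in range(n)]
    return [[fs == ts for ts in token_shot] for fs in frame_shot]


if __name__ == "__main__":
    # Two shots of 3 and 2 frames, with prompts of 2 and 1 text tokens.
    print(transition_flags([3, 2]))
    for row in frame_to_prompt_mask([3, 2], [2, 1]):
        print(row)
```

In this toy setup, changing the entries of `shot_lengths` changes the number and duration of shots, and the per-shot prompts control content, which mirrors the kind of user control the abstract describes.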

arXiv information

Authors	Ozgur Kara, Krishna Kumar Singh, Feng Liu, Duygu Ceylan, James M. Rehg, Tobias Hinz
Published	2025-05-12 15:22:28+00:00

Source and services used

arxiv.jp, Google
