Ingredients: Blending Custom Photos with Video Diffusion Transformers

Summary

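This paper introduces Ingredients, a powerful framework that uses video diffusion Transformers to customize video creations by incorporating multiple specific identity (ID) photos. It consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image queries in the video diffusion Transformer; and (iii) an ID router that dynamically combines multiple ID embeddings and allocates them to the corresponding space-time regions. By leveraging a carefully curated text-video dataset and a multi-stage training protocol, Ingredients performs strongly at turning custom photos into dynamic, personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing methods and position it as an important step toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients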
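As a rough illustration of the multi-scale projector in (ii), the following Python sketch assumes it behaves like a query-based resampler: a small set of query vectors cross-attends over face tokens gathered from several scales (here one global token and two local tokens). The function names, the omission of learned projection matrices, and the toy dimensions are all assumptions for illustration; this is not the released Ingredients code.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: List[Vector], keys: List[Vector], values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product cross-attention without learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        weights = softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        # Each output is a weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return out


def multi_scale_project(scale_features: List[List[Vector]], queries: List[Vector]) -> List[Vector]:
    """Hypothetical projector: queries attend over face tokens gathered from every scale."""
    tokens = [tok for scale in scale_features for tok in scale]  # concatenate all scales
    return cross_attend(queries, tokens, tokens)  # keys and values are the face tokens themselves


# Toy example: one hypothetical global face token and two hypothetical local tokens.
global_feats = [[1.0, 0.0]]
local_feats = [[0.0, 1.0], [0.5, 0.5]]
queries = [[1.0, 0.0], [0.0, 1.0]]
projected = multi_scale_project([global_feats, local_feats], queries)
print(projected)
```

Concatenating the scales before attention lets each query draw on coarse identity cues and fine local detail in a single step; a real projector would add learned query, key, and value weights.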

Summary (original)

This paper presents a powerful framework, referred to as Ingredients, for customizing video creations by incorporating multiple specific identity (ID) photos with video diffusion Transformers. Our method consists of three primary modules: (i) a facial extractor that captures versatile and precise facial features for each human ID from both global and local perspectives; (ii) a multi-scale projector that maps face embeddings into the contextual space of image queries in video diffusion Transformers; and (iii) an ID router that dynamically combines and allocates multiple ID embeddings to the corresponding space-time regions. Leveraging a meticulously curated text-video dataset and a multi-stage training protocol, Ingredients demonstrates superior performance in turning custom photos into dynamic and personalized video content. Qualitative evaluations highlight the advantages of the proposed method over existing methods, positioning it as a significant advancement toward more effective generative video control tools in Transformer-based architectures. The data, code, and model weights are publicly available at: https://github.com/feizc/Ingredients
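The ID router in (iii) can be read as a gating problem over space-time tokens. The sketch below assumes top-1 routing in the style of mixture-of-experts layers: a hypothetical learned gate scores every video token against each identity, the highest-scoring ID is assigned to that token, and its embedding is added, scaled by the gate probability. The gate weights, the function name `route_ids`, and the additive injection are assumptions; the router in the paper's repository may be built differently.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_ids(
    video_tokens: List[Vector],   # one feature per space-time position (frames x patches, flattened)
    id_embeddings: List[Vector],  # one projected embedding per reference identity photo
    gate_weights: List[Vector],   # hypothetical learned gate: one scoring vector per ID
) -> Tuple[List[int], List[Vector]]:
    """Top-1 routing: assign each token to one ID and inject that ID's embedding."""
    assignments: List[int] = []
    routed: List[Vector] = []
    for tok in video_tokens:
        scores = [sum(t * g for t, g in zip(tok, gate)) for gate in gate_weights]
        probs = softmax(scores)
        best = max(range(len(probs)), key=lambda i: probs[i])  # highest-probability ID
        assignments.append(best)
        # Scale by the gate probability; in a differentiable framework this lets the gate learn.
        routed.append([t + probs[best] * e for t, e in zip(tok, id_embeddings[best])])
    return assignments, routed


# Toy example: three space-time tokens, two identities.
tokens = [[2.0, 0.0], [0.0, 2.0], [1.0, 0.5]]
ids = [[0.1, 0.1], [-0.1, -0.1]]
gates = [[1.0, 0.0], [0.0, 1.0]]
assignments, routed = route_ids(tokens, ids, gates)
print(assignments)  # [0, 1, 0]
```

Top-1 assignment ties each region to a single identity, which is one way to keep two faces from blending into the same area.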

arXiv information

Authors	Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, Mingyuan Fan
Published	2025-01-03 12:45:22+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, DeepL
