MAGREF: Masked Guidance for Any-Reference Video Generation

Summary

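Video generation has advanced substantially with the emergence of deep generative models, especially diffusion-based approaches.
However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality.
In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a text prompt.
Specifically, we propose (1) a region-aware dynamic masking mechanism that lets a single model flexibly handle inference over various subjects, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features.
Our model delivers state-of-the-art video generation quality, generalizes from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, and outperforms existing open-source and commercial baselines.
To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark.
Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis.
Code and model can be found at: https://github.com/MAGREF-Video/MAGREF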

Summary (original)

Video generation has made substantial strides with the emergence of deep generative models, especially diffusion-based approaches. However, video generation based on multiple reference subjects still faces significant challenges in maintaining multi-subject consistency and ensuring high generation quality. In this paper, we propose MAGREF, a unified framework for any-reference video generation that introduces masked guidance to enable coherent multi-subject video synthesis conditioned on diverse reference images and a textual prompt. Specifically, we propose (1) a region-aware dynamic masking mechanism that enables a single model to flexibly handle various subject inference, including humans, objects, and backgrounds, without architectural changes, and (2) a pixel-wise channel concatenation mechanism that operates on the channel dimension to better preserve appearance features. Our model delivers state-of-the-art video generation quality, generalizing from single-subject training to complex multi-subject scenarios with coherent synthesis and precise control over individual subjects, outperforming existing open-source and commercial baselines. To facilitate evaluation, we also introduce a comprehensive multi-subject video benchmark. Extensive experiments demonstrate the effectiveness of our approach, paving the way for scalable, controllable, and high-fidelity multi-subject video synthesis. Code and model can be found at: https://github.com/MAGREF-Video/MAGREF
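The two mechanisms named in the abstract can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes that reference images are pasted into separate regions of a blank canvas, that a binary mask marks those regions, and that the canvas and mask are then stacked with the noisy input along the channel axis. All function names, tensor shapes, and the use of plain nested lists in (channels, height, width) layout are hypothetical choices made for illustration.

```python
def blank(channels, height, width, fill=0.0):
    """Create a (channels, height, width) tensor as nested lists."""
    return [[[fill] * width for _ in range(height)] for _ in range(channels)]


def paste_references(refs, canvas_h, canvas_w, channels=3):
    """Place each reference image in its own canvas region.

    refs is a list of (image, (top, left)) pairs. Returns the canvas and a
    one-channel binary mask that is 1.0 wherever a reference was pasted.
    This region layout is an assumption, not taken from the paper.
    """
    canvas = blank(channels, canvas_h, canvas_w)
    mask = blank(1, canvas_h, canvas_w)
    for image, (top, left) in refs:
        h, w = len(image[0]), len(image[0][0])
        for y in range(h):
            for x in range(w):
                for c in range(channels):
                    canvas[c][top + y][left + x] = image[c][y][x]
                mask[0][top + y][left + x] = 1.0
    return canvas, mask


def concat_channels(*tensors):
    """Concatenate (C, H, W) tensors along the channel axis."""
    out = []
    for tensor in tensors:
        out.extend(tensor)
    return out


# Hypothetical inputs: a 4-channel noisy frame and two reference subjects.
noisy = blank(4, 4, 8, fill=0.5)
person = blank(3, 2, 2, fill=1.0)
obj = blank(3, 2, 3, fill=2.0)

canvas, mask = paste_references([(person, (0, 0)), (obj, (2, 4))], 4, 8)
model_input = concat_channels(noisy, canvas, mask)

# Channels: 4 (noisy) + 3 (canvas) + 1 (mask)
print(len(model_input), len(model_input[0]), len(model_input[0][0]))
```

Because the number of references only changes the contents of the canvas and mask, not the shape of the stacked input, a layout like this would let one model accept a varying number of subjects without architectural changes, which is the property the abstract emphasizes.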

arXiv information

Authors	Yufan Deng, Xun Guo, Yuanyang Yin, Jacob Zhiyuan Fang, Yiding Yang, Yizhi Wang, Shenghai Yuan, Angtian Wang, Bo Liu, Haibin Huang, Chongyang Ma
Published	2025-05-29 17:58:15+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
