PolyVivid: Vivid Multi-Subject Video Generation with Cross-Modal Interaction and Enhancement

Summary

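Despite recent advances in video generation, existing models still lack fine-grained controllability, particularly for multi-subject customization with consistent identity and interaction.
This paper proposes PolyVivid, a multi-subject video customization framework that enables flexible, identity-consistent generation.
To establish accurate correspondence between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding.
To further strengthen identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings.
In addition, we develop an attention-inherited identity injection module that effectively injects the fused identity features into the video generation process and mitigates identity drift.
Finally, we build an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, which sharpens the distinction between subjects and reduces ambiguity in downstream video generation.
Extensive experiments show that PolyVivid achieves superior identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.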

Summary (original)

Despite recent advances in video generation, existing models still lack fine-grained controllability, especially for multi-subject customization with consistent identity and interaction. In this paper, we propose PolyVivid, a multi-subject video customization framework that enables flexible and identity-consistent generation. To establish accurate correspondences between subject images and textual entities, we design a VLLM-based text-image fusion module that embeds visual identities into the textual space for precise grounding. To further enhance identity preservation and subject interaction, we propose a 3D-RoPE-based enhancement module that enables structured bidirectional fusion between text and image embeddings. Moreover, we develop an attention-inherited identity injection module to effectively inject fused identity features into the video generation process, mitigating identity drift. Finally, we construct an MLLM-based data pipeline that combines MLLM-based grounding, segmentation, and a clique-based subject consolidation strategy to produce high-quality multi-subject data, effectively enhancing subject distinction and reducing ambiguity in downstream video generation. Extensive experiments demonstrate that PolyVivid achieves superior performance in identity fidelity, video realism, and subject alignment, outperforming existing open-source and commercial baselines.

arXiv information

Authors	Teng Hu, Zhentao Yu, Zhengguang Zhou, Jiangning Zhang, Yuan Zhou, Qinglin Lu, Ran Yi
Published	2025-06-09 15:11:09+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
