VPO: Aligning Text-to-Video Generation Models with Prompt Optimization

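Abstract

Video generation models have made remarkable progress on text-to-video tasks.
These models are usually trained on text-video pairs with highly detailed, carefully written descriptions, whereas real user inputs at inference time are often brief, vague, or poorly structured.
This gap makes prompt optimization essential for generating high-quality videos.
Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but they have several limitations: they may distort the user's intent, omit important details, or introduce safety risks.
They also optimize prompts without considering the effect on final video quality.
To address these problems, we introduce VPO, a principled framework that optimizes prompts according to three core principles: harmlessness, accuracy, and helpfulness.
The generated prompts faithfully preserve the user's intent and, more importantly, improve the safety and quality of the generated videos.
To achieve this, VPO uses a two-stage optimization approach.
First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment.
Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning.
Our extensive experiments show that VPO significantly improves safety, alignment, and video quality compared with baseline methods.
VPO also generalizes well across video generation models.
In addition, we show that VPO can outperform RLHF methods for video generation models and can be combined with them, which underscores its effectiveness for aligning video generation models.
Our code and data are publicly available at https://github.com/thu-coai/VPO.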

Abstract (original)

Video generation models have achieved remarkable progress in text-to-video tasks. These models are typically trained on text-video pairs with highly detailed and carefully crafted descriptions, while real-world user inputs during inference are often concise, vague, or poorly structured. This gap makes prompt optimization crucial for generating high-quality videos. Current methods often rely on large language models (LLMs) to refine prompts through in-context learning, but suffer from several limitations: they may distort user intent, omit critical details, or introduce safety risks. Moreover, they optimize prompts without considering the impact on the final video quality, which can lead to suboptimal results. To address these issues, we introduce VPO, a principled framework that optimizes prompts based on three core principles: harmlessness, accuracy, and helpfulness. The generated prompts faithfully preserve user intents and, more importantly, enhance the safety and quality of generated videos. To achieve this, VPO employs a two-stage optimization approach. First, we construct and refine a supervised fine-tuning (SFT) dataset based on principles of safety and alignment. Second, we introduce both text-level and video-level feedback to further optimize the SFT model with preference learning. Our extensive experiments demonstrate that VPO significantly improves safety, alignment, and video quality compared to baseline methods. Moreover, VPO shows strong generalization across video generation models. Furthermore, we demonstrate that VPO could outperform and be combined with RLHF methods on video generation models, underscoring the effectiveness of VPO in aligning video generation models. Our code and data are publicly available at https://github.com/thu-coai/VPO.

arXiv information

Authors	Jiale Cheng, Ruiliang Lyu, Xiaotao Gu, Xiao Liu, Jiazheng Xu, Yida Lu, Jiayan Teng, Zhuoyi Yang, Yuxiao Dong, Jie Tang, Hongning Wang, Minlie Huang
Published	2025-03-26 12:28:20+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
