Inference-Time Text-to-Video Alignment with Diffusion Latent Beam Search

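Summary

Remarkable progress in text-to-video diffusion models enables photorealistic generations, yet the generated videos often contain unnatural movement or deformation, reverse playback, and motionless scenes.
Recently, the alignment problem has attracted considerable attention: the output of a diffusion model is steered according to some measure of how good the content is.
Because there is substantial room to improve perceptual quality along the frame direction, we need to address which metrics to optimize and how to optimize them during video generation.
In this paper, we propose diffusion latent beam search with a lookahead estimator, which selects better diffusion latents at inference time so as to maximize a given alignment reward.
We then point out that improving perceptual video quality while accounting for alignment to the prompt requires reward calibration, done by weighting existing metrics.
When vision-language models are used as a proxy for human evaluators, many earlier metrics for quantifying the naturalness of video do not always correlate with the evaluation, and they also depend on how dynamic the descriptions in the evaluation prompts are.
We demonstrate that our method improves perceptual quality under the calibrated reward without updating model parameters, and that it outputs the best generations compared with greedy search and best-of-N sampling.
We also provide practical guidelines on which axes of the reverse diffusion process (search budget, lookahead steps for reward estimation, and denoising steps) inference-time computation should be allocated to.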
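
As a rough illustration of the idea summarized above, the following toy sketch runs a beam search over 1-D "latents" and scores candidates with a short deterministic lookahead and a weighted-sum reward. It is not the authors' implementation: `denoise_step`, `calibrated_reward`, `lookahead_reward`, `latent_beam_search`, the two toy metrics, and all default weights and step counts are assumptions made for this example, and a real system would replace them with a video diffusion model and actual video-quality metrics.

```python
import random


def calibrated_reward(video, weights):
    """Weighted sum of toy metrics (hypothetical stand-ins for real video metrics).

    `video` is a list of at least two floats, one value per "frame".
    """
    diffs = [b - a for a, b in zip(video, video[1:])]
    metrics = {
        # Penalize large jumps between neighboring frames.
        "smoothness": -sum(d * d for d in diffs) / len(diffs),
        # Reward content that is not static.
        "motion": sum(abs(d) for d in diffs) / len(diffs),
    }
    return sum(weights[name] * value for name, value in metrics.items())


def denoise_step(latent, t, num_steps, rng=None):
    """Toy reverse step: shrink the latent; add noise only when `rng` is given."""
    sigma = 0.5 * t / num_steps
    return [0.9 * x + (rng.gauss(0.0, sigma) if rng else 0.0) for x in latent]


def lookahead_reward(latent, t, num_steps, lookahead, weights):
    """Estimate the reward of a latent at step `t` by denoising a few steps
    ahead deterministically (no noise) and scoring the result."""
    x = latent
    for s in range(t, max(t - lookahead, 0), -1):
        x = denoise_step(x, s, num_steps)
    return calibrated_reward(x, weights)


def latent_beam_search(num_frames=8, num_steps=10, beams=2, candidates=3,
                       lookahead=2, weights=None, seed=0):
    """Keep the `beams` best latents at every reverse step."""
    weights = weights or {"smoothness": 1.0, "motion": 0.5}
    rng = random.Random(seed)
    # Start from `beams` independent noise latents.
    beam = [[rng.gauss(0.0, 1.0) for _ in range(num_frames)]
            for _ in range(beams)]
    for t in range(num_steps, 0, -1):
        # Each beam proposes `candidates` noisy next latents.
        proposals = [denoise_step(x, t, num_steps, rng)
                     for x in beam for _ in range(candidates)]
        # Rank proposals (now at step t - 1) by their lookahead reward.
        proposals.sort(
            key=lambda x: lookahead_reward(x, t - 1, num_steps, lookahead, weights),
            reverse=True,
        )
        beam = proposals[:beams]
    best = max(beam, key=lambda x: calibrated_reward(x, weights))
    return best, calibrated_reward(best, weights)


if __name__ == "__main__":
    best_latent, reward = latent_beam_search(seed=1)
    print(len(best_latent), reward)
```

In this sketch the three knobs mentioned in the summary appear as `beams * candidates` (search budget), `lookahead` (lookahead steps), and `num_steps` (denoising steps). Setting `beams=1` gives a greedy search in this toy setting.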

Abstract (original)

The remarkable progress in text-to-video diffusion models enables photorealistic generations, although the contents of the generated video often include unnatural movement or deformation, reverse playback, and motionless scenes. Recently, an alignment problem has attracted huge attention, where we steer the output of diffusion models based on some quantity on the goodness of the content. Because there is a large room for improvement of perceptual quality along the frame direction, we should address which metrics we should optimize and how we can optimize them in the video generation. In this paper, we propose diffusion latent beam search with lookahead estimator, which can select better diffusion latent to maximize a given alignment reward, at inference time. We then point out that the improvement of perceptual video quality considering the alignment to prompts requires reward calibration by weighting existing metrics. When evaluating outputs by using vision language models as a proxy of humans, many previous metrics to quantify the naturalness of video do not always correlate with evaluation and also depend on the degree of dynamic descriptions in evaluation prompts. We demonstrate that our method improves the perceptual quality based on the calibrated reward, without model parameter update, and outputs the best generation compared to greedy search and best-of-N sampling. We provide practical guidelines on which axes, among search budget, lookahead steps for reward estimate, and denoising steps, in the reverse diffusion process, we should allocate the inference-time computation.

arXiv information

Authors	Yuta Oshima, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta
Published	2025-01-31 16:09:30+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
