ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

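Summary

We present ACDiT, a new Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms to model continuous visual information.
By introducing a block-wise autoregressive unit, ACDiT interpolates flexibly between token-wise autoregression and full-sequence diffusion while avoiding the limitations of discrete tokenization.
The generation of each block is formulated as a conditional diffusion process conditioned on the preceding blocks.
ACDiT is simple to implement: during training it only requires applying a Skip-Causal Attention Mask (SCAM) to a standard diffusion transformer.
During inference, the process alternates between diffusion denoising and autoregressive decoding, which allows full use of the KV-Cache.
On image and video generation tasks, ACDiT performs best among all autoregressive baselines of similar model scale.
We also show that, thanks to its autoregressive modeling, a pretrained ACDiT can be transferred to visual understanding tasks even though it was trained with a diffusion objective.
An analysis of the trade-off between autoregressive modeling and diffusion indicates that ACDiT has potential for long-horizon visual generation tasks.
We hope that ACDiT offers a new perspective on visual autoregressive generation and opens new avenues for unified models.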
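
To make the Skip-Causal Attention Mask more concrete, here is a minimal sketch in plain Python. The summary does not spell out the mask layout, so everything here is an assumption made for illustration: the training sequence is taken to be all clean blocks followed by their noised copies, each clean block attends to itself and earlier clean blocks, and each noised block attends to itself and to strictly earlier clean blocks. The function name `skip_causal_mask` and its parameters are hypothetical.

```python
def skip_causal_mask(num_blocks: int, block_size: int) -> list[list[bool]]:
    """Boolean attention mask for an ASSUMED layout [clean blocks | noised blocks].

    mask[q][k] is True when query position q may attend to key position k.
    """
    n = num_blocks * block_size  # number of tokens in each half
    total = 2 * n
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_noised = q >= n
        q_block = (q % n) // block_size
        for k in range(total):
            k_noised = k >= n
            k_block = (k % n) // block_size
            if not q_noised and not k_noised:
                # clean -> clean: block-wise causal attention
                mask[q][k] = k_block <= q_block
            elif q_noised and not k_noised:
                # noised -> clean: only strictly earlier blocks (the "skip")
                mask[q][k] = k_block < q_block
            elif q_noised and k_noised:
                # noised -> noised: only within the same block
                mask[q][k] = k_block == q_block
            # clean -> noised stays False: clean tokens never see noise
    return mask


if __name__ == "__main__":
    # Print the mask for 3 blocks of 2 tokens; '#' marks an allowed pair.
    for row in skip_causal_mask(num_blocks=3, block_size=2):
        print("".join("#" if allowed else "." for allowed in row))
```

Under these assumptions, a noised block never sees the clean version of itself, so the model has to denoise it from the preceding context alone, which matches the conditional formulation described in the summary.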

Abstract (original)

We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer, that innovatively combines autoregressive and diffusion paradigms for modeling continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement, as simple as creating a Skip-Causal Attention Mask (SCAM) on standard diffusion transformer during training. During inference, the process iterates between diffusion denoising and autoregressive decoding that can make full use of KV-Cache. We show that ACDiT performs best among all autoregressive baselines under similar model scales on image and video generation tasks. We also demonstrate that benefiting from autoregressive modeling, pretrained ACDiT can be transferred in visual understanding tasks despite being trained with the diffusion objective. The analysis of the trade-off between autoregressive modeling and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and unlocks new avenues for unified models.
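
The alternation between diffusion denoising and autoregressive decoding at inference time can be sketched as a simple control loop. This is only an illustration under stated assumptions, not the authors' implementation: `toy_denoise_step` is a hypothetical stand-in for a trained model, the `cache` list merely plays the role of the KV-Cache (a real cache would hold attention keys and values, not raw blocks), and all names and parameters are invented for this example.

```python
import random
from typing import Callable

Block = list[float]


def toy_denoise_step(block: Block, context: list[Block], step: int, num_steps: int) -> Block:
    """Hypothetical stand-in for one denoising step of a trained model.

    Pulls each value toward the mean of the cached context (0.0 if empty).
    A real model would instead predict noise given the cached blocks.
    """
    flat = [v for ctx_block in context for v in ctx_block]
    target = sum(flat) / len(flat) if flat else 0.0
    alpha = 1.0 / (num_steps - step)  # equals 1.0 on the final step
    return [v + alpha * (target - v) for v in block]


def generate(
    num_blocks: int,
    block_size: int,
    num_steps: int,
    denoise_step: Callable[[Block, list[Block], int, int], Block] = toy_denoise_step,
    seed: int = 0,
) -> list[Block]:
    rng = random.Random(seed)
    cache: list[Block] = []  # stands in for the KV-Cache of finished blocks
    for _ in range(num_blocks):
        # Diffusion phase: start the new block from Gaussian noise and
        # denoise it while conditioning on all previously finished blocks.
        block = [rng.gauss(0.0, 1.0) for _ in range(block_size)]
        for step in range(num_steps):
            block = denoise_step(block, cache, step, num_steps)
        # Autoregressive phase: the clean block is appended once and
        # reused as context for every later block.
        cache.append(block)
    return cache


if __name__ == "__main__":
    blocks = generate(num_blocks=4, block_size=3, num_steps=5)
    print(len(blocks), "blocks of", len(blocks[0]), "values each")
```

The point of the sketch is the loop structure: finished blocks are computed once and cached, so only the block currently being denoised needs repeated forward passes.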

arXiv information

Authors	Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun
Published	2025-03-13 16:29:17+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
