No Other Representation Component Is Needed: Diffusion Transformers Can Provide Representation Guidance by Themselves

Abstract

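Recent studies have shown that learning meaningful internal representations can both accelerate generative training and improve the generation quality of diffusion transformers.
However, existing approaches either introduce an external, complex representation training framework or rely on a large-scale pre-trained representation foundation model to provide representation guidance during the original generative training process.
In this study, we posit that the discriminative process inherent to diffusion transformers lets them provide such guidance without any external representation component.
We therefore propose Self-Representation Alignment (SRA), a simple method that obtains representation guidance through self-distillation.
Specifically, SRA aligns the output latent representation of the diffusion transformer at an earlier layer, under higher noise, with its latent representation at a later layer, under lower noise, progressively strengthening overall representation learning using only the generative training process.
Experimental results show that applying SRA to DiTs and SiTs yields consistent performance improvements.
Moreover, SRA not only significantly outperforms approaches that rely on auxiliary, complex representation training frameworks, but also achieves performance comparable to methods that depend heavily on powerful external representation priors.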
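
The alignment idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the distance measure, the loss weight, or how the lower-noise target is produced, so the cosine distance, the `weight` parameter, and every function name below are assumptions made for illustration. The token lists stand in for per-token latent outputs that a real diffusion transformer would produce.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Small epsilon avoids division by zero for all-zero vectors.
    return dot / (norm_u * norm_v + 1e-8)


def self_alignment_loss(student_tokens, teacher_tokens):
    """Mean cosine distance (1 - similarity) over tokens.

    student_tokens: features from an earlier layer on a higher-noise input
                    (hypothetical stand-in values).
    teacher_tokens: features from a later layer on a lower-noise input,
                    treated as a fixed target (assumption).
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same token count")
    distances = [
        1.0 - cosine_similarity(s, t)
        for s, t in zip(student_tokens, teacher_tokens)
    ]
    return sum(distances) / len(distances)


def total_loss(generative_loss, alignment_loss, weight=0.5):
    """Generative objective plus a weighted alignment term (weight is assumed)."""
    return generative_loss + weight * alignment_loss


if __name__ == "__main__":
    # Two tokens with three feature dimensions each; values are made up.
    student = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]
    teacher = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    align = self_alignment_loss(student, teacher)
    print(total_loss(generative_loss=1.25, alignment_loss=align))
```

In this sketch the alignment term is zero when the earlier-layer features already point in the same direction as the later-layer target, so it only adds pressure where the two representations disagree.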

Abstract (original)

Recent studies have demonstrated that learning a meaningful internal representation can both accelerate generative training and enhance the generation quality of diffusion transformers. However, existing approaches necessitate to either introduce an external and complex representation training framework or rely on a large-scale, pre-trained representation foundation model to provide representation guidance during the original generative training process. In this study, we posit that the unique discriminative process inherent to diffusion transformers enables them to offer such guidance without requiring external representation components. We therefore propose Self-Representation Alignment (SRA), a simple yet straightforward method that obtains representation guidance through a self-distillation manner. Specifically, SRA aligns the output latent representation of the diffusion transformer in the earlier layer with higher noise to that in the later layer with lower noise to progressively enhance the overall representation learning during only the generative training process. Experimental results indicate that applying SRA to DiTs and SiTs yields consistent performance improvements. Moreover, SRA not only significantly outperforms approaches relying on auxiliary, complex representation training frameworks but also achieves performance comparable to methods that are heavily dependent on powerful external representation priors.

arXiv information

Authors	Dengyang Jiang, Mengmeng Wang, Liuzhuozheng Li, Lei Zhang, Haoyu Wang, Wei Wei, Guang Dai, Yanning Zhang, Jingdong Wang
Published	2025-05-13 16:45:57+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
