CLEAR: Conv-Like Linearization Revs Pre-Trained Diffusion Transformers Up

Abstract

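Diffusion Transformers (DiT) have become a leading architecture for image generation.
However, the attention mechanism that models token-wise relationships has quadratic complexity, which causes significant latency when generating high-resolution images.
To address this, we seek a linear attention mechanism that reduces the complexity of pre-trained DiTs to linear.
We begin with a comprehensive survey of existing efficient attention mechanisms and identify four factors that are crucial for successfully linearizing pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity.
Based on these insights, we introduce CLEAR, a convolution-like local attention strategy that restricts feature interactions to a local window around each query token and thereby achieves linear complexity.
Our experiments indicate that fine-tuning the attention layers on only 10K self-generated samples for 10K iterations effectively transfers knowledge from a pre-trained DiT to a student model with linear complexity, with results comparable to the teacher model.
At the same time, it reduces attention computation by 99.5% and speeds up the generation of 8K-resolution images by 6.3 times.
We also investigate favorable properties of the distilled attention layers, such as zero-shot generalization across various models and plugins and improved support for multi-GPU parallel inference.
Models and code are available at https://github.com/Huage001/CLEAR.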
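
The idea of restricting each query token to a local window can be illustrated with a small, dependency-free Python sketch. This is not the authors' implementation (the repository linked above contains that); the function names are hypothetical, and the disc-shaped window with an integer `radius` is an assumption made for illustration, since the abstract only states that interactions are limited to a local window around each query.

```python
import math


def local_neighbors(height, width, radius):
    """Map each token index to the token indices inside its local window.

    Tokens live on a height x width grid. The window is a disc of the given
    radius centred on the query token (an assumed shape, for illustration).
    """
    neighbors = []
    for qy in range(height):
        for qx in range(width):
            keep = []
            for ky in range(max(0, qy - radius), min(height, qy + radius + 1)):
                for kx in range(max(0, qx - radius), min(width, qx + radius + 1)):
                    if (ky - qy) ** 2 + (kx - qx) ** 2 <= radius ** 2:
                        keep.append(ky * width + kx)
            neighbors.append(keep)
    return neighbors


def local_attention(queries, keys, values, neighbors):
    """Softmax attention in which each query only sees keys in its window."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    outputs = []
    for q, window in zip(queries, neighbors):
        scores = [scale * sum(a * b for a, b in zip(q, keys[j])) for j in window]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for wgt, j in zip(weights, window):
            for d, v in enumerate(values[j]):
                out[d] += wgt / total * v
        outputs.append(out)
    return outputs


def attention_pairs(height, width, radius):
    """Return (local, full) counts of query-key pairs for the grid."""
    n = height * width
    local = sum(len(win) for win in local_neighbors(height, width, radius))
    return local, n * n
```

Because the number of keys per query is bounded by the window size regardless of the grid size, the total number of query-key pairs grows linearly with the number of tokens, whereas full attention grows quadratically. Calling `attention_pairs` on grids of increasing size with a fixed `radius` makes that difference visible.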

Abstract (original)

Diffusion Transformers (DiT) have become a leading architecture in image generation. However, the quadratic complexity of attention mechanisms, which are responsible for modeling token-wise relationships, results in significant latency when generating high-resolution images. To address this issue, we aim at a linear attention mechanism in this paper that reduces the complexity of pre-trained DiTs to linear. We begin our exploration with a comprehensive summary of existing efficient attention mechanisms and identify four key factors crucial for successful linearization of pre-trained DiTs: locality, formulation consistency, high-rank attention maps, and feature integrity. Based on these insights, we introduce a convolution-like local attention strategy termed CLEAR, which limits feature interactions to a local window around each query token, and thus achieves linear complexity. Our experiments indicate that, by fine-tuning the attention layer on merely 10K self-generated samples for 10K iterations, we can effectively transfer knowledge from a pre-trained DiT to a student model with linear complexity, yielding results comparable to the teacher model. Simultaneously, it reduces attention computations by 99.5% and accelerates generation by 6.3 times for generating 8K-resolution images. Furthermore, we investigate favorable properties in the distilled attention layers, such as zero-shot generalization cross various models and plugins, and improved support for multi-GPU parallel inference. Models and codes are available here: https://github.com/Huage001/CLEAR.

arXiv information

Authors	Songhua Liu, Zhenxiong Tan, Xinchao Wang
Published	2024-12-20 17:57:09+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
