Scaling Down Text Encoders of Text-to-Image Diffusion Models

Summary

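Text encoders in diffusion models have evolved rapidly, moving from CLIP to T5-XXL.
This evolution has greatly improved the models' ability to understand complex prompts and render text, but it has also greatly increased the parameter count.
Although T5-series encoders are trained on the C4 natural language corpus, which contains a large amount of non-visual data, diffusion models with a T5 encoder do not respond to such non-visual prompts, which indicates redundancy in representational power.
This raises an important question: "Do we really need such a large text encoder?"
To answer it, we use vision-based knowledge distillation to train a series of T5 encoder models.
To fully inherit the capabilities of the large encoder, we built our dataset around three criteria: image quality, semantic understanding, and text rendering.
Our results show a scaling-down pattern: the distilled T5-base model generates images of quality comparable to those produced with T5-XXL while being 50 times smaller.
This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.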

Abstract (original)

Text encoders in diffusion models have rapidly evolved, transitioning from CLIP to T5-XXL. Although this evolution has significantly enhanced the models’ ability to understand complex prompts and generate text, it also leads to a substantial increase in the number of parameters. Despite T5 series encoders being trained on the C4 natural language corpus, which includes a significant amount of non-visual data, diffusion models with T5 encoder do not respond to those non-visual prompts, indicating redundancy in representational power. Therefore, it raises an important question: ‘Do we really need such a large text encoder?’ In pursuit of an answer, we employ vision-based knowledge distillation to train a series of T5 encoder models. To fully inherit its capabilities, we constructed our dataset based on three criteria: image quality, semantic understanding, and text-rendering. Our results demonstrate the scaling down pattern that the distilled T5-base model can generate images of comparable quality to those produced by T5-XXL, while being 50 times smaller in size. This reduction in model size significantly lowers the GPU requirements for running state-of-the-art models such as FLUX and SD3, making high-quality text-to-image generation more accessible.

arXiv information

Authors	Lifu Wang, Daqing Liu, Xinchen Liu, Xiaodong He
Published	2025-03-25 17:55:20+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
