Fine-Tuning Visual Autoregressive Models for Subject-Driven Generation

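Summary

Recent advances in text-to-image generative models have enabled many practical applications, including subject-driven generation, in which a pretrained model is fine-tuned to capture the semantics of a subject from only a few examples. Diffusion-based models produce high-quality images, but their many denoising steps incur significant computational overhead, which limits real-world use. Visual autoregressive (VAR) models, which predict the tokens of the next scale rather than spatially adjacent tokens, offer significantly faster inference and are better suited to practical deployment. This paper proposes the first VAR-based approach to subject-driven generation. Naively fine-tuning VAR, however, leads to computational overhead, language drift, and reduced diversity. To address these problems, the authors introduce selective layer tuning, which reduces complexity, and prior distillation, which mitigates language drift. They also find that the early stages influence the generation of the subject more strongly than the later stages, which merely synthesize local details. Based on this finding, they propose scale-wise weighted tuning, which prioritizes coarser resolutions so that the model focuses on subject-relevant information rather than local details. Extensive experiments show that the method significantly outperforms diffusion-based baselines across various metrics and demonstrate its practical usefulness.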

Summary (original)

Recent advances in text-to-image generative models have enabled numerous practical applications, including subject-driven generation, which fine-tunes pretrained models to capture subject semantics from only a few examples. While diffusion-based models produce high-quality images, their extensive denoising steps result in significant computational overhead, limiting real-world applicability. Visual autoregressive (VAR) models, which predict next-scale tokens rather than spatially adjacent ones, offer significantly faster inference suitable for practical deployment. In this paper, we propose the first VAR-based approach for subject-driven generation. However, naïve fine-tuning VAR leads to computational overhead, language drift, and reduced diversity. To address these challenges, we introduce selective layer tuning to reduce complexity and prior distillation to mitigate language drift. Additionally, we found that the early stages have a greater influence on the generation of subject than the latter stages, which merely synthesize local details. Based on this finding, we propose scale-wise weighted tuning, which prioritizes coarser resolutions for promoting the model to focus on the subject-relevant information instead of local details. Extensive experiments validate that our method significantly outperforms diffusion-based baselines across various metrics and demonstrates its practical usage.
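
The scale-wise weighted tuning described above can be illustrated with a small sketch. The abstract does not give the weighting formula, so everything below is an assumption made for illustration: the geometric schedule, the `decay` parameter, and the function names are hypothetical and are not taken from the paper. The sketch only shows the general idea of averaging the token-level cross-entropy within each scale and then giving coarser (earlier) scales a larger share of the total loss.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of a single token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def scale_weights(num_scales, decay=0.5):
    """Hypothetical schedule (an assumption, not the paper's formula):
    scale k gets weight decay**k, normalized so the weights sum to 1.
    Scale 0 is the coarsest, so it receives the largest weight."""
    raw = [decay ** k for k in range(num_scales)]
    total = sum(raw)
    return [w / total for w in raw]


def scale_weighted_loss(logits_per_scale, targets_per_scale, decay=0.5):
    """Weighted sum over scales of the mean token cross-entropy.

    logits_per_scale[k] is a list of per-token logit lists for scale k;
    targets_per_scale[k] is the matching list of target token indices.
    """
    weights = scale_weights(len(logits_per_scale), decay)
    loss = 0.0
    for w, logits, targets in zip(weights, logits_per_scale, targets_per_scale):
        per_token = [cross_entropy(l, t) for l, t in zip(logits, targets)]
        loss += w * sum(per_token) / len(per_token)
    return loss


if __name__ == "__main__":
    # Three scales with 1, 4, and 9 tokens over a toy vocabulary of size 2.
    logits = [[[0.0, 0.0]] * n for n in (1, 4, 9)]
    targets = [[0] * n for n in (1, 4, 9)]
    print(scale_weights(3))
    print(scale_weighted_loss(logits, targets))
```

With `decay=1.0` the schedule reduces to a uniform average over scales, which gives a convenient baseline for comparing against a coarse-weighted setting.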

arXiv information

Authors	Jiwoo Chung, Sangeek Hyun, Hyunjun Kim, Eunseo Koh, MinKyu Lee, Jae-Pil Heo
Published	2025-04-03 14:12:55+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, DeepL
