Next Patch Prediction for Autoregressive Visual Generation

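Abstract

Autoregressive models built on the Next Token Prediction (NTP) paradigm show great potential for developing a unified framework that integrates both language and vision tasks.
In this work, we rethink NTP for autoregressive image generation and propose a new Next Patch Prediction (NPP) paradigm.
Our key idea is to group image tokens and aggregate them into patch tokens with high information density.
Using patch tokens as a shorter input sequence, the autoregressive model is trained to predict the next patch, which significantly reduces the computational cost.
We further propose a multi-scale coarse-to-fine patch grouping strategy that exploits the natural hierarchical property of image data.
Experiments on a range of models (100M to 1.4B parameters) demonstrate that the next patch prediction paradigm can reduce the training cost to around 0.6 times the original while improving image generation quality by up to 1.0 FID score on the ImageNet benchmark.
We highlight that our method retains the original autoregressive model architecture without introducing additional trainable parameters or designing a custom image tokenizer, which ensures flexibility and seamless adaptation to various autoregressive models for visual generation.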

Abstract (original)

Autoregressive models, built based on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks. In this work, we rethink the NTP for autoregressive image generation and propose a novel Next Patch Prediction (NPP) paradigm. Our key idea is to group and aggregate image tokens into patch tokens containing high information density. With patch tokens as a shorter input sequence, the autoregressive model is trained to predict the next patch, thereby significantly reducing the computational cost. We further propose a multi-scale coarse-to-fine patch grouping strategy that exploits the natural hierarchical property of image data. Experiments on a diverse range of models (100M-1.4B parameters) demonstrate that the next patch prediction paradigm could reduce the training cost to around 0.6 times while improving image generation quality by up to 1.0 FID score on the ImageNet benchmark. We highlight that our method retains the original autoregressive model architecture without introducing additional trainable parameters or specifically designing a custom image tokenizer, thus ensuring flexibility and seamless adaptation to various autoregressive models for visual generation.
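The grouping step described in the abstract can be illustrated with a small sketch. The abstract does not say which aggregation operator is used, so element-wise averaging of token embeddings within each patch is an assumption here, as are the raster-order token layout and the hypothetical name `group_tokens_into_patches`. The sketch only shows how a `grid x grid` token map shrinks to `(grid / patch)^2` patch tokens; it is not the authors' implementation.

```python
from typing import List

Vector = List[float]


def group_tokens_into_patches(tokens: List[Vector], grid: int, patch: int) -> List[Vector]:
    """Average every patch x patch block of a grid x grid token map.

    Assumptions (not taken from the abstract): tokens are stored in raster
    order, and aggregation is an element-wise mean of the embeddings.
    """
    if len(tokens) != grid * grid:
        raise ValueError("expected grid * grid tokens")
    if grid % patch != 0:
        raise ValueError("grid must be divisible by patch")

    dim = len(tokens[0])
    patches: List[Vector] = []
    # Visit patches in raster order, then the tokens inside each patch.
    for top in range(0, grid, patch):
        for left in range(0, grid, patch):
            acc = [0.0] * dim
            for r in range(top, top + patch):
                for c in range(left, left + patch):
                    tok = tokens[r * grid + c]
                    for d in range(dim):
                        acc[d] += tok[d]
            patches.append([v / (patch * patch) for v in acc])
    return patches


# Toy example: a 4x4 map of 1-dimensional "embeddings" 0.0 .. 15.0.
tokens = [[float(i)] for i in range(16)]
patch_tokens = group_tokens_into_patches(tokens, grid=4, patch=2)
print(len(tokens), "->", len(patch_tokens))  # 16 -> 4
print(patch_tokens)  # [[2.5], [4.5], [10.5], [12.5]]
```

With `patch=1` the function returns the original token sequence unchanged, so a coarse-to-fine schedule could in principle move from larger patch sizes toward 1 during training; the actual schedule is not given in the abstract.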

arXiv information

Authors	Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan
Published	2025-01-02 12:14:43+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
