SynerGen-VL: Towards Synergistic Image Understanding and Generation with Vision Experts and Token Folding

Abstract

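The remarkable success of large language models (LLMs) has extended to the multimodal domain, where they achieve strong performance in image understanding and generation.
Recent efforts to develop unified multimodal large language models (MLLMs) that integrate these capabilities have shown promising results.
However, existing approaches often involve complex designs in the model architecture or training pipeline, which makes the models harder to train and scale.
In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation.
To address challenges identified in existing encoder-free unified MLLMs, we introduce a token folding mechanism and a vision-expert-based progressive alignment pretraining strategy, which together support high-resolution image understanding effectively while reducing training complexity.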
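After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL matches or surpasses the performance of existing encoder-free unified MLLMs at comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs.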
Our code and models will be released.

Abstract (original)

The remarkable success of Large Language Models (LLMs) has extended to the multimodal domain, achieving outstanding performance in image understanding and generation. Recent efforts to develop unified Multimodal Large Language Models (MLLMs) that integrate these capabilities have shown promising results. However, existing approaches often involve complex designs in model architecture or training pipeline, increasing the difficulty of model training and scaling. In this paper, we propose SynerGen-VL, a simple yet powerful encoder-free MLLM capable of both image understanding and generation. To address challenges identified in existing encoder-free unified MLLMs, we introduce the token folding mechanism and the vision-expert-based progressive alignment pretraining strategy, which effectively support high-resolution image understanding while reducing training complexity. After being trained on large-scale mixed image-text data with a unified next-token prediction objective, SynerGen-VL achieves or surpasses the performance of existing encoder-free unified MLLMs with comparable or smaller parameter sizes, and narrows the gap with task-specific state-of-the-art models, highlighting a promising path toward future unified MLLMs. Our code and models shall be released.

arXiv information

Authors	Hao Li, Changyao Tian, Jie Shao, Xizhou Zhu, Zhaokai Wang, Jinguo Zhu, Wenhan Dou, Xiaogang Wang, Hongsheng Li, Lewei Lu, Jifeng Dai
Published	2024-12-12 18:59:26+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google

