TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation

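Abstract

Pioneering token-based works such as Chameleon and Emu3 have laid a foundation for multimodal unification, but they suffer from high training compute overhead and limited comprehension performance because they lack high-level semantics.
This paper introduces TokLIP, a visual tokenizer that improves comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics, while still enabling end-to-end multimodal autoregressive training with standard VQ tokens.
TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics.
Unlike previous approaches that discretize high-level features (e.g., VILA-U), TokLIP disentangles the training objectives for comprehension and generation, so advanced VQ tokenizers can be applied directly without tailored quantization operations.
Our empirical results show that TokLIP achieves exceptional data efficiency, equips visual tokens with high-level semantic understanding, and enhances low-level generative capacity, making it well suited to autoregressive Transformers for both comprehension and generation tasks.
The code and models are available at https://github.com/TencentARC/TokLIP.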

Abstract (original)

Pioneering token-based works such as Chameleon and Emu3 have established a foundation for multimodal unification but face challenges of high training computational overhead and limited comprehension performance due to a lack of high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer that enhances comprehension by semanticizing vector-quantized (VQ) tokens and incorporating CLIP-level semantics while enabling end-to-end multimodal autoregressive training with standard VQ tokens. TokLIP integrates a low-level discrete VQ tokenizer with a ViT-based token encoder to capture high-level continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize high-level features, TokLIP disentangles training objectives for comprehension and generation, allowing the direct application of advanced VQ tokenizers without the need for tailored quantization operations. Our empirical results demonstrate that TokLIP achieves exceptional data efficiency, empowering visual tokens with high-level semantic understanding while enhancing low-level generative capacity, making it well-suited for autoregressive Transformers in both comprehension and generation tasks. The code and models are available at https://github.com/TencentARC/TokLIP.
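
The two-stage idea described above (discrete VQ tokens first, continuous semantics on top) can be illustrated with a toy sketch. This is not the TokLIP implementation: the function names `quantize` and `semantic_feature`, the tiny codebook, and the mean-pool-plus-projection "encoder" are all assumptions made for illustration, standing in for a real VQ tokenizer and a ViT-based token encoder.

```python
# Hypothetical sketch: every name, size, and the toy "encoder" here is an
# assumption for illustration, not code from the TokLIP repository.
import math


def quantize(patches, codebook):
    """Map each patch vector to the index of its nearest codebook entry (squared L2)."""
    ids = []
    for p in patches:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in codebook]
        ids.append(dists.index(min(dists)))
    return ids


def semantic_feature(token_ids, codebook, proj):
    """Stand-in for a token encoder: embed ids, mean-pool, project, L2-normalize."""
    dim = len(codebook[0])
    pooled = [sum(codebook[i][d] for i in token_ids) / len(token_ids) for d in range(dim)]
    out = [sum(w * x for w, x in zip(row, pooled)) for row in proj]
    norm = math.sqrt(sum(v * v for v in out)) or 1.0
    return [v / norm for v in out]


# Toy data: 3 codebook entries, 3 image "patches", a 3x2 projection matrix.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
patches = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.05]]
proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

token_ids = quantize(patches, codebook)                 # discrete ids, left untouched
feature = semantic_feature(token_ids, codebook, proj)   # continuous, unit-length vector
print(token_ids, feature)
```

In this sketch the discrete ids are produced once and never modified by the semantic stage, which loosely mirrors the abstract's point that the quantizer needs no tailored changes; in a real system the continuous feature would be trained against CLIP-style targets rather than computed with a fixed projection.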

arXiv information

Authors	Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Published	2025-05-08 17:12:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
