Multimodal Latent Language Modeling with Next-Token Diffusion

Summary

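Multimodal generative models need a unified approach to handle both discrete data (such as text and code) and continuous data (such as images, audio, and video).
This work proposes Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers.
Specifically, a variational autoencoder (VAE) represents continuous data as latent vectors, and next-token diffusion is introduced to generate these vectors autoregressively.
In addition, the authors develop $\sigma$-VAE to address variance collapse, which is crucial for autoregressive modeling.
Extensive experiments demonstrate the effectiveness of LatentLM across a range of modalities.
In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability.
When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding.
Experimental results show that, when the number of training tokens is scaled up, LatentLM performs favorably compared with Transfusion and vector-quantized models.
In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness while requiring one tenth as many decoding steps.
These results establish LatentLM as a highly effective and scalable approach for advancing large multimodal models.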

Summary (original)

Multimodal generative models require a unified approach to handle both discrete data (e.g., text and code) and continuous data (e.g., image, audio, video). In this work, we propose Latent Language Modeling (LatentLM), which seamlessly integrates continuous and discrete data using causal Transformers. Specifically, we employ a variational autoencoder (VAE) to represent continuous data as latent vectors and introduce next-token diffusion for autoregressive generation of these vectors. Additionally, we develop $\sigma$-VAE to address the challenges of variance collapse, which is crucial for autoregressive modeling. Extensive experiments demonstrate the effectiveness of LatentLM across various modalities. In image generation, LatentLM surpasses Diffusion Transformers in both performance and scalability. When integrated into multimodal large language models, LatentLM provides a general-purpose interface that unifies multimodal generation and understanding. Experimental results show that LatentLM achieves favorable performance compared to Transfusion and vector quantized models in the setting of scaling up training tokens. In text-to-speech synthesis, LatentLM outperforms the state-of-the-art VALL-E 2 model in speaker similarity and robustness, while requiring 10x fewer decoding steps. The results establish LatentLM as a highly effective and scalable approach to advance large multimodal models.

arXiv information

Authors	Yutao Sun, Hangbo Bao, Wenhui Wang, Zhiliang Peng, Li Dong, Shaohan Huang, Jianyong Wang, Furu Wei
Published	2024-12-11 18:57:32+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
