Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models

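Summary

The development of large language models (LLMs) has expanded to multi-modal systems that can process text, images, and speech within a unified framework.
Training these models requires significantly larger datasets and more computational resources than training text-only LLMs.
To address these scaling challenges, the authors introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces the computational cost of pretraining.
MoT decouples the model's non-embedding parameters by modality (feed-forward networks, attention matrices, and layer normalization), which enables modality-specific processing while keeping global self-attention over the full input sequence.
MoT is evaluated across multiple settings and model scales.
In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline's performance using only 55.8% of the FLOPs.
When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs.
In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the dense baseline's image-modality performance with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics.
System profiling further highlights MoT's practical benefits: it reaches the dense baseline's image quality in 47.2% of the wall-clock time and its text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).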
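
The modality decoupling described in the summary can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`init_params`, `per_modality_linear`, `mot_layer`), the `modality_ids` routing array, and the choices of a single attention head, pre-normalization, a ReLU feed-forward block, and a causal mask are all assumptions made for illustration. The only parts taken from the abstract are that layer normalization, attention projections, and feed-forward weights are separate per modality, while self-attention itself runs over the whole mixed sequence.

```python
import numpy as np


def layer_norm(x, gain, bias, eps=1e-5):
    """Normalize each token vector, then apply per-token gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gain + bias


def init_params(num_modalities, d_model, d_ff, rng):
    """Create one set of non-embedding weights per modality (hypothetical layout)."""
    def w(*shape):
        # Scale by fan-in so activations stay in a reasonable range.
        return rng.standard_normal(shape) / np.sqrt(shape[-2])

    m = num_modalities
    return {
        "wq": w(m, d_model, d_model), "wk": w(m, d_model, d_model),
        "wv": w(m, d_model, d_model), "wo": w(m, d_model, d_model),
        "w1": w(m, d_model, d_ff), "w2": w(m, d_ff, d_model),
        "ln1_g": np.ones((m, d_model)), "ln1_b": np.zeros((m, d_model)),
        "ln2_g": np.ones((m, d_model)), "ln2_b": np.zeros((m, d_model)),
    }


def per_modality_linear(x, weights, modality_ids):
    """Apply weights[m] to exactly those tokens whose modality id is m."""
    out = np.zeros((x.shape[0], weights.shape[-1]))
    for m in range(weights.shape[0]):
        mask = modality_ids == m
        out[mask] = x[mask] @ weights[m]
    return out


def mot_layer(x, modality_ids, p, causal=True):
    """One simplified layer: modality-specific weights, global self-attention."""
    num_tokens, d_model = x.shape

    # Modality-specific layer norm and Q/K/V projections.
    h = layer_norm(x, p["ln1_g"][modality_ids], p["ln1_b"][modality_ids])
    q = per_modality_linear(h, p["wq"], modality_ids)
    k = per_modality_linear(h, p["wk"], modality_ids)
    v = per_modality_linear(h, p["wv"], modality_ids)

    # Global self-attention: every token scores against tokens of all modalities.
    scores = q @ k.T / np.sqrt(d_model)
    if causal:
        allowed = np.tril(np.ones((num_tokens, num_tokens), dtype=bool))
        scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    # Modality-specific output projection, then residual connection.
    x = x + per_modality_linear(attn @ v, p["wo"], modality_ids)

    # Modality-specific layer norm and feed-forward network.
    h = layer_norm(x, p["ln2_g"][modality_ids], p["ln2_b"][modality_ids])
    hidden = np.maximum(per_modality_linear(h, p["w1"], modality_ids), 0.0)
    return x + per_modality_linear(hidden, p["w2"], modality_ids)


# Usage: a 5-token sequence mixing text (id 0) and image (id 1) tokens.
rng = np.random.default_rng(0)
params = init_params(num_modalities=2, d_model=8, d_ff=16, rng=rng)
tokens = rng.standard_normal((5, 8))
modality_ids = np.array([0, 0, 1, 1, 0])
output = mot_layer(tokens, modality_ids, params)
```

In this sketch each token passes through only one modality's weights, so the per-token compute is the same as in a dense layer of equal width even though the total parameter count grows with the number of modalities. The attention step is the only place where tokens of different modalities interact.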

Summary (original)

The development of large language models (LLMs) has expanded to multi-modal systems capable of processing text, images, and speech within a unified framework. Training these models demands significantly larger datasets and computational resources compared to text-only LLMs. To address the scaling challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal transformer architecture that significantly reduces pretraining computational costs. MoT decouples non-embedding parameters of the model by modality (including feed-forward networks, attention matrices, and layer normalization), enabling modality-specific processing with global self-attention over the full input sequence. We evaluate MoT across multiple settings and model scales. In the Chameleon 7B setting (autoregressive text-and-image generation), MoT matches the dense baseline’s performance using only 55.8% of the FLOPs. When extended to include speech, MoT reaches speech performance comparable to the dense baseline with only 37.2% of the FLOPs. In the Transfusion setting, where text and image are trained with different objectives, a 7B MoT model matches the image modality performance of the dense baseline with one third of the FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image generation metrics. System profiling further highlights MoT’s practical benefits, achieving dense baseline image quality in 47.2% of the wall-clock time and text quality in 75.6% of the wall-clock time (measured on AWS p4de.24xlarge instances with NVIDIA A100 GPUs).

arXiv information

Authors	Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
Published	2024-11-07 18:59:06+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
