OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Abstract

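Recent progress in unified multimodal understanding and visual generation (or multimodal generation) models has been held back by their quadratic computational complexity and their reliance on large-scale training data.
We introduce OmniMamba, the first multimodal generation model built on a linear architecture, which generates both text and images through a unified next-token prediction paradigm.
The model takes full advantage of Mamba-2's high computational and memory efficiency, extending it from text generation to multimodal generation.
To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation.
In addition, we introduce a decoupled two-stage training strategy to mitigate the data imbalance between the two tasks.
With these techniques, OmniMamba achieves performance competitive with JanusFlow while surpassing Show-o across benchmarks, despite being trained on only 2M image-text pairs, 1,000 times fewer than Show-o.
Notably, OmniMamba stands out for its inference efficiency, achieving up to a 119.2x speedup and a 63% reduction in GPU memory for long-sequence generation compared with Transformer-based counterparts.
Code and models are released at https://github.com/hustvl/OmniMamba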
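
The gap between quadratic and linear complexity that motivates this work can be illustrated with a toy operation count. This is only a sketch and not the paper's benchmark: `attention_steps` and `recurrent_steps` are hypothetical helper names, and they count abstract token interactions rather than measuring real latency or GPU memory, so the ratios printed here are unrelated to the reported 119.2x figure.

```python
def attention_steps(seq_len: int) -> int:
    """Token interactions for causal self-attention during autoregressive decoding.

    Generating token t attends to all t tokens so far, so the total is 1 + 2 + ... + n.
    """
    return sum(range(1, seq_len + 1))


def recurrent_steps(seq_len: int) -> int:
    """State updates for a recurrent (state-space style) model: one fixed-size update per token."""
    return seq_len


for n in (1_000, 10_000, 100_000):
    ratio = attention_steps(n) / recurrent_steps(n)
    print(f"n={n:>7}: attention={attention_steps(n):>13,} recurrent={recurrent_steps(n):>7,} ratio={ratio:,.1f}")
```

Under this simplified count the ratio grows as (n + 1) / 2, which is why the advantage of a linear architecture is most visible in long-sequence generation.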

Abstract (original)

Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2’s high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, which is 1,000 times fewer than Show-o. Notably, OmniMamba stands out with outstanding inference efficiency, achieving up to a 119.2 times speedup and 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba
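
The two innovations named in the abstract, decoupled vocabularies and task-specific LoRA, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class names `TaskLoRALinear` and `DecoupledHeads`, the task labels, the layer the adapters attach to, and all dimensions and vocabulary sizes are hypothetical and chosen only to keep the example small.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng, scale=0.02):
    """Build a rows x cols matrix of small Gaussian values."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class TaskLoRALinear:
    """Shared weight W plus one low-rank update B[task] @ A[task] per task (assumed layout)."""

    def __init__(self, d_in, d_out, rank, tasks, seed=0):
        rng = random.Random(seed)
        self.W = random_matrix(d_out, d_in, rng)  # shared across tasks, treated as frozen
        # Common LoRA initialisation: A random, B zero, so each adapter starts as a no-op.
        self.A = {t: random_matrix(rank, d_in, rng) for t in tasks}
        self.B = {t: [[0.0] * rank for _ in range(d_out)] for t in tasks}

    def forward(self, x, task):
        base = matvec(self.W, x)
        delta = matvec(self.B[task], matvec(self.A[task], x))  # only this task's adapter is used
        return [b + d for b, d in zip(base, delta)]


class DecoupledHeads:
    """One output projection per modality, each with its own vocabulary size."""

    def __init__(self, d_model, vocab_sizes, seed=0):
        rng = random.Random(seed)
        self.heads = {m: random_matrix(size, d_model, rng) for m, size in vocab_sizes.items()}

    def logits(self, hidden, modality):
        return matvec(self.heads[modality], hidden)


# Toy sizes, purely illustrative.
layer = TaskLoRALinear(d_in=4, d_out=3, rank=2, tasks=["understanding", "generation"])
heads = DecoupledHeads(d_model=3, vocab_sizes={"text": 10, "image": 16})

x = [1.0, 2.0, 3.0, 4.0]
hidden = layer.forward(x, task="generation")
image_logits = heads.logits(hidden, modality="image")
print(len(image_logits))  # one logit per entry of the image vocabulary
```

The point of the sketch is the routing: the shared weight is reused by both tasks, each task trains only its own small `A` and `B` matrices, and the model scores next tokens against the vocabulary of the modality it is currently generating.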

arXiv information

Authors	Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang
Published	2025-03-11 17:59:46+00:00

Source and services used

arxiv.jp, Google