ARMOR: Empowering Multimodal Understanding Model with Interleaved Multimodal Generation Capability

Abstract

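Unified multimodal understanding and generation has recently attracted considerable attention in the vision-and-language field.
Existing unified models (UniMs) are designed to learn multimodal understanding and generation simultaneously, which demands substantial computational resources, and they often struggle to generate interleaved text and images.
We present ARMOR, a resource-efficient, purely autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs).
Specifically, ARMOR extends existing MLLMs from three perspectives. (1) For the model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced; it unifies the embedding space across the textual and visual modalities so that natural interleaved text-image generation is possible with minimal computational overhead.
(2) For the training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs.
(3) For the training algorithm, a "what or how to generate" algorithm is proposed; through three progressive training stages based on the collected dataset, it empowers existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities.
Experimental results show that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities while using limited training resources.
The code will be released soon at https://github.com/finyorko/armor.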

Abstract (original)

Unified multimodal understanding and generation have recently received much attention in the area of vision and language. Existing UniMs are designed to simultaneously learn both multimodal understanding and generation capabilities, demanding substantial computational resources, and often struggle to generate interleaved text-image. We present ARMOR, a resource-efficient and pure autoregressive framework that achieves both understanding and generation by fine-tuning existing multimodal large language models (MLLMs). Specifically, ARMOR extends existing MLLMs from three perspectives: (1) For model architecture, an asymmetric encoder-decoder architecture with a forward-switching mechanism is introduced to unify embedding space integrating textual and visual modalities for enabling natural text-image interleaved generation with minimal computational overhead. (2) For training data, a meticulously curated, high-quality interleaved dataset is collected for fine-tuning MLLMs. (3) For the training algorithm, we propose a “what or how to generate” algorithm to empower existing MLLMs with multimodal generation capabilities while preserving their multimodal understanding capabilities, through three progressive training stages based on the collected dataset. Experimental results demonstrate that ARMOR upgrades existing MLLMs to UniMs with promising image generation capabilities, using limited training resources. Our code will be released soon at https://github.com/finyorko/armor.
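
The abstract does not spell out how the forward-switching mechanism works, so the following is only a toy sketch of one possible reading: a single autoregressive loop that switches between a text output head and an image output head whenever a boundary token is produced. The token names (`<img_start>`, `<img_end>`, `<eos>`), the function `interleaved_decode`, and the scripted stand-in heads are all hypothetical and are not taken from the ARMOR code.

```python
from typing import Callable, List

# Hypothetical special tokens; the abstract does not name the actual ones.
IMG_START = "<img_start>"
IMG_END = "<img_end>"
EOS = "<eos>"

# A "head" maps the sequence generated so far to the next token.
Head = Callable[[List[str]], str]


def interleaved_decode(text_head: Head, image_head: Head,
                       prompt: List[str], max_steps: int = 32) -> List[str]:
    """Autoregressive loop that switches the active output head by modality."""
    seq = list(prompt)
    in_image = False  # False: text head is active, True: image head is active
    for _ in range(max_steps):
        head = image_head if in_image else text_head
        tok = head(seq)
        seq.append(tok)
        if tok == IMG_START:
            in_image = True       # following tokens come from the image head
        elif tok == IMG_END:
            in_image = False      # switch back to the text head
        elif tok == EOS:
            break
    return seq


def toy_text_head(seq: List[str]) -> str:
    """Scripted stand-in for a text head: two words, then request an image."""
    if IMG_END in seq:
        return EOS
    if len(seq) < 2:
        return f"w{len(seq)}"
    return IMG_START


def toy_image_head(seq: List[str]) -> str:
    """Scripted stand-in for an image head: three visual tokens, then close."""
    emitted = len(seq) - seq.index(IMG_START) - 1
    return f"v{emitted}" if emitted < 3 else IMG_END


result = interleaved_decode(toy_text_head, toy_image_head, [])
print(result)
```

In a real model both heads would share one backbone and sample from learned distributions; the scripted heads here exist only to make the switching logic visible in the output sequence.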

arXiv information

Authors	Jianwen Sun, Yukang Feng, Chuanhao Li, Fanrui Zhang, Zizhen Li, Jiaxin Ai, Sizhuo Zhou, Yu Dai, Shenglin Zhang, Kaipeng Zhang
Published	2025-06-06 15:03:44+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
