MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

Summary

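Developing generative models for interleaved image-text data has both research and practical value.
Such models must understand interleaved sequences and then generate images and text.
However, existing attempts are limited because a fixed number of visual tokens cannot efficiently capture image details, which is especially problematic in multi-image scenarios.
To address this, the paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data.
It introduces a multi-scale, multi-image feature synchronizer module that gives the model direct access to fine-grained image features from the preceding context during generation.
MM-Interleaved is pre-trained end-to-end on both paired and interleaved image-text corpora.
It is further enhanced through a supervised fine-tuning phase, which improves the model's ability to follow complex multi-modal instructions.
Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details when following multi-modal instructions and in generating consistent images under both textual and visual conditions.
Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.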

Abstract (original)

Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that the fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in the multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale and multi-image feature synchronizer module, allowing direct access to fine-grained image features in the previous context during the generation process. MM-Interleaved is end-to-end pre-trained on both paired and interleaved image-text corpora. It is further enhanced through a supervised fine-tuning phase, wherein the model improves its ability to follow complex multi-modal instructions. Experiments demonstrate the versatility of MM-Interleaved in recognizing visual details following multi-modal instructions and generating consistent images following both textual and visual conditions. Code and models are available at \url{https://github.com/OpenGVLab/MM-Interleaved}.
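The abstract states only that the feature synchronizer lets generation access fine-grained, multi-scale features of images in the previous context; it does not spell out the mechanism. The following is a minimal sketch of that idea under stated assumptions: it uses plain dense softmax attention as a stand-in, which may differ from what MM-Interleaved actually does, and every name here (`synchronize`, the `pos` and `scales` fields) is hypothetical rather than taken from the released code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def synchronize(query, query_pos, images):
    """Let one token attend to multi-scale features of earlier images.

    query:     feature vector of the current token (list of floats)
    query_pos: position of that token in the interleaved sequence
    images:    list of {"pos": int, "scales": [[vector, ...], ...]}
               (hypothetical layout: one list of vectors per scale)
    """
    # Collect features from every scale of every image that appears
    # before the current token; later images are never visible.
    keys = [
        vec
        for img in images
        if img["pos"] < query_pos
        for scale in img["scales"]
        for vec in scale
    ]
    if not keys:
        return list(query)  # no earlier image: token is left unchanged

    # Scaled dot-product attention over the collected features.
    norm = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / norm for key in keys]
    weights = softmax(scores)
    attended = [
        sum(w * key[d] for w, key in zip(weights, keys))
        for d in range(len(query))
    ]

    # Residual connection: add the gathered image detail to the token.
    return [q + a for q, a in zip(query, attended)]
```

Because the set of attended features grows with the number of earlier images and scales, the token is not restricted to a fixed number of visual tokens per image, which is the limitation the abstract highlights.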

arXiv information

Authors	Changyao Tian, Xizhou Zhu, Yuwen Xiong, Weiyun Wang, Zhe Chen, Wenhai Wang, Yuntao Chen, Lewei Lu, Tong Lu, Jie Zhou, Hongsheng Li, Yu Qiao, Jifeng Dai
Published	2024-01-18 18:50:16+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
