MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

Abstract

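This work proposes Visual-Predictive Instruction Tuning (VPiT), a simple and effective extension of visual instruction tuning that lets a pretrained LLM quickly morph into a unified autoregressive model capable of generating both text tokens and visual tokens.
VPiT teaches the LLM to predict discrete text tokens and continuous visual tokens from input sequences of image and text data curated in an instruction-following format.
Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding and can be unlocked efficiently with a small amount of generation data;
(2) understanding and generation are mutually beneficial, but understanding data contributes to both capabilities more effectively than generation data.
Building on these findings, we train the MetaMorph model and achieve competitive performance on both visual understanding and visual generation.
In visual generation, MetaMorph can draw on the world knowledge and reasoning abilities acquired during LLM pretraining and overcome common failure modes exhibited by other generation models.
Our results suggest that LLMs may have strong "prior" vision capabilities that can be efficiently adapted to both visual understanding and generation through a relatively simple instruction tuning process.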

Abstract (original)

In this work, we propose Visual-Predictive Instruction Tuning (VPiT) – a simple and effective extension to visual instruction tuning that enables a pretrained LLM to quickly morph into a unified autoregressive model capable of generating both text and visual tokens. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data curated in an instruction-following format. Our empirical investigation reveals several intriguing properties of VPiT: (1) visual generation ability emerges as a natural byproduct of improved visual understanding, and can be unlocked efficiently with a small amount of generation data; (2) while we find understanding and generation to be mutually beneficial, understanding data contributes to both capabilities more effectively than generation data. Building upon these findings, we train our MetaMorph model and achieve competitive performance on both visual understanding and generation. In visual generation, MetaMorph can leverage the world knowledge and reasoning abilities gained from LLM pretraining, and overcome common failure modes exhibited by other generation models. Our results suggest that LLMs may have strong ‘prior’ vision capabilities that can be efficiently adapted to both visual understanding and generation with a relatively simple instruction tuning process.
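
The abstract says VPiT trains one autoregressive model to predict discrete text tokens and continuous visual tokens in the same sequence. The sketch below illustrates what such a mixed objective could look like. It is not the authors' implementation: the abstract does not name the loss functions, so the cross-entropy term for text positions, the cosine-distance term for visual positions, the equal weighting of the two, and all function and parameter names are assumptions made for illustration.

```python
import math
from typing import Sequence, Union

# Hypothetical target encoding: an int is a discrete text-token id,
# a sequence of floats is a continuous visual-token embedding.
Target = Union[int, Sequence[float]]


def cross_entropy(logits: Sequence[float], target_id: int) -> float:
    """Negative log-likelihood of target_id under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_id]


def cosine_distance(pred: Sequence[float], target: Sequence[float]) -> float:
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    if norm == 0.0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / norm


def mixed_sequence_loss(
    text_logits: Sequence[Sequence[float]],
    visual_preds: Sequence[Sequence[float]],
    targets: Sequence[Target],
) -> float:
    """Average per-position loss over a sequence that mixes both token types.

    Position i uses the text head (text_logits[i]) when targets[i] is an int
    and the visual head (visual_preds[i]) when targets[i] is a vector.
    """
    total = 0.0
    for logits, pred, target in zip(text_logits, visual_preds, targets):
        if isinstance(target, int):
            total += cross_entropy(logits, target)   # discrete text token
        else:
            total += cosine_distance(pred, target)   # continuous visual token
    return total / len(targets)
```

In this sketch each position carries outputs from both heads and the target type selects which one is scored. A real model would more likely mask the unused head and might weight the two terms differently.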

arXiv information

Authors	Shengbang Tong, David Fan, Jiachen Zhu, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, Zhuang Liu
Published	2024-12-18 18:58:50+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
