AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling

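Summary

AnyGPT is an any-to-any multimodal language model that uses discrete representations for the unified processing of various modalities, including speech, text, images, and music.
AnyGPT can be trained stably without any changes to the current large language model (LLM) architecture or training paradigms.
Instead, it relies solely on data-level preprocessing, which makes integrating a new modality into an LLM as seamless as incorporating a new language.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
Using generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset.
It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, equipping the model to handle arbitrary combinations of multimodal inputs and outputs.
Experimental results show that AnyGPT can carry out any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, demonstrating that discrete representations can effectively and conveniently unify multiple modalities within a language model.
Demos are available at https://junzhan2000.github.io/AnyGPT.github.io/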

Abstract (original)

We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities, including speech, text, images, and music. AnyGPT can be trained stably without any alterations to the current large language model (LLM) architecture or training paradigms. Instead, it relies exclusively on data-level preprocessing, facilitating the seamless integration of new modalities into LLMs, akin to the incorporation of new languages. We build a multimodal text-centric dataset for multimodal alignment pre-training. Utilizing generative models, we synthesize the first large-scale any-to-any multimodal instruction dataset. It consists of 108k samples of multi-turn conversations that intricately interweave various modalities, thus equipping the model to handle arbitrary combinations of multimodal inputs and outputs. Experimental results demonstrate that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities, proving that discrete representations can effectively and conveniently unify multiple modalities within a language model. Demos are shown in https://junzhan2000.github.io/AnyGPT.github.io/
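
The abstract's core idea, treating a new modality like a new language through data-level preprocessing alone, can be illustrated with a toy shared vocabulary. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the tokenizers or the vocabulary layout, so the class `UnifiedVocab`, the boundary ids, and all sizes here are hypothetical. It assumes each modality already has some tokenizer that emits integer codes, and shows how those codes could be shifted into disjoint id ranges so that one flat sequence can mix text and other modalities.

```python
class UnifiedVocab:
    """Toy shared vocabulary: text ids first, then one block of ids per modality.

    Hypothetical illustration only; names and layout are assumptions.
    """

    def __init__(self, text_vocab_size):
        self.size = text_vocab_size  # ids [0, text_vocab_size) belong to text
        self.blocks = {}             # modality -> (start_id, end_id, offset, codebook_size)

    def add_modality(self, name, codebook_size):
        """Append two boundary ids and `codebook_size` code ids for a new modality."""
        start_id, end_id, offset = self.size, self.size + 1, self.size + 2
        self.blocks[name] = (start_id, end_id, offset, codebook_size)
        self.size = offset + codebook_size

    def encode(self, name, codes):
        """Map modality-local codes to shared ids, wrapped in boundary ids."""
        start_id, end_id, offset, codebook_size = self.blocks[name]
        if any(not 0 <= c < codebook_size for c in codes):
            raise ValueError(f"code out of range for modality {name!r}")
        return [start_id] + [offset + c for c in codes] + [end_id]


# Hypothetical sizes, chosen small so the id ranges are easy to read.
vocab = UnifiedVocab(text_vocab_size=100)
vocab.add_modality("image", codebook_size=8)
vocab.add_modality("speech", codebook_size=4)

text_ids = [5, 17, 42]  # stand-in for text token ids from a text tokenizer
sequence = text_ids + vocab.encode("image", [0, 7]) + vocab.encode("speech", [3])
print(sequence)
```

In this sketch the language model itself is untouched apart from a larger vocabulary: adding a modality only reserves a new id range, which mirrors the abstract's claim that no architectural change is needed.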

arXiv information

Authors	Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, Hang Yan, Jie Fu, Tao Gui, Tianxiang Sun, Yugang Jiang, Xipeng Qiu
Published	2024-02-19 15:33:10+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
