TEAL: Tokenize and Embed ALL for Multi-modal Large Language Models

Summary

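Multi-modal Large Language Models (MM-LLMs) have made remarkable progress recently, but they still struggle to efficiently model the interactions among multi-modal inputs and to generate output in non-textual modalities.
This work proposes TEAL (Tokenize and Embed ALl), an approach that treats input from any modality as a token sequence and learns a joint embedding space for all modalities.
Specifically, for input from any modality, TEAL first discretizes it into a token sequence with an off-the-shelf tokenizer, then embeds that token sequence into the joint embedding space with a learnable embedding matrix.
The MM-LLM then only needs to predict the multi-modal tokens autoregressively, just as a textual LLM does.
Finally, the corresponding de-tokenizer is applied to generate the output in each modality from the predicted token sequence.
With the joint embedding space, TEAL enables a frozen LLM to perform both understanding and generation tasks involving non-textual modalities such as image and audio.
The textual LLM therefore acts simply as an interface and retains its high performance in textual understanding and generation.
Experiments show that TEAL achieves substantial improvements in multi-modal understanding and implements a simple scheme for multi-modal generation.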
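
The tokenize, embed, and de-tokenize steps described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the tiny text vocabulary, the uniform intensity quantizer standing in for an off-the-shelf image tokenizer, the embedding size, and all function names are hypothetical. The autoregressive LLM itself is omitted; the sketch only shows how tokens from two modalities can share one id space and one embedding matrix.

```python
import random

# Hypothetical toy vocabularies; a real system would use off-the-shelf tokenizers.
TEXT_VOCAB = ["<bos>", "a", "photo", "of", "cat"]
NUM_IMAGE_CODES = 4                 # size of the toy image codebook
IMAGE_OFFSET = len(TEXT_VOCAB)      # image ids follow text ids in the joint vocabulary
JOINT_VOCAB_SIZE = len(TEXT_VOCAB) + NUM_IMAGE_CODES
EMBED_DIM = 8                       # arbitrary embedding width for the sketch


def tokenize_text(sentence):
    """Map whitespace-separated words to text token ids."""
    return [TEXT_VOCAB.index(word) for word in sentence.split()]


def tokenize_image(pixels):
    """Quantize intensities in [0, 1] into codebook indices in the joint id space."""
    codes = [min(int(p * NUM_IMAGE_CODES), NUM_IMAGE_CODES - 1) for p in pixels]
    return [IMAGE_OFFSET + c for c in codes]


def detokenize_image(token_ids):
    """Map image token ids back to the centre of their quantization bin."""
    return [(t - IMAGE_OFFSET + 0.5) / NUM_IMAGE_CODES for t in token_ids]


def make_embedding_matrix(seed=0):
    """Randomly initialised stand-in for the learnable joint embedding matrix."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(EMBED_DIM)]
            for _ in range(JOINT_VOCAB_SIZE)]


def embed(token_ids, matrix):
    """Look up one embedding row per token id, regardless of modality."""
    return [matrix[t] for t in token_ids]


text_ids = tokenize_text("a photo of cat")
image_ids = tokenize_image([0.1, 0.4, 0.6, 0.9])
sequence = text_ids + image_ids          # one token sequence across modalities
embeddings = embed(sequence, make_embedding_matrix())
reconstructed = detokenize_image(image_ids)
```

In this sketch the image ids are simply offset past the text ids, so a single lookup table serves both modalities; a model predicting ids at or above `IMAGE_OFFSET` would hand them to `detokenize_image` to produce output in the image modality.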

Summary (original)

Although Multi-modal Large Language Models (MM-LLMs) have made exciting strides recently, they still struggle to efficiently model the interactions among multi-modal inputs and the generation in non-textual modalities. In this work, we propose TEAL (Tokenize and Embed ALl), an approach to treat the input from any modality as a token sequence and learn a joint embedding space for all modalities. Specifically, for the input from any modality, TEAL first discretizes it into a token sequence with the off-the-shelf tokenizer and embeds the token sequence into a joint embedding space with a learnable embedding matrix. MM-LLMs just need to predict the multi-modal tokens autoregressively as the textual LLMs do. Finally, the corresponding de-tokenizer is applied to generate the output in each modality based on the predicted token sequence. With the joint embedding space, TEAL enables the frozen LLMs to perform both understanding and generation tasks involving non-textual modalities, such as image and audio. Thus, the textual LLM can just work as an interface and maintain its high performance in textual understanding and generation. Experiments show that TEAL achieves substantial improvements in multi-modal understanding, and implements a simple scheme for multi-modal generation.

arXiv information

Authors	Zhen Yang, Yingxue Zhang, Fandong Meng, Jie Zhou
Published	2023-11-08 10:34:16+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
