Spider: Any-to-Many Multimodal LLM

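Summary

Multimodal LLMs (MLLMs) have emerged as an extension of large language models (LLMs) and make it possible to integrate a variety of modalities.
However, Any-to-Any MLLMs can generate only a pairwise modality combination, 'Text + X', within a single response, such as Text + {Image or Audio or Video}.
To address this limitation, we introduce Spider, a new and efficient Any-to-Many Modalities Generation (AMMG) framework that can generate an arbitrary combination of modalities, 'Text + Xs', such as Text + {Image and Audio and Video}.
To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a new Efficient Decoders-Controller that controls the multimodal decoders to generate Xs (many-modal) content, and an Any-to-Many Instruction Template designed to produce Xs signal prompts.
To train Spider, we constructed a new Text-formatted Many-Modal (TMM) dataset, which facilitates learning the X-to-Xs (i.e., Any-to-Many) capability that AMMG requires.
Finally, the well-trained Spider generates a pseudo X-to-Xs dataset, the first X-to-Xs many-modal dataset, which increases the potential of the AMMG task for future research.
Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.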

Summary (original)

Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities ‘Text + X’ within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities ‘Text + Xs’, such as Text + {Image and Audio and Video}. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents, and an Any-to-Many Instruction Template designed for producing Xs signal prompts. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates the learning of the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for AMMG task in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.
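
As a rough illustration of the gap between pairwise 'Text + X' output and many-modal 'Text + Xs' output, the sketch below simply enumerates the possible response shapes. It assumes only the three example modalities named in the abstract (Image, Audio, Video); the function names are hypothetical and are not taken from the Spider codebase, and the set of modalities Spider actually supports may differ.

```python
from itertools import combinations

# Example modalities taken from the abstract; this is an assumption,
# not the definitive list of modalities that Spider supports.
MODALITIES = ("Image", "Audio", "Video")


def pairwise_outputs(modalities):
    """Any-to-Any style: text plus exactly one other modality per response."""
    return [("Text", m) for m in modalities]


def many_modal_outputs(modalities):
    """Any-to-Many style: text plus any non-empty combination of modalities."""
    return [
        ("Text", *combo)
        for size in range(1, len(modalities) + 1)
        for combo in combinations(modalities, size)
    ]


if __name__ == "__main__":
    for shape in many_modal_outputs(MODALITIES):
        print(" + ".join(shape))
```

With three modalities, the pairwise setting allows three response shapes, while the many-modal setting allows seven, including the full Text + Image + Audio + Video combination.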

arXiv information

Authors	Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo
Published	2024-11-14 16:58:19+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
