Spider: Any-to-Many Multimodal LLM

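Summary

Multimodal LLMs (MLLMs) have emerged as an extension of large language models (LLMs), enabling the integration of various modalities.
However, existing Any-to-Any MLLMs are limited to generating a pairwise 'Text + X' combination within a single response, such as Text + {Image or Audio or Video}.
To address this limitation, we introduce Spider, a novel and efficient Any-to-Many Modalities Generation (AMMG) framework that can generate arbitrary 'Text + Xs' combinations of modalities, such as Text + {Image and Audio and Video}.
To achieve efficient AMMG, Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, an Any-to-Many Instruction Template designed to produce Xs signal prompts, and a novel Efficient Decoders-Controller that controls the multimodal decoders to generate Xs (many-modal) content.
To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates learning the X-to-Xs (i.e., Any-to-Many) capability required for AMMG.
Finally, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for AMMG tasks in future research.
Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field.
Code: https://github.com/Layjins/Spider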
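
As a rough illustration of the any-to-many idea in the summary, the following Python sketch parses one text-formatted response that carries signal prompts for several modalities and hands each prompt to a decoder. The tag format (`<IMAGE>...</IMAGE>`), the names `parse_many_modal` and `dispatch`, and the stub decoders are all hypothetical; they are not taken from the Spider paper or its repository and only show how a single text response could drive more than one decoder in the same turn.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical tag format <MODALITY>prompt</MODALITY>; not Spider's actual template.
SIGNAL_RE = re.compile(r"<(IMAGE|AUDIO|VIDEO)>(.*?)</\1>", re.DOTALL)


def parse_many_modal(response: str) -> Tuple[str, List[Tuple[str, str]]]:
    """Split a response into plain text and a list of (modality, prompt) signals."""
    signals = [(m.group(1), m.group(2).strip()) for m in SIGNAL_RE.finditer(response)]
    # Remove the tagged spans, then collapse the leftover whitespace.
    text = " ".join(SIGNAL_RE.sub("", response).split())
    return text, signals


def dispatch(
    signals: List[Tuple[str, str]],
    decoders: Dict[str, Callable[[str], str]],
) -> List[str]:
    """Send each signal prompt to the decoder registered for its modality."""
    return [decoders[modality](prompt) for modality, prompt in signals]


# Stub decoders standing in for real image/audio/video generators (assumption).
DECODERS: Dict[str, Callable[[str], str]] = {
    "IMAGE": lambda p: f"[image generated from: {p}]",
    "AUDIO": lambda p: f"[audio generated from: {p}]",
    "VIDEO": lambda p: f"[video generated from: {p}]",
}

response = (
    "Here is a beach scene. "
    "<IMAGE>a sunny beach</IMAGE> <AUDIO>ocean waves</AUDIO>"
)
text, signals = parse_many_modal(response)
outputs = dispatch(signals, DECODERS)
```

In this toy example a single response yields the text plus two generated modalities, which is the 'Text + Xs' pattern as opposed to the pairwise 'Text + X' pattern.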

Summary (original)

Multimodal LLMs (MLLMs) have emerged as an extension of Large Language Models (LLMs), enabling the integration of various modalities. However, Any-to-Any MLLMs are limited to generating pairwise modalities ‘Text + X’ within a single response, such as Text + {Image or Audio or Video}. To address this limitation, we introduce Spider, a novel efficient Any-to-Many Modalities Generation (AMMG) framework, which can generate an arbitrary combination of modalities ‘Text + Xs’, such as Text + {Image and Audio and Video}. To achieve efficient AMMG, our Spider integrates three core components: a Base Model for basic X-to-X (i.e., Any-to-Any) modality processing, an Any-to-Many Instruction Template designed for producing Xs signal prompts, and a novel Efficient Decoders-Controller for controlling multimodal Decoders to generate Xs (many-modal) contents. To train Spider, we constructed a novel Text-formatted Many-Modal (TMM) dataset, which facilitates learning the X-to-Xs (i.e., Any-to-Many) capability necessary for AMMG. Ultimately, the well-trained Spider generates a pseudo X-to-Xs dataset, the first-ever X-to-Xs many-modal dataset, enhancing the potential for AMMG tasks in future research. Overall, this work not only pushes the boundary of multimodal interaction but also provides rich data support for advancing the field. Code: https://github.com/Layjins/Spider

arXiv information

Authors	Jinxiang Lai, Jie Zhang, Jun Liu, Jian Li, Xiaocheng Lu, Song Guo
Published	2025-04-07 16:13:38+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
