OneLLM: One Framework to Align All Modalities with Language

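Abstract

Multimodal large language models (MLLMs) have attracted significant attention for their strong multimodal understanding capabilities.
However, existing work relies heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities.
This paper presents OneLLM, an MLLM that aligns eight modalities to language using a unified framework.
This is achieved through a unified multimodal encoder and a progressive multimodal alignment pipeline.
Specifically, an image projection module is first trained to connect a vision encoder to the LLM.
Next, a universal projection module (UPM) is built by mixing multiple image projection modules with dynamic routing.
Finally, more modalities are progressively aligned to the LLM using the UPM.
To fully exploit OneLLM's instruction-following potential, the authors also curated a comprehensive multimodal instruction dataset of 2M items covering image, audio, video, point cloud, depth/normal map, IMU, and fMRI brain activity.
OneLLM is evaluated on 25 diverse benchmarks covering tasks such as multimodal captioning, question answering, and reasoning, where it delivers excellent performance.
The code, data, models, and an online demo are available at https://github.com/csuhan/OneLLM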
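
The idea of a universal projection module built from several projection modules plus dynamic routing can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the OneLLM code: the class name `UniversalProjection`, the use of plain linear maps as stand-ins for the projection modules, the softmax (soft) routing rule, and all dimensions are hypothetical. The abstract itself does not specify how the routing weights are computed.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def random_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class UniversalProjection:
    """Hypothetical soft mixture of K projection experts (illustrative only)."""

    def __init__(self, in_dim, out_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Each expert is a linear map here; a real projection module
        # would be a larger network.
        self.experts = [random_matrix(out_dim, in_dim, rng)
                        for _ in range(num_experts)]
        # The router scores every expert from the input token.
        self.router = random_matrix(num_experts, in_dim, rng)

    def route(self, token):
        """Return one non-negative weight per expert, summing to 1."""
        return softmax(matvec(self.router, token))

    def project(self, token):
        """Weighted sum of all expert outputs for a single token."""
        weights = self.route(token)
        outputs = [matvec(expert, token) for expert in self.experts]
        out_dim = len(outputs[0])
        return [sum(w * out[i] for w, out in zip(weights, outputs))
                for i in range(out_dim)]


upm = UniversalProjection(in_dim=4, out_dim=3, num_experts=3)
token = [0.5, -1.0, 0.25, 2.0]  # a made-up encoder feature vector
weights = upm.route(token)
projected = upm.project(token)
```

In this sketch the routing weights depend on the input token, so tokens from different modalities could in principle blend the experts differently while sharing one set of projection parameters.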

Abstract (original)

Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However, existing works rely heavily on modality-specific encoders, which usually differ in architecture and are limited to common modalities. In this paper, we present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail, we first train an image projection module to connect a vision encoder with LLM. Then, we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally, we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions, we also curated a comprehensive multimodal instruction dataset, including 2M items from image, audio, video, point cloud, depth/normal map, IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning, where it delivers excellent performance. Code, data, model and online demo are available at https://github.com/csuhan/OneLLM

arXiv information

Authors	Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
Published	2023-12-06 18:59:19+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
