ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst

Summary

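Building general-purpose models that can perceive diverse real-world modalities and solve a variety of tasks is an appealing goal in artificial intelligence.
This paper presents ChatBridge, a novel multimodal language model that uses the expressive power of language as a catalyst to bridge the gap between modalities.
The authors show that two-modality data in which each modality is paired with language is sufficient to connect all modalities; no direct pairing between, for example, images and audio is required.
ChatBridge builds on recent large language models (LLMs) and extends their zero-shot capabilities to accept diverse multimodal inputs.
ChatBridge is trained in two stages.
The first stage aligns each modality with language, which gives rise to emergent multimodal correlation and collaboration abilities.
The second stage instruction-finetunes ChatBridge to align it with user intent, using a newly proposed multimodal instruction-tuning dataset named MULTIS, which covers 16 multimodal tasks across the text, image, video, and audio modalities.
The authors report strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio.
All code, data, and models for ChatBridge will be open-sourced.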
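
The claim that language-paired data alone can connect all modalities can be illustrated with a toy sketch. This is not ChatBridge's training code: the scalar "encoders", the sample values, and the use of a closed-form least-squares fit in place of gradient-based training are all assumptions made for illustration. It fits one projection per modality toward a shared language space using only image-text and audio-text pairs, then checks that an image and an audio clip of the same unseen concept land at the same point even though no image-audio pair was ever used.

```python
# Toy illustration only: scalar "features" stand in for encoder outputs.
# Every name and number here is hypothetical, not taken from ChatBridge.

def fit_line(xs, ys):
    """Least-squares fit of y = w * x + c for scalar data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x
    return w, mean_y - w * mean_x

# Hypothetical frozen encoders: each maps a latent concept z into its own space.
def image_encoder(z):
    return 3.0 * z + 1.0

def audio_encoder(z):
    return -0.5 * z + 4.0

# The language embedding of a concept is taken to be z itself.
image_text_concepts = [0.0, 1.0, 2.0, 3.0]      # concepts seen in image-text pairs
audio_text_concepts = [10.0, 11.0, 12.5, 14.0]  # concepts seen in audio-text pairs

# Alignment step: fit one projection per modality toward the language space.
img_w, img_c = fit_line(
    [image_encoder(z) for z in image_text_concepts], image_text_concepts
)
aud_w, aud_c = fit_line(
    [audio_encoder(z) for z in audio_text_concepts], audio_text_concepts
)

def project_image(x):
    return img_w * x + img_c

def project_audio(x):
    return aud_w * x + aud_c

# A concept that appears in neither training set and in no image-audio pair.
z_new = 7.0
image_in_language_space = project_image(image_encoder(z_new))
audio_in_language_space = project_audio(audio_encoder(z_new))
gap = abs(image_in_language_space - audio_in_language_space)
print(f"image -> {image_in_language_space:.3f}, "
      f"audio -> {audio_in_language_space:.3f}, gap = {gap:.2e}")
```

Because both projections target the same language space, the two modalities agree on the unseen concept up to floating-point error. Real encoders are nonlinear and noisy, so the effect in practice is approximate rather than exact.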

Abstract (original)

Building general-purpose models that can perceive diverse real-world modalities and solve various tasks is an appealing target in artificial intelligence. In this paper, we present ChatBridge, a novel multimodal language model that leverages the expressive capabilities of language as the catalyst to bridge the gap between various modalities. We show that only language-paired two-modality data is sufficient to connect all modalities. ChatBridge leverages recent large language models (LLM) and extends their zero-shot capabilities to incorporate diverse multimodal inputs. ChatBridge undergoes a two-stage training. The first stage aligns each modality with language, which brings emergent multimodal correlation and collaboration abilities. The second stage instruction-finetunes ChatBridge to align it with user intent with our newly proposed multimodal instruction tuning dataset, named MULTIS, which covers a wide range of 16 multimodal tasks of text, image, video, and audio modalities. We show strong quantitative and qualitative results on zero-shot multimodal tasks covering text, image, video, and audio modalities. All codes, data, and models of ChatBridge will be open-sourced.

arXiv information

Authors	Zijia Zhao, Longteng Guo, Tongtian Yue, Sihan Chen, Shuai Shao, Xinxin Zhu, Zehuan Yuan, Jing Liu
Published	2023-05-25 14:34:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
