S3: A Simple Strong Sample-effective Multimodal Dialog System

Abstract

In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
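The architecture named in the abstract (modality encoder, trainable modality projector, language model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the projector is modeled as a single linear layer, the projected modality tokens are simply placed before the text token embeddings, and the names `make_projector`, `project`, and `build_llm_input` are hypothetical. The abstract does not state the projector's actual form or how the sequences are combined.

```python
import random


def make_projector(in_dim, out_dim, seed=0):
    """Create a hypothetical linear projector as (weights, bias).

    Assumption: a single linear layer; the abstract only says the
    projector is trainable, not what it looks like.
    """
    rng = random.Random(seed)
    scale = in_dim ** -0.5
    weights = [
        [rng.uniform(-scale, scale) for _ in range(in_dim)]
        for _ in range(out_dim)
    ]
    bias = [0.0] * out_dim
    return weights, bias


def project(features, projector):
    """Map each encoder feature vector (length in_dim) to a vector of
    length out_dim, i.e. the language model's embedding size."""
    weights, bias = projector
    return [
        [sum(w * x for w, x in zip(row, vec)) + b
         for row, b in zip(weights, bias)]
        for vec in features
    ]


def build_llm_input(text_embeddings, modality_features, projector):
    """Assumed combination scheme: projected modality tokens first,
    then the text token embeddings, as one input sequence."""
    return project(modality_features, projector) + text_embeddings


# Stand-ins for encoder output (2 tokens, dim 4) and text embeddings
# (1 token, dim 3); real encoders and a real language model would
# replace these placeholder values.
projector = make_projector(in_dim=4, out_dim=3)
image_features = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
text_embeddings = [[0.1, 0.2, 0.3]]
sequence = build_llm_input(text_embeddings, image_features, projector)
print(len(sequence), len(sequence[0]))
```

In this sketch only the projector holds parameters that would be updated during training; the encoder outputs and text embeddings are treated as given inputs.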

arXiv information

Authors	Elisei Rykov, Egor Malkershin, Alexander Panchenko
Published	2024-06-26 12:45:43+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
