DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention

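Summary

Most existing multi-modal models cannot adequately handle interleaved image-and-text inputs in multi-image, multi-round dialogues. They also face substantial constraints on training resources and data accessibility, which limits their adaptability and scalability across interaction domains.
To address this, we present DeepSpeed-VisualChat, a framework designed to extend large language models (LLMs) with multi-modal capabilities, with a focus on improving how large vision and language models handle interleaved inputs.
The framework is notable for (1) open-source support for multi-round, multi-image dialogues, (2) a new multi-modal causal attention mechanism, and (3) data blending techniques applied to existing datasets to ensure seamless interaction in multi-round, multi-image conversations.
Compared with existing frameworks, DeepSpeed-VisualChat scales better, up to a 70B-parameter language model, which represents a significant advance in multi-modal language models and lays a solid foundation for future work.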

Abstract (original)

Most of the existing multi-modal models, hindered by their incapacity to adeptly manage interleaved image-and-text inputs in multi-image, multi-round dialogues, face substantial constraints in resource allocation for training and data accessibility, impacting their adaptability and scalability across varied interaction realms. To address this, we present the DeepSpeed-VisualChat framework, designed to optimize Large Language Models (LLMs) by incorporating multi-modal capabilities, with a focus on enhancing the proficiency of Large Vision and Language Models in handling interleaved inputs. Our framework is notable for (1) its open-source support for multi-round and multi-image dialogues, (2) introducing an innovative multi-modal causal attention mechanism, and (3) utilizing data blending techniques on existing datasets to assure seamless interactions in multi-round, multi-image conversations. Compared to existing frameworks, DeepSpeed-VisualChat shows superior scalability up to 70B parameter language model size, representing a significant advancement in multi-modal language models and setting a solid foundation for future explorations.

arXiv information

Authors	Zhewei Yao, Xiaoxia Wu, Conglong Li, Minjia Zhang, Heyang Qi, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He
Published	2023-09-25 17:53:29+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
