DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

Summary

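This paper introduces DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) vision-language models that substantially improves on its predecessor, DeepSeek-VL, through two major upgrades.
For the vision component, it adopts a dynamic tiling vision encoding strategy designed to process high-resolution images with varying aspect ratios.
For the language component, it uses DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses the Key-Value cache into latent vectors, enabling efficient inference and high throughput.
Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across a range of tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding.
The series comprises three variants, DeepSeek-VL2-Tiny, DeepSeek-VL2-Small, and DeepSeek-VL2, with 1.0B, 2.8B, and 4.5B activated parameters, respectively.
Compared with existing open-source dense and MoE-based models, DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters.
Code and pre-trained models are publicly available at https://github.com/deepseek-ai/DeepSeek-VL2.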
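
The summary names a dynamic tiling strategy for high-resolution images with varying aspect ratios but does not spell out the procedure. The following is a minimal sketch of one plausible way to pick a tile grid, not the authors' implementation. The tile size of 384 pixels, the limit of 9 tiles, and the helper names `candidate_grids`, `padded_area`, `choose_grid`, and `tile_boxes` are all assumptions made for illustration.

```python
def candidate_grids(max_tiles):
    """List every (cols, rows) grid whose tile count stays within max_tiles."""
    return [
        (cols, rows)
        for cols in range(1, max_tiles + 1)
        for rows in range(1, max_tiles + 1)
        if cols * rows <= max_tiles
    ]


def padded_area(width, height, cols, rows, tile):
    """Canvas area left empty after an aspect-preserving resize into the grid."""
    canvas_w, canvas_h = cols * tile, rows * tile
    scale = min(canvas_w / width, canvas_h / height)
    return canvas_w * canvas_h - (width * scale) * (height * scale)


def choose_grid(width, height, tile=384, max_tiles=9):
    """Pick the grid with the least padding; ties go to the grid with fewer tiles.

    tile=384 and max_tiles=9 are hypothetical defaults, not values from the source.
    """
    return min(
        candidate_grids(max_tiles),
        key=lambda g: (padded_area(width, height, g[0], g[1], tile), g[0] * g[1]),
    )


def tile_boxes(cols, rows, tile):
    """Return (left, top, right, bottom) crop boxes in row-major order."""
    return [
        (c * tile, r * tile, (c + 1) * tile, (r + 1) * tile)
        for r in range(rows)
        for c in range(cols)
    ]


if __name__ == "__main__":
    grid = choose_grid(768, 384)  # a 2:1 landscape image
    print(grid, tile_boxes(*grid, 384))
```

Under these assumptions, each crop box would be passed to a fixed-resolution vision encoder, so a wide image yields a wide grid and a tall image a tall one. Breaking ties toward fewer tiles keeps the number of visual tokens low when several grids fit equally well.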

Abstract (original)

We present DeepSeek-VL2, an advanced series of large Mixture-of-Experts (MoE) Vision-Language Models that significantly improves upon its predecessor, DeepSeek-VL, through two key major upgrades. For the vision component, we incorporate a dynamic tiling vision encoding strategy designed for processing high-resolution images with different aspect ratios. For the language component, we leverage DeepSeekMoE models with the Multi-head Latent Attention mechanism, which compresses Key-Value cache into latent vectors, to enable efficient inference and high throughput. Trained on an improved vision-language dataset, DeepSeek-VL2 demonstrates superior capabilities across various tasks, including but not limited to visual question answering, optical character recognition, document/table/chart understanding, and visual grounding. Our model series is composed of three variants: DeepSeek-VL2-Tiny, DeepSeek-VL2-Small and DeepSeek-VL2, with 1.0B, 2.8B and 4.5B activated parameters respectively. DeepSeek-VL2 achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models. Codes and pre-trained models are publicly accessible at https://github.com/deepseek-ai/DeepSeek-VL2.
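
The abstract states that Multi-head Latent Attention compresses the Key-Value cache into latent vectors. As a rough illustration of why that helps inference, the sketch below compares per-token cache sizes for a standard multi-head cache and a single cached latent vector per layer. All dimensions (24 layers, 16 heads, head size 128, latent size 512, 2 bytes per value) are hypothetical and not taken from the paper, and the sketch ignores any extra per-token state a real implementation may keep.

```python
def standard_cache_bytes_per_token(num_layers, num_heads, head_dim, bytes_per_value=2):
    """Standard attention caches one key and one value vector per head per layer."""
    return 2 * num_layers * num_heads * head_dim * bytes_per_value


def latent_cache_bytes_per_token(num_layers, latent_dim, bytes_per_value=2):
    """A latent cache stores one compressed vector per layer instead."""
    return num_layers * latent_dim * bytes_per_value


if __name__ == "__main__":
    # Hypothetical dimensions chosen only to make the arithmetic concrete.
    standard = standard_cache_bytes_per_token(num_layers=24, num_heads=16, head_dim=128)
    latent = latent_cache_bytes_per_token(num_layers=24, latent_dim=512)
    print(standard, latent, standard / latent)
```

With these made-up numbers the latent cache is one eighth the size of the standard cache, which is the kind of saving that allows longer contexts or larger batches in the same memory.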

arXiv information

Authors	Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, Chong Ruan
Published	2024-12-13 17:37:48+00:00
arXiv page	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
