DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

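Summary

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference.
It has 236B total parameters, of which 21B are activated for each token, and it supports a context length of 128K tokens.
DeepSeek-V2 adopts innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE.
MLA enables efficient inference by significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE makes it possible to train strong models at an economical cost through sparse computation.
Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance while saving 42.5% of training costs, reducing the KV cache by 93.3%, and raising the maximum generation throughput to 5.76 times.
We pretrain DeepSeek-V2 on a high-quality, multi-source corpus of 8.1T tokens, then apply Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential.
Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.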

Summary (original)

We present DeepSeek-V2, a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It comprises 236B total parameters, of which 21B are activated for each token, and supports a context length of 128K tokens. DeepSeek-V2 adopts innovative architectures including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA guarantees efficient inference through significantly compressing the Key-Value (KV) cache into a latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation. Compared with DeepSeek 67B, DeepSeek-V2 achieves significantly stronger performance, and meanwhile saves 42.5% of training costs, reduces the KV cache by 93.3%, and boosts the maximum generation throughput to 5.76 times. We pretrain DeepSeek-V2 on a high-quality and multi-source corpus consisting of 8.1T tokens, and further perform Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) to fully unlock its potential. Evaluation results show that, even with only 21B activated parameters, DeepSeek-V2 and its chat versions still achieve top-tier performance among open-source models.
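To make the headline figures easier to compare, the short Python sketch below only restates the numbers quoted in the abstract as ratios. The function names are illustrative, and the derived values (for example, the KV cache shrink factor) are plain arithmetic on the stated percentages, not additional results reported by the authors.

```python
# Figures quoted in the abstract; nothing here is measured independently.
TOTAL_PARAMS_B = 236          # total parameters, in billions
ACTIVE_PARAMS_B = 21          # parameters activated per token, in billions
KV_CACHE_REDUCTION = 0.933    # KV cache reduction relative to DeepSeek 67B
TRAINING_COST_SAVING = 0.425  # training cost saving relative to DeepSeek 67B


def active_fraction(active_b: float, total_b: float) -> float:
    """Share of the total parameters that is activated for each token."""
    return active_b / total_b


def remaining_fraction(reduction: float) -> float:
    """Fraction left after a relative reduction (0.933 means 93.3% removed)."""
    return 1.0 - reduction


def shrink_factor(reduction: float) -> float:
    """How many times smaller a quantity becomes after the given reduction."""
    return 1.0 / remaining_fraction(reduction)


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    print(f"Activated share of parameters: {frac:.1%}")
    print(f"KV cache remaining: {remaining_fraction(KV_CACHE_REDUCTION):.1%}")
    print(f"KV cache shrink factor: {shrink_factor(KV_CACHE_REDUCTION):.1f}x")
    print(f"Training cost remaining: {remaining_fraction(TRAINING_COST_SAVING):.1%}")
```

Read this way, roughly 9% of the parameters are active for any given token, and a 93.3% KV cache reduction corresponds to a cache about 15 times smaller than the DeepSeek 67B baseline.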

arXiv information

Authors	DeepSeek-AI,Aixin Liu,Bei Feng,Bin Wang,Bingxuan Wang,Bo Liu,Chenggang Zhao,Chengqi Deng,Chong Ruan,Damai Dai,Daya Guo,Dejian Yang,Deli Chen,Dongjie Ji,Erhang Li,Fangyun Lin,Fuli Luo,Guangbo Hao,Guanting Chen,Guowei Li,H. Zhang,Hanwei Xu,Hao Yang,Haowei Zhang,Honghui Ding,Huajian Xin,Huazuo Gao,Hui Li,Hui Qu,J. L. Cai,Jian Liang,Jianzhong Guo,Jiaqi Ni,Jiashi Li,Jin Chen,Jingyang Yuan,Junjie Qiu,Junxiao Song,Kai Dong,Kaige Gao,Kang Guan,Lean Wang,Lecong Zhang,Lei Xu,Leyi Xia,Liang Zhao,Liyue Zhang,Meng Li,Miaojun Wang,Mingchuan Zhang,Minghua Zhang,Minghui Tang,Mingming Li,Ning Tian,Panpan Huang,Peiyi Wang,Peng Zhang,Qihao Zhu,Qinyu Chen,Qiushi Du,R. J. Chen,R. L. Jin,Ruiqi Ge,Ruizhe Pan,Runxin Xu,Ruyi Chen,S. S. Li,Shanghao Lu,Shangyan Zhou,Shanhuang Chen,Shaoqing Wu,Shengfeng Ye,Shirong Ma,Shiyu Wang,Shuang Zhou,Shuiping Yu,Shunfeng Zhou,Size Zheng,T. Wang,Tian Pei,Tian Yuan,Tianyu Sun,W. L. Xiao,Wangding Zeng,Wei An,Wen Liu,Wenfeng Liang,Wenjun Gao,Wentao Zhang,X. Q. Li,Xiangyue Jin,Xianzu Wang,Xiao Bi,Xiaodong Liu,Xiaohan Wang,Xiaojin Shen,Xiaokang Chen,Xiaosha Chen,Xiaotao Nie,Xiaowen Sun,Xiaoxiang Wang,Xin Liu,Xin Xie,Xingkai Yu,Xinnan Song,Xinyi Zhou,Xinyu Yang,Xuan Lu,Xuecheng Su,Y. Wu,Y. K. Li,Y. X. Wei,Y. X. Zhu,Yanhong Xu,Yanping Huang,Yao Li,Yao Zhao,Yaofeng Sun,Yaohui Li,Yaohui Wang,Yi Zheng,Yichao Zhang,Yiliang Xiong,Yilong Zhao,Ying He,Ying Tang,Yishi Piao,Yixin Dong,Yixuan Tan,Yiyuan Liu,Yongji Wang,Yongqiang Guo,Yuchen Zhu,Yuduan Wang,Yuheng Zou,Yukun Zha,Yunxian Ma,Yuting Yan,Yuxiang You,Yuxuan Liu,Z. Z. Ren,Zehui Ren,Zhangli Sha,Zhe Fu,Zhen Huang,Zhen Zhang,Zhenda Xie,Zhewen Hao,Zhihong Shao,Zhiniu Wen,Zhipeng Xu,Zhongyu Zhang,Zhuoshu Li,Zihan Wang,Zihui Gu,Zilin Li,Ziwei Xie
Published	2024-05-24 15:24:58+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
