Mixtral of Experts

Summary

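We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model.
Mixtral has the same architecture as Mistral 7B, except that each layer is composed of 8 feedforward blocks (i.e. experts).
For every token, at each layer, a router network selects two experts to process the current state and combines their outputs.
Although each token only sees two experts, the selected experts can differ at each timestep.
As a result, each token has access to 47B parameters, but only 13B active parameters are used during inference.
Mixtral was trained with a context size of 32k tokens, and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks.
In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks.
We also provide a model fine-tuned to follow instructions, Mixtral 8x7B – Instruct, that surpasses the GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B chat models on human benchmarks.
Both the base and instruct models are released under the Apache 2.0 license.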
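
The per-token routing described in the summary can be illustrated with a small sketch. It is not the authors' implementation. It assumes a linear router and a weighted sum of expert outputs with softmax weights over the two selected logits; the summary itself only says the outputs are combined. The names `top2_moe_layer`, `make_expert`, and `gate_weights` are hypothetical, and the toy experts simply scale their input instead of being real feedforward blocks.

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Hypothetical stand-in for a feedforward block: scales its input."""
    return lambda x: [scale * v for v in x]


def top2_moe_layer(x, gate_weights, experts, k=2):
    """Route one token's hidden state x through its top-k experts.

    gate_weights holds one weight vector per expert (a linear router).
    Returns the combined output, the chosen expert indices, and their weights.
    """
    # Router logits: one score per expert.
    logits = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    # Indices of the k highest-scoring experts, best first.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: mixing weights are a softmax over the selected logits only.
    weights = softmax([logits[i] for i in chosen])
    # Weighted sum of the selected experts' outputs; the others never run.
    out = [0.0] * len(x)
    for weight, idx in zip(weights, chosen):
        expert_out = experts[idx](x)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out, chosen, weights


random.seed(0)
dim, n_experts = 4, 8
gate = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_experts)]
experts = [make_expert(float(i + 1)) for i in range(n_experts)]
out, chosen, weights = top2_moe_layer([0.5, -1.0, 2.0, 0.1], gate, experts)
print("chosen experts:", chosen)
print("mixing weights:", weights)
```

Because only the two selected experts are evaluated, the cost per token scales with two feedforward blocks even though eight are stored, which is the gap between accessible and active parameters that the summary describes.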

Summary (original)

We introduce Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral has the same architecture as Mistral 7B, with the difference that each layer is composed of 8 feedforward blocks (i.e. experts). For every token, at each layer, a router network selects two experts to process the current state and combine their outputs. Even though each token only sees two experts, the selected experts can be different at each timestep. As a result, each token has access to 47B parameters, but only uses 13B active parameters during inference. Mixtral was trained with a context size of 32k tokens and it outperforms or matches Llama 2 70B and GPT-3.5 across all evaluated benchmarks. In particular, Mixtral vastly outperforms Llama 2 70B on mathematics, code generation, and multilingual benchmarks. We also provide a model fine-tuned to follow instructions, Mixtral 8x7B – Instruct, that surpasses GPT-3.5 Turbo, Claude-2.1, Gemini Pro, and Llama 2 70B – chat model on human benchmarks. Both the base and instruct models are released under the Apache 2.0 license.
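
The 47B and 13B figures in the abstract can be roughly reproduced with back-of-the-envelope arithmetic. The sketch below is only an estimate: none of the hyperparameters appear in the abstract. They are assumed from the publicly released Mixtral 8x7B configuration, along with the assumptions of three bias-free weight matrices per expert, grouped-query attention, and untied input and output embeddings. The function name `count_params` is hypothetical.

```python
# Assumed hyperparameters (released model configuration, not stated in the abstract).
DIM = 4096          # model width
N_LAYERS = 32       # transformer layers
HIDDEN_DIM = 14336  # feedforward width inside each expert
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (grouped-query attention)
HEAD_DIM = 128      # width of each attention head
VOCAB = 32000       # vocabulary size
N_EXPERTS = 8       # experts stored per layer
TOP_K = 2           # experts used per token


def count_params(experts_counted):
    """Estimate parameters when `experts_counted` experts per layer are included."""
    expert = 3 * DIM * HIDDEN_DIM                   # three bias-free matrices per expert
    attention = (2 * DIM * N_HEADS * HEAD_DIM       # query and output projections
                 + 2 * DIM * N_KV_HEADS * HEAD_DIM)  # key and value projections
    router = DIM * N_EXPERTS                        # linear gate, always evaluated
    norms = 2 * DIM                                 # two normalization layers
    per_layer = experts_counted * expert + attention + router + norms
    embeddings = 2 * VOCAB * DIM                    # untied input and output embeddings
    final_norm = DIM
    return N_LAYERS * per_layer + embeddings + final_norm


total = count_params(N_EXPERTS)  # every expert is stored
active = count_params(TOP_K)     # only the routed experts run per token
print(f"total: {total / 1e9:.1f}B, active per token: {active / 1e9:.1f}B")
```

Under these assumptions the estimate comes to about 46.7B total and 12.9B active parameters, consistent with the rounded 47B and 13B quoted in the abstract. The active count is not simply one quarter of the total because the attention, router, and embedding weights are shared by every token.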

arXiv information

Authors	Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, William El Sayed
Published	2024-01-08 18:47:34+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
