Getting the most out of your tokenizer for pre-training and domain adaptation

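Summary

Tokenization is an understudied and often neglected component of modern LLMs.
Most published work uses a single tokenizer for all experiments, often borrowed from another model, without running ablations or analysis to optimize tokenization.
Moreover, the tokenizer is usually left unchanged when a base model is fine-tuned.
This paper shows that a tokenizer's size, pre-tokenization regular expression, and training data can significantly affect a model's generation speed, effective context size, memory usage, and downstream performance.
The authors train specialized Byte-Pair Encoding code tokenizers, run extensive ablations on how tokenizer design affects LLM performance on code generation tasks such as HumanEval and MBPP, and give recommendations for choosing tokenizer hyperparameters and for switching the tokenizer in a pre-trained LLM.
The experiments cover both models trained from scratch and models initialized from pre-trained checkpoints, which verifies that the findings apply to a wide range of use cases.
When fine-tuning on more than 50 billion tokens, the tokenizer of a pre-trained LLM can be specialized to obtain large gains in generation speed and effective context size.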

Abstract (original)

Tokenization is an understudied and often neglected component of modern LLMs. Most published works use a single tokenizer for all experiments, often borrowed from another model, without performing ablations or analysis to optimize tokenization. Moreover, the tokenizer is generally kept unchanged when fine-tuning a base model. In this paper, we show that the size, pre-tokenization regular expression, and training data of a tokenizer can significantly impact the model’s generation speed, effective context size, memory usage, and downstream performance. We train specialized Byte-Pair Encoding code tokenizers, and conduct extensive ablations on the impact of tokenizer design on the performance of LLMs for code generation tasks such as HumanEval and MBPP, and provide recommendations for tokenizer hyper-parameters selection and switching the tokenizer in a pre-trained LLM. We perform our experiments on models trained from scratch and from pre-trained models, verifying their applicability to a wide range of use-cases. We find that when fine-tuning on more than 50 billion tokens, we can specialize the tokenizer of a pre-trained LLM to obtain large gains in generation speed and effective context size.
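
To make the abstract's claim about tokenizer size and the pre-tokenization regular expression more concrete, here is a minimal sketch of a toy BPE trainer. It is not the authors' implementation: it merges characters rather than bytes for brevity, and the two patterns `SPLIT_WS` and `SPLIT_WORDS`, the sample text, and all function names are hypothetical choices made for illustration. It only shows how the number of merges (a proxy for vocabulary size) and the pre-tokenization pattern can change how many tokens a piece of code becomes.

```python
import re
from collections import Counter

# Two illustrative pre-tokenization patterns (assumptions, not taken from the paper).
SPLIT_WS = r"\s+|\S+"               # whitespace runs vs. non-whitespace runs
SPLIT_WORDS = r"\w+|\s+|[^\w\s]+"   # words, whitespace, punctuation runs


def pre_tokenize(text, pattern):
    """Split text into chunks; merges never cross chunk boundaries."""
    return re.findall(pattern, text)


def apply_merge(symbols, pair):
    """Replace each adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(chunks, num_merges):
    """Greedily learn up to `num_merges` merges from pre-tokenized chunks."""
    # Each distinct chunk starts as a tuple of characters, weighted by frequency.
    words = Counter(tuple(chunk) for chunk in chunks)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every chunk is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def encode(text, pattern, merges):
    """Tokenize text by replaying the learned merges in order on each chunk."""
    tokens = []
    for chunk in pre_tokenize(text, pattern):
        symbols = tuple(chunk)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


def tokens_per_char(text, pattern, num_merges):
    """Train on `text` and report tokens per character (lower means better compression)."""
    merges = train_bpe(pre_tokenize(text, pattern), num_merges)
    return len(encode(text, pattern, merges)) / len(text)


# Tiny hypothetical corpus of code, repeated so that frequent pairs emerge.
sample = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n" * 4

if __name__ == "__main__":
    for name, pattern in [("SPLIT_WS", SPLIT_WS), ("SPLIT_WORDS", SPLIT_WORDS)]:
        for num_merges in (5, 20, 50):
            ratio = tokens_per_char(sample, pattern, num_merges)
            print(f"{name:12s} merges={num_merges:3d} tokens/char={ratio:.3f}")
```

Running the script prints a tokens-per-character ratio for each combination. On a toy corpus like this the numbers say nothing about real models; the point is only that both knobs named in the abstract feed directly into sequence length, which in turn drives generation speed and effective context size.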

arXiv information

Authors	Gautier Dagan, Gabriel Synnaeve, Baptiste Rozière
Published	2024-02-07 10:51:11+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
