One Tokenizer To Rule Them All: Emergent Language Plasticity via Multilingual Tokenizers

Summary

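Pretraining massively multilingual large language models (LLMs) on many languages at once is difficult because of limited model capacity, scarce high-quality data, and compute constraints.
In addition, a tokenizer's lack of language coverage makes it hard to close the gap for new languages purely at the post-training stage.
In this work, we study which relatively cheap interventions early in training improve "language plasticity", that is, the model's ability to adapt to new languages after training.
We focus on tokenizer design and propose using a universal tokenizer, trained on more languages than the primary pretraining languages, to enable efficient adaptation when expanding language coverage after pretraining.
Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with win rates up to 20.2% higher than those of tokenizers specific to the pretraining languages.
Furthermore, a universal tokenizer also improves plasticity toward languages that are completely unseen in both the tokenizer and pretraining, with a win rate gain of up to 5%.
We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.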

Summary (original)

Pretraining massively multilingual Large Language Models (LLMs) for many languages at once is challenging due to limited model capacity, scarce high-quality data, and compute constraints. Moreover, the lack of language coverage of the tokenizer makes it harder to address the gap for new languages purely at the post-training stage. In this work, we study what relatively cheap interventions early on in training improve ‘language plasticity’, or adaptation capabilities of the model post-training to new languages. We focus on tokenizer design and propose using a universal tokenizer that is trained for more languages than the primary pretraining languages to enable efficient adaptation in expanding language coverage after pretraining. Our systematic experiments across diverse groups of languages and different training strategies show that a universal tokenizer enables significantly higher language adaptation, with up to 20.2% increase in win rates compared to tokenizers specific to pretraining languages. Furthermore, a universal tokenizer also leads to better plasticity towards languages that are completely unseen in the tokenizer and pretraining, by up to 5% win rate gain. We achieve this adaptation to an expanded set of languages with minimal compromise in performance on the majority of languages included in pretraining.

arXiv information

Authors	Diana Abagyan, Alejandro R. Salamanca, Andres Felipe Cruz-Salinas, Kris Cao, Hangyu Lin, Acyr Locatelli, Marzieh Fadaee, Ahmet Üstün, Sara Hooker
Published	2025-06-12 14:47:13+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
