Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language

Summary

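Because English is a very high-resource language, it supports the pretraining of high-quality large language models (LLMs).
The same is not true of most other languages: leading LLMs still underperform in non-English languages, probably because of a gap in the quality and diversity of the available multilingual pretraining corpora.
This work finds that text machine-translated from a single high-quality source language can contribute substantially to the pretraining of multilingual LLMs.
The authors translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish to produce a 300B-token dataset called TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on it.
Across five non-English reasoning tasks, CuatroLLM matches or outperforms state-of-the-art multilingual models trained on closed data, such as Llama3.2 and Gemma2, while using an order of magnitude less data: about 6% of the tokens used to train Llama3.2.
With additional domain-specific pretraining amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning.
To promote reproducibility, the corpus, models, and training pipeline are released under open licenses at hf.co/britllm/CuatroLLM.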

Summary (original)

English, as a very high-resource language, enables the pretraining of high-quality large language models (LLMs). The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2 and Gemma2, despite using an order of magnitude less data, such as about 6% of the tokens used for Llama3.2’s training. We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at hf.co/britllm/CuatroLLM.
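
The abstract gives the size of the extra domain-specific pretraining only as a share of the corpus. As a rough illustration, the sketch below converts "less than 1% of TransWeb-Edu" into a token ceiling, using only the 300B and 1% figures quoted above. The function and constant names are hypothetical and chosen for this example; they do not come from the released training pipeline.

```python
# Figures quoted in the abstract; all names here are illustrative assumptions.
TRANSWEB_EDU_TOKENS = 300_000_000_000  # "300B-token dataset"
DOMAIN_SHARE_UPPER_PCT = 1             # "less than 1% of TransWeb-Edu"


def domain_token_ceiling(corpus_tokens: int, share_pct: int) -> int:
    """Return the token count that corresponds to share_pct percent of the corpus."""
    return corpus_tokens * share_pct // 100


def format_billions(tokens: int) -> str:
    """Render a token count in billions with one decimal place."""
    return f"{tokens / 1e9:.1f}B"


ceiling = domain_token_ceiling(TRANSWEB_EDU_TOKENS, DOMAIN_SHARE_UPPER_PCT)
print(f"Domain-specific pretraining uses fewer than {format_billions(ceiling)} tokens")
```

Integer arithmetic is used so that the ceiling is exact. The abstract states only the upper bound, so the actual amount of domain-specific data is not determined by this calculation.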

arXiv information

Authors	Jiayi Wang, Yao Lu, Maurice Weber, Max Ryabinin, Yihong Chen, Raphael Tang, Pontus Stenetorp
Published	2024-10-31 14:09:50+00:00

Source and services used

arxiv.jp, Google
