Enhancing Multilingual LLM Pretraining with Model-Based Data Selection

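Summary

Dataset curation has become a foundation of strong large language model (LLM) performance.
Various rule-based filtering heuristics exist for both English and multilingual datasets, but model-based filtering techniques have focused mainly on English.
To address the disparity that stems from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples.
Our approach emphasizes transparency, simplicity, and efficiency, using Transformer- and FastText-based classifiers to keep the technique and data broadly accessible.
We run comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and levels of resource availability to demonstrate the method's effectiveness.
When training a 1B-parameter Llama model for 70B and 119B tokens, our approach matches the baseline MMLU score with as little as 15% of the training tokens, and also improves results on other benchmarks.
These findings provide strong evidence that the approach generalizes to other languages.
We therefore extend the framework to 20 languages and release the refined pretraining datasets for them.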

Summary (original)

Dataset curation has become a basis for strong large language model (LLM) performance. While various rule-based filtering heuristics exist for English and multilingual datasets, model-based filtering techniques have primarily focused on English. To address the disparity stemming from limited research on non-English languages, we propose a model-based filtering framework for multilingual datasets that aims to identify a diverse set of structured and knowledge-rich samples. Our approach emphasizes transparency, simplicity, and efficiency, leveraging Transformer- and FastText-based classifiers to ensure the broad accessibility of our technique and data. We conduct comprehensive ablation studies on the FineWeb-2 web crawl dataset across diverse language families, scripts, and resource availability to demonstrate the effectiveness of our method. Training a 1B-parameter Llama model for 70B and 119B tokens, our approach can match the baseline MMLU score with as little as 15% of the training tokens, while also improving across other benchmarks. These findings provide strong evidence for the generalizability of our approach to other languages. As a result, we extend our framework to 20 languages for which we release the refined pretraining datasets.
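
The general idea of model-based filtering described above (score every document with a quality classifier, then keep only the best-scoring part of the corpus) can be illustrated with a small sketch. This is not the authors' pipeline: the abstract names Transformer- and FastText-based classifiers, whereas the code below substitutes a toy smoothed log-odds word scorer so that it runs with the standard library alone. The function names `train_scorer` and `select_top_fraction`, the whitespace tokenizer, and the `keep_fraction` parameter are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplifying assumption)."""
    return text.lower().split()


def train_scorer(positive_docs, negative_docs):
    """Fit a toy quality scorer from example documents.

    Each token gets a weight equal to the difference of its add-one
    smoothed log-probabilities in the positive and negative examples.
    This stands in for a real classifier and is NOT the paper's model.
    """
    pos = Counter(tok for doc in positive_docs for tok in tokenize(doc))
    neg = Counter(tok for doc in negative_docs for tok in tokenize(doc))
    vocab = set(pos) | set(neg)
    pos_total = sum(pos.values()) + len(vocab)
    neg_total = sum(neg.values()) + len(vocab)
    weights = {
        tok: math.log((pos[tok] + 1) / pos_total)
        - math.log((neg[tok] + 1) / neg_total)
        for tok in vocab
    }

    def score(text):
        tokens = tokenize(text)
        if not tokens:
            return float("-inf")  # empty documents rank last
        # Average weight per token; unseen tokens contribute 0.
        return sum(weights.get(tok, 0.0) for tok in tokens) / len(tokens)

    return score


def select_top_fraction(docs, score, keep_fraction):
    """Return the highest-scoring `keep_fraction` of `docs`, best first."""
    if not 0 < keep_fraction <= 1:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, math.ceil(len(docs) * keep_fraction))
    return sorted(docs, key=score, reverse=True)[:k]
```

A caller would fit the scorer on a handful of documents judged to be knowledge-rich versus low quality, then apply `select_top_fraction(corpus, score, keep_fraction)` to a larger pool. In a realistic setting the scorer would be replaced by a trained classifier, and the retained fraction would be chosen by ablation rather than fixed in advance.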

arXiv information

Authors	Bettina Messmer, Vinko Sabolčec, Martin Jaggi
Published	2025-02-14 18:42:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
