Pretraining Data and Tokenizer for Indic LLM

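Summary

We present a novel approach to data preparation for developing a multilingual Indic large language model.
Our data collection spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, to ensure diverse and rich linguistic coverage.
For each Indic language, we design a custom preprocessing pipeline that removes redundant and low-quality text.
We also deduplicate the Common Crawl data to address the redundancy present in 70% of the crawled web pages.
This study focuses on developing high-quality data and on optimizing tokenization of the multilingual dataset for Indic large language models with 3B and 7B parameters, designed to perform well on Indic languages.
We introduce a novel multilingual tokenizer training strategy and demonstrate that our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a better token-to-word ratio for Indic languages.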
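
The token-to-word ratio used to compare tokenizers can be illustrated with a minimal sketch. It assumes that words are whitespace-delimited and that a lower ratio means more compact tokenization; the summary does not state how the authors count words. The function `token_to_word_ratio` and the two stand-in tokenizers are hypothetical names introduced only for illustration and are not part of any library.

```python
from typing import Callable, Iterable, List


def token_to_word_ratio(
    texts: Iterable[str],
    tokenize: Callable[[str], List[str]],
) -> float:
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        # Assumption: a "word" is any whitespace-separated chunk.
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


# Hypothetical stand-in tokenizers, for illustration only.
def char_tokenizer(text: str) -> List[str]:
    """Worst case: one token per non-space code point."""
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    """Best case: one token per word."""
    return text.split()


if __name__ == "__main__":
    sample = ["नमस्ते दुनिया", "hello world"]
    print(token_to_word_ratio(sample, char_tokenizer))
    print(token_to_word_ratio(sample, word_tokenizer))
```

Any real tokenizer can be passed in as the `tokenize` callable, provided it maps a string to a list of tokens, so the same corpus can be scored under two tokenizers and the ratios compared directly.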

Summary (original)

We present a novel approach to data preparation for developing multilingual Indic large language model. Our meticulous data acquisition spans open-source and proprietary sources, including Common Crawl, Indic books, news articles, and Wikipedia, ensuring a diverse and rich linguistic representation. For each Indic language, we design a custom preprocessing pipeline to effectively eliminate redundant and low-quality text content. Additionally, we perform deduplication on Common Crawl data to address the redundancy present in 70% of the crawled web pages. This study focuses on developing high-quality data, optimizing tokenization for our multilingual dataset for Indic large language models with 3B and 7B parameters, engineered for superior performance in Indic languages. We introduce a novel multilingual tokenizer training strategy, demonstrating our custom-trained Indic tokenizer outperforms the state-of-the-art OpenAI Tiktoken tokenizer, achieving a superior token-to-word ratio for Indic languages.
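
The deduplication step mentioned in the abstract can be sketched as follows. The abstract does not say which method the authors use, so this is only an assumed exact-match variant: each document is normalized and hashed, and later copies with the same hash are dropped. Near-duplicate detection (for example MinHash) is a common alternative that this sketch does not cover. The names `normalize` and `deduplicate` are hypothetical.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """NFC-normalize and collapse whitespace so trivial variants hash alike."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split())


def deduplicate(docs: Iterable[str]) -> Iterator[str]:
    """Yield the first occurrence of each document, dropping exact duplicates."""
    seen = set()
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc  # keep the original, un-normalized text


if __name__ == "__main__":
    pages = ["नमस्ते  दुनिया", "नमस्ते दुनिया", "another page"]
    for page in deduplicate(pages):
        print(page)
```

Storing only digests keeps memory use proportional to the number of unique documents rather than to their total size, which matters for crawl-scale inputs.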

arXiv information

Authors	Rahul Kumar, Shubham Kakde, Divyansh Rajput, Daud Ibrahim, Rishabh Nahata, Pidathala Sowjanya, Deepak Kumar
Published	2024-07-17 11:06:27+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
