OpenCSG Chinese Corpus: A Series of High-quality Chinese Datasets for LLM Training

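Summary

Large language models (LLMs) have demonstrated remarkable capabilities, but their success depends heavily on the quality of the pretraining corpus.
For Chinese LLMs, the scarcity of high-quality Chinese datasets is a significant challenge that often limits performance.
To address this, the authors propose the OpenCSG Chinese Corpus, a series of high-quality datasets designed specifically for LLM pretraining, post-training, and fine-tuning.
The corpus comprises Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics. The Fineweb-edu datasets focus on filtered, high-quality content drawn from diverse Chinese web sources.
Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training.
Smoltalk-chinese emphasizes stylistically varied, diverse chat-format data.
The OpenCSG Chinese Corpus is characterized by high-quality text, broad coverage across domains, and scalable, reproducible data curation processes.
The authors also conducted extensive experimental analyses, including evaluations on smaller-parameter models, which showed significant performance improvements on tasks such as C-Eval and indicate that the corpus is effective for training Chinese LLMs.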

Abstract (original)

Large language models (LLMs) have demonstrated remarkable capabilities, but their success heavily relies on the quality of pretraining corpora. For Chinese LLMs, the scarcity of high-quality Chinese datasets presents a significant challenge, often limiting their performance. To address this issue, we propose the OpenCSG Chinese Corpus, a series of high-quality datasets specifically designed for LLM pretraining, post-training, and fine-tuning. This corpus includes Fineweb-edu-chinese, Fineweb-edu-chinese-v2, Cosmopedia-chinese, and Smoltalk-chinese, each with distinct characteristics: Fineweb-edu datasets focus on filtered, high-quality content derived from diverse Chinese web sources; Cosmopedia-chinese provides synthetic, textbook-style data for knowledge-intensive training; and Smoltalk-chinese emphasizes stylistic and diverse chat-format data. The OpenCSG Chinese Corpus is characterized by its high-quality text, diverse coverage across domains, and scalable, reproducible data curation processes. Additionally, we conducted extensive experimental analyses, including evaluations on smaller parameter models, which demonstrated significant performance improvements in tasks such as C-Eval, showcasing the effectiveness of the corpus for training Chinese LLMs.

arXiv information

Authors	Yijiong Yu, Ziyun Dai, Zekun Wang, Wei Wang, Ran Chen, Ji Pei
Published	2025-01-14 15:22:47+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
