Developing and Utilizing a Large-Scale Cantonese Dataset for Multi-Tasking in Large Language Models

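Summary

High-quality data resources play a crucial role in training large language models (LLMs), especially for low-resource languages such as Cantonese.
Despite having more than 85 million native speakers, Cantonese is considered a low-resource language in natural language processing (NLP) because of factors such as the dominance of Mandarin, a lack of cohesion within the Cantonese-speaking community, the diversity of character encodings and input methods, and the tendency of overseas Cantonese speakers to prefer English.
In addition, Cantonese's rich colloquial vocabulary, English loanwords, and code-switching add to the complexity of collecting and processing corpora.
To address these challenges, we collect Cantonese text from a variety of sources, including open-source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data.
Through rigorous data processing that comprises language filtering, quality filtering, content filtering, and de-duplication, we construct a high-quality Cantonese corpus of over 2 billion tokens for training large language models.
We further refine the model through supervised fine-tuning (SFT) on curated Cantonese tasks, improving its ability to handle specific applications.
Once training is complete, the model achieves state-of-the-art (SOTA) performance on four Cantonese benchmarks.
After training on the dataset, the model also shows improved performance on other mainstream language tasks.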
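
The four processing stages named in the summary (language filtering, quality filtering, content filtering, de-duplication) can be illustrated with a minimal sketch. This is not the authors' pipeline: the marker characters, the length threshold, the blocklist, and all function names are hypothetical stand-ins chosen only to show how such stages might chain together.

```python
import hashlib
import re

# Assumption: a handful of characters common in written Cantonese but rare
# in Standard Written Chinese. A real language filter would be far richer.
CANTONESE_MARKERS = set("嘅咗喺哋唔啲嚟")

# Assumption: a toy blocklist standing in for a real content filter.
BLOCKED_TERMS = {"spam-link"}


def looks_cantonese(text: str) -> bool:
    """Language filter: keep text containing at least one marker character."""
    return any(ch in CANTONESE_MARKERS for ch in text)


def good_quality(text: str, min_chars: int = 10) -> bool:
    """Quality filter: drop documents shorter than min_chars characters."""
    return len(text.strip()) >= min_chars


def allowed_content(text: str) -> bool:
    """Content filter: drop documents containing any blocked term."""
    return not any(term in text for term in BLOCKED_TERMS)


def fingerprint(text: str) -> str:
    """Hash the text with all whitespace removed, for exact de-duplication."""
    normalized = re.sub(r"\s+", "", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def build_corpus(docs: list[str]) -> list[str]:
    """Apply the three filters, then keep the first copy of each document."""
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not (looks_cantonese(doc) and good_quality(doc) and allowed_content(doc)):
            continue
        digest = fingerprint(doc)
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

Running the cheap per-document filters before hashing keeps the de-duplication set small. Near-duplicate detection (for example MinHash) would need a different approach from the exact hashing shown here.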

Abstract (original)

High-quality data resources play a crucial role in learning large language models (LLMs), particularly for low-resource languages like Cantonese. Despite having more than 85 million native speakers, Cantonese is still considered a low-resource language in the field of natural language processing (NLP) due to factors such as the dominance of Mandarin, lack of cohesion within the Cantonese-speaking community, diversity in character encoding and input methods, and the tendency of overseas Cantonese speakers to prefer using English. In addition, rich colloquial vocabulary of Cantonese, English loanwords, and code-switching characteristics add to the complexity of corpus collection and processing. To address these challenges, we collect Cantonese texts from a variety of sources, including open source corpora, Hong Kong-specific forums, Wikipedia, and Common Crawl data. We conduct rigorous data processing through language filtering, quality filtering, content filtering, and de-duplication steps, successfully constructing a high-quality Cantonese corpus of over 2 billion tokens for training large language models. We further refined the model through supervised fine-tuning (SFT) on curated Cantonese tasks, enhancing its ability to handle specific applications. Upon completion of the training, the model achieves state-of-the-art (SOTA) performance on four Cantonese benchmarks. After training on our dataset, the model also exhibits improved performance on other mainstream language tasks.

arXiv information

Authors	Jiyue Jiang, Alfred Kar Yin Truong, Yanyu Chen, Qinghang Bao, Sheng Wang, Pengan Chen, Jiuming Wang, Lingpeng Kong, Yu Li, Chuan Wu
Published	2025-03-05 17:53:07+00:00
arXiv page	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
