Improving Language Models Trained on Translated Data with Continual Pre-Training and Dictionary Learning Analysis

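Summary

Training LLMs for low-resource languages typically relies on data augmentation from English using machine translation (MT).
However, this poses a number of challenges for LLM training. Translating and curating large amounts of content with high-end machine translation solutions is costly.
The translated content carries over cultural biases.
If the translation is not faithful and accurate, data quality degrades, causing problems in the trained model.
This work investigates the role of translated and synthetic data in training language models.
We translate TinyStories, a dataset of 2.2 million short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model.
Using this data, we train a number of story generation models ranging in size from 1M to 33M parameters.
We identify a number of quality and task-specific issues in the resulting models.
To rectify these issues, we further pre-train the models on a small dataset of synthesized high-quality Arabic stories generated by a capable LLM, amounting to 1% of the original training data.
Using GPT-4 as a judge and dictionary learning analysis from mechanistic interpretability, we show that the proposed approach is a practical means of resolving some of the pitfalls of machine translation.
We illustrate the improvements through case studies of linguistic and cultural bias issues.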

Summary (original)

Training LLMs for low-resource languages usually utilizes data augmentation from English using machine translation (MT). This, however, brings a number of challenges to LLM training: there are large costs attached to translating and curating huge amounts of content with high-end machine translation solutions; the translated content carries over cultural biases; and if the translation is not faithful and accurate, data quality degrades causing issues in the trained model. In this work, we investigate the role of translation and synthetic data in training language models. We translate TinyStories, a dataset of 2.2M short stories for 3-4 year old children, from English to Arabic using the open NLLB-3B MT model. We train a number of story generation models of size 1M-33M parameters using this data. We identify a number of quality and task-specific issues in the resulting models. To rectify these issues, we further pre-train the models with a small dataset of synthesized high-quality Arabic stories generated by a capable LLM, representing 1% of the original training data. We show, using GPT-4 as a judge and Dictionary Learning Analysis from mechanistic interpretability, that the suggested approach is a practical means to resolve some of the machine translation pitfalls. We illustrate the improvements through case studies of linguistic and cultural bias issues.

arXiv information

Authors	Sabri Boughorbel, MD Rizwan Parvez, Majd Hawasly
Published	2024-08-07 08:21:58+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
