Pipeline Analysis for Developing Instruct LLMs in Low-Resource Languages: A Case Study on Basque

Abstract

Large language models (LLMs) are typically optimized for resource-rich languages like English, exacerbating the gap between high-resource and underrepresented languages. This work presents a detailed analysis of strategies for developing a model capable of following instructions in a low-resource language, specifically Basque, by focusing on three key stages: pre-training, instruction tuning, and alignment with human preferences. Our findings demonstrate that continual pre-training with a high-quality Basque corpus of around 600 million words improves natural language understanding (NLU) of the foundational model by over 12 points. Moreover, instruction tuning and human preference alignment using automatically translated datasets proved highly effective, resulting in a 24-point improvement in instruction-following performance. The resulting models, Llama-eus-8B and Llama-eus-8B-instruct, establish a new state-of-the-art for Basque in the sub-10B parameter category.

arXiv information

Authors	Ander Corral, Ixak Sarasua, Xabier Saralegi
Published	2024-12-18 15:05:59+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
