Resource-Aware Arabic LLM Creation: Model Adaptation, Integration, and Multi-Domain Testing

Summary

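This paper introduces a new approach to fine-tuning the Qwen2-1.5B model for Arabic language processing with Quantized Low-Rank Adaptation (QLoRA) on a system that has only 4GB of VRAM.
It describes in detail how this large language model is adapted to the Arabic domain using diverse datasets, including the Bactrian, OpenAssistant, and Wikipedia Arabic corpora.
The methodology covers custom data preprocessing, model configuration, and training optimization techniques such as gradient accumulation and mixed-precision training.
The authors address challenges specific to Arabic NLP, including morphological complexity, dialectal variation, and the handling of diacritical marks.
Experiments run over 10,000 training steps show significant performance improvements, with the final loss converging to 0.1083.
The paper provides a comprehensive analysis of GPU memory usage, training dynamics, and model evaluation across a range of Arabic language tasks, including text classification, question answering, and dialect identification.
The fine-tuned model shows robustness to input perturbations and improved handling of Arabic-specific linguistic phenomena.
By demonstrating a resource-efficient way to build specialized language models, the research contributes to multilingual AI and may help democratize access to advanced NLP technology for diverse linguistic communities.
The work paves the way for future research on low-resource language adaptation and efficient fine-tuning of large language models.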
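
The summary names gradient accumulation as one of the training optimizations but does not show how it works. The sketch below is a minimal illustration of the general idea only, not the paper's training code: it uses a toy one-parameter least-squares model in plain Python, and the data, function names, and micro-batch size are all hypothetical. It checks that averaging gradients over equal-sized micro-batches reproduces the gradient of the full batch, which is why a small-memory GPU can emulate a larger batch size.

```python
# Gradient accumulation on a toy 1-D least-squares model (pure Python).
# Everything here (data, names, sizes) is illustrative and not from the paper.

def grad_mse(w, batch):
    """Gradient w.r.t. w of 0.5 * (w*x - y)^2, averaged over the batch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, batch, micro_batch_size):
    """Accumulate micro-batch gradients, each scaled by 1/num_micro_batches.

    Assumes len(batch) is divisible by micro_batch_size, so that every
    micro-batch has the same size and the result equals the full-batch mean.
    """
    micro_batches = [batch[i:i + micro_batch_size]
                     for i in range(0, len(batch), micro_batch_size)]
    total = 0.0
    for micro_batch in micro_batches:
        # Only one micro-batch needs to be held in memory at a time.
        total += grad_mse(w, micro_batch) / len(micro_batches)
    return total

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = grad_mse(0.5, data)                               # one batch of 4
accum = accumulated_grad(0.5, data, micro_batch_size=1)  # 4 micro-batches of 1
print(abs(full - accum) < 1e-9)
```

In a real training loop the optimizer step would be taken once per accumulated gradient rather than once per micro-batch; the trade-off is more forward and backward passes per update in exchange for lower peak memory.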

Abstract (original)

This paper presents a novel approach to fine-tuning the Qwen2-1.5B model for Arabic language processing using Quantized Low-Rank Adaptation (QLoRA) on a system with only 4GB VRAM. We detail the process of adapting this large language model to the Arabic domain, using diverse datasets including Bactrian, OpenAssistant, and Wikipedia Arabic corpora. Our methodology involves custom data preprocessing, model configuration, and training optimization techniques such as gradient accumulation and mixed-precision training. We address specific challenges in Arabic NLP, including morphological complexity, dialectal variations, and diacritical mark handling. Experimental results over 10,000 training steps show significant performance improvements, with the final loss converging to 0.1083. We provide comprehensive analysis of GPU memory usage, training dynamics, and model evaluation across various Arabic language tasks, including text classification, question answering, and dialect identification. The fine-tuned model demonstrates robustness to input perturbations and improved handling of Arabic-specific linguistic phenomena. This research contributes to multilingual AI by demonstrating a resource-efficient approach for creating specialized language models, potentially democratizing access to advanced NLP technologies for diverse linguistic communities. Our work paves the way for future research in low-resource language adaptation and efficient fine-tuning of large language models.
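
To make the 4GB VRAM constraint concrete, the following back-of-the-envelope sketch estimates how much memory the model weights alone would need at different precisions, and how few trainable parameters a low-rank adapter adds. It is an illustration under stated assumptions, not the paper's measurement: the parameter count is the nominal 1.5 billion from the model name, the 1024 x 1024 layer shape and rank 8 are hypothetical, and activations, optimizer state, and quantization metadata are ignored.

```python
# Rough weight-memory arithmetic for quantized fine-tuning.
# Assumptions (not from the paper): nominal 1.5e9 parameters, a hypothetical
# 1024 x 1024 weight matrix, and a hypothetical LoRA rank of 8.

def weight_memory_gib(num_params, bits_per_param):
    """Memory for the weights alone, in GiB (ignores activations, optimizer
    state, and the small overhead of quantization constants)."""
    return num_params * bits_per_param / 8 / 2**30

def lora_trainable_params(d_in, d_out, rank):
    """A rank-r adapter factors the update of a d_out x d_in matrix into
    B (d_out x r) and A (r x d_in), so it trains r * (d_in + d_out) values."""
    return rank * (d_in + d_out)

NUM_PARAMS = 1.5e9  # nominal size implied by the model name

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)
adapter = lora_trainable_params(d_in=1024, d_out=1024, rank=8)
full_matrix = 1024 * 1024

print(f"fp16 weights:  {fp16_gib:.2f} GiB")
print(f"4-bit weights: {int4_gib:.2f} GiB")
print(f"adapter params per matrix: {adapter} of {full_matrix}")
```

Under these assumptions the 16-bit weights come to about 2.8 GiB, leaving little of a 4GB card for anything else, while the 4-bit weights come to about 0.7 GiB. The adapter for the hypothetical matrix trains 16,384 values instead of 1,048,576.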

arXiv information

Author	Prakash Aryan
Published	2024-12-23 13:08:48+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
