Benchmarking the Performance of Pre-trained LLMs across Urdu NLP Tasks

Abstract

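Large language models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research by shifting from language- and task-specific model pipelines to a single model adapted to a variety of tasks.
However, most existing multilingual NLP benchmarks for LLMs provide evaluation data in only a few languages with little linguistic diversity.
In addition, these benchmarks lack quality assessment against the respective state-of-the-art models.
This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B, and Whisper (large, medium, and small variants). The models are evaluated across 17 tasks using 22 datasets and 13.8 hours of speech in a zero-shot setting, and their performance is compared with and analyzed against state-of-the-art (SOTA) models.
Our experiments show that SOTA models currently outperform encoder-decoder models on the majority of Urdu NLP tasks under zero-shot settings.
However, comparing Llama 3.1-8B with its predecessor Llama 2-7B-Chat, we can infer that, with improved language coverage, LLMs can surpass these SOTA models.
Our results emphasize that models with fewer parameters but richer language-specific data, such as Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, on several tasks.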

Abstract (original)

Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research, by transitioning from languages and task specific model pipelines to a single model adapted on a variety of tasks. However majority of existing multilingual NLP benchmarks for LLMs provide evaluation data in only few languages with little linguistic diversity. In addition these benchmarks lack quality assessment against the respective state-of the art models. This study presents an in-depth examination of 7 prominent LLMs: GPT-3.5-turbo, Llama 2-7B-Chat, Llama 3.1-8B, Bloomz 3B, Bloomz 7B1, Ministral-8B and Whisper (Large, medium and small variant) across 17 tasks using 22 datasets, 13.8 hours of speech, in a zero-shot setting, and their performance against state-of-the-art (SOTA) models, has been compared and analyzed. Our experiments show that SOTA models currently outperform encoder-decoder models in majority of Urdu NLP tasks under zero-shot settings. However, comparing Llama 3.1-8B over prior version Llama 2-7B-Chat, we can deduce that with improved language coverage, LLMs can surpass these SOTA models. Our results emphasize that models with fewer parameters but richer language-specific data, like Llama 3.1-8B, often outperform larger models with lower language diversity, such as GPT-3.5, in several tasks.

arXiv information

Authors	Munief Hassan Tahir, Sana Shams, Layba Fiaz, Farah Adeeba, Sarmad Hussain
Published	2024-12-31 09:13:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
