Swan and ArabicMTEB: Dialect-Aware, Arabic-Centric, Cross-Lingual, and Cross-Cultural Embedding Models and Benchmarks

Summary

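Swan is a family of Arabic-centric embedding models that addresses both small-scale and large-scale use cases.
Swan includes two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model.
To evaluate these models, we propose ArabicMTEB, a comprehensive benchmark suite that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance across eight diverse tasks and 94 datasets.
Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large on most Arabic tasks, while Swan-Small consistently surpasses Multilingual-E5-base.
Our extensive evaluations show that Swan models are both dialectally and culturally aware, excelling across various Arabic domains while offering significant monetary efficiency.
This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic natural language processing.
Our models and benchmark will be made publicly available for research.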

Summary (original)

We introduce Swan, a family of embedding models centred around the Arabic language, addressing both small-scale and large-scale use cases. Swan includes two variants: Swan-Small, based on ARBERTv2, and Swan-Large, built on ArMistral, a pretrained Arabic large language model. To evaluate these models, we propose ArabicMTEB, a comprehensive benchmark suite that assesses cross-lingual, multi-dialectal, multi-domain, and multi-cultural Arabic text embedding performance, covering eight diverse tasks and spanning 94 datasets. Swan-Large achieves state-of-the-art results, outperforming Multilingual-E5-large in most Arabic tasks, while the Swan-Small consistently surpasses Multilingual-E5-base. Our extensive evaluations demonstrate that Swan models are both dialectally and culturally aware, excelling across various Arabic domains while offering significant monetary efficiency. This work significantly advances the field of Arabic language modelling and provides valuable resources for future research and applications in Arabic natural language processing. Our models and benchmark will be made publicly accessible for research.

arXiv information

Authors	Gagan Bhatia, El Moatez Billah Nagoudi, Abdellah El Mekki, Fakhraddin Alwajih, Muhammad Abdul-Mageed
Published	2024-11-06 11:19:01+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
