PeLLE: Encoder-based language models for Brazilian Portuguese based on open data

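Summary

This paper presents PeLLE, a family of large language models for Brazilian Portuguese based on the RoBERTa architecture and trained on curated, open data from the Carolina corpus.
To support reproducible results, the authors describe the details of the models' pretraining.
They also evaluate the PeLLE models against a set of existing multilingual and PT-BR-refined pretrained Transformer-based LLM encoders, contrasting the performance of large pretrained models with that of smaller but curated ones on several downstream tasks.
They conclude that several tasks perform better with larger models, while some tasks benefit from smaller but curated pretraining data.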

Abstract (original)

In this paper we present PeLLE, a family of large language models based on the RoBERTa architecture, for Brazilian Portuguese, trained on curated, open data from the Carolina corpus. Aiming at reproducible results, we describe details of the pretraining of the models. We also evaluate PeLLE models against a set of existing multilingual and PT-BR refined pretrained Transformer-based LLM encoders, contrasting performance of large versus smaller-but-curated pretrained models in several downstream tasks. We conclude that several tasks perform better with larger models, but some tasks benefit from smaller-but-curated data in its pretraining.

arXiv information

Authors	Guilherme Lamartine de Mello, Marcelo Finger, Felipe Serras, Miguel de Mello Carpi, Marcos Menon Jose, Pedro Henrique Domingues, Paulo Cavalim
Published	2024-02-29 14:34:03+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
