Larth: Dataset and Machine Translation for Etruscan

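Abstract

Etruscan is an ancient language that was spoken in Italy from the 7th century BC to the 1st century AD.
No native speakers remain today, and resources are scarce because only around 12,000 inscriptions are known.
To the best of our knowledge, no publicly available Etruscan corpus for natural language processing exists.
We therefore propose a dataset for Etruscan-to-English machine translation, containing 2891 translated examples drawn from existing academic sources.
Some examples were extracted manually, while others were acquired automatically.
Alongside the dataset, we benchmark several machine translation models and observe that a small transformer model can reach a BLEU score of 10.1.
Releasing the dataset can enable future research on this language, on similar languages, and on other low-resource languages.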
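
For readers unfamiliar with the BLEU metric mentioned above, the following is a minimal sketch of corpus-level BLEU: the geometric mean of clipped 1- to 4-gram precisions, multiplied by a brevity penalty. It is not the evaluation code used by the authors. The whitespace tokenization, the single reference per hypothesis, the absence of smoothing, the 0-100 scale, and the function names `ngrams` and `corpus_bleu` are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU on a 0-100 scale, one reference per hypothesis.

    Assumptions of this sketch: whitespace tokenization and no smoothing.
    """
    matches = [0] * max_n  # clipped n-gram matches, per n-gram order
    totals = [0] * max_n   # hypothesis n-gram counts, per n-gram order
    hyp_len = ref_len = 0

    for hyp, ref in zip(hypotheses, references):
        hyp_tokens, ref_tokens = hyp.split(), ref.split()
        hyp_len += len(hyp_tokens)
        ref_len += len(ref_tokens)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp_tokens, n)
            ref_counts = ngrams(ref_tokens, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches makes the score zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Made-up English sentences, not taken from the dataset.
    hyps = ["the cat sat on the mat"]
    refs = ["the cat sat on a mat"]
    print(f"BLEU = {corpus_bleu(hyps, refs):.1f}")
```

Scores from different BLEU implementations are only comparable when tokenization and smoothing match, so a reported number such as 10.1 should be read against the exact tool and settings used to produce it.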

Abstract (original)

Etruscan is an ancient language spoken in Italy from the 7th century BC to the 1st century AD. There are no native speakers of the language at the present day, and its resources are scarce, as there exist only around 12,000 known inscriptions. To the best of our knowledge, there are no publicly available Etruscan corpora for natural language processing. Therefore, we propose a dataset for machine translation from Etruscan to English, which contains 2891 translated examples from existing academic sources. Some examples are extracted manually, while others are acquired in an automatic way. Along with the dataset, we benchmark different machine translation models observing that it is possible to achieve a BLEU score of 10.1 with a small transformer model. Releasing the dataset can help enable future research on this language, similar languages or other languages with scarce resources.

arXiv information

Authors	Gianluca Vico, Gerasimos Spanakis
Published	2023-10-09 12:56:08+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
