Language Variety Identification with True Labels

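Summary

Language identification is an important first step in many IR and NLP applications.
However, most publicly available language identification datasets are compiled under the assumption that the gold label of each instance is determined by the source from which the text was retrieved.
Research has shown this assumption to be problematic, particularly for very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where a text may contain no distinctive marker of a particular language or variety.
To overcome this limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification.
DSL-TL contains a total of 12,900 instances in three languages: Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English.
We trained multiple models to discriminate between these language varieties and present the results in detail.
The data and models presented in this paper provide a reliable benchmark toward the development of robust and fairer language variety identification systems.
We make DSL-TL freely available to the research community.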

Summary (original)

Language identification is an important first step in many IR and NLP applications. Most publicly available language identification datasets, however, are compiled under the assumption that the gold label of each instance is determined by where texts are retrieved from. Research has shown that this is a problematic assumption, particularly in the case of very similar languages (e.g., Croatian and Serbian) and national language varieties (e.g., Brazilian and European Portuguese), where texts may contain no distinctive marker of the particular language or variety. To overcome this important limitation, this paper presents DSL True Labels (DSL-TL), the first human-annotated multilingual dataset for language variety identification. DSL-TL contains a total of 12,900 instances in Portuguese, split between European Portuguese and Brazilian Portuguese; Spanish, split between Argentine Spanish and Castilian Spanish; and English, split between American English and British English. We trained multiple models to discriminate between these language varieties, and we present the results in detail. The data and models presented in this paper provide a reliable benchmark toward the development of robust and fairer language variety identification systems. We make DSL-TL freely available to the research community.

arXiv information

Authors	Marcos Zampieri, Kai North, Tommi Jauhiainen, Mariano Felice, Neha Kumari, Nishant Nair, Yash Bangera
Published	2023-03-02 18:51:58+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
