MMT: A Multilingual and Multi-Topic Indian Social Media Dataset

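Summary

Title: MMT: A Multilingual and Multi-Topic Indian Social Media Dataset
Summary:
– Social media plays an important role in cross-cultural communication.
– However, much of this communication takes place in code-mixed, multilingual form, which poses a major challenge for natural language processing tools such as language identification, topic modeling, and named-entity recognition.
– To address this problem, the authors introduce MMT, a large-scale multilingual and multi-topic dataset covering 13 coarse-grained and 63 fine-grained topics in the Indian context.
– A subset of 5,346 tweets from the MMT dataset is annotated with various Indian languages and their code-mixed counterparts.
– The authors also show, on two downstream tasks (topic modeling and language identification), that existing tools fail to adequately capture the linguistic diversity in MMT.
– To facilitate future research, the anonymized and annotated dataset will be made publicly available.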

Summary (original)

Social media plays a significant role in cross-cultural communication. A vast amount of this occurs in code-mixed and multilingual form, posing a significant challenge to Natural Language Processing (NLP) tools for processing such information, like language identification, topic modeling, and named-entity recognition. To address this, we introduce a large-scale multilingual, and multi-topic dataset (MMT) collected from Twitter (1.7 million Tweets), encompassing 13 coarse-grained and 63 fine-grained topics in the Indian context. We further annotate a subset of 5,346 tweets from the MMT dataset with various Indian languages and their code-mixed counterparts. Also, we demonstrate that the currently existing tools fail to capture the linguistic diversity in MMT on two downstream tasks, i.e., topic modeling and language identification. To facilitate future research, we will make the anonymized and annotated dataset available in the public domain.

arXiv information

Authors	Dwip Dalal,Vivek Srivastava,Mayank Singh
Published	2023-04-02 21:39:00+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, OpenAI
