Text classification dataset and analysis for Uzbek language

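Summary

Text classification is an important task in natural language processing (NLP) whose goal is to assign text data to predefined classes.
This study analyzes the dataset creation steps and evaluation techniques for a multi-label news categorization task, as one part of text classification.
We first present a newly collected dataset for Uzbek text classification, gathered from 10 different news and press websites and covering 15 categories of news, press, and legal texts.
We also present a comprehensive evaluation of a range of models on this new dataset, from traditional bag-of-words models to deep learning architectures.
Our experiments show that models based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs) outperform the rule-based models.
The best performance is achieved by BERTbek, a transformer-based BERT model trained on an Uzbek corpus.
Our findings provide a good baseline for further research on Uzbek text classification.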

Summary (original)

Text classification is an important task in Natural Language Processing (NLP), where the goal is to categorize text data into predefined classes. In this study, we analyse the dataset creation steps and evaluation techniques of multi-label news categorisation task as part of text classification. We first present a newly obtained dataset for Uzbek text classification, which was collected from 10 different news and press websites and covers 15 categories of news, press and law texts. We also present a comprehensive evaluation of different models, ranging from traditional bag-of-words models to deep learning architectures, on this newly created dataset. Our experiments show that the Recurrent Neural Network (RNN) and Convolutional Neural Network (CNN) based models outperform the rule-based models. The best performance is achieved by the BERTbek model, which is a transformer-based BERT model trained on the Uzbek corpus. Our findings provide a good baseline for further research in Uzbek text classification.
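
To make the "traditional bag-of-words" end of the evaluated model range concrete, here is a minimal sketch of a bag-of-words classifier using multinomial Naive Bayes with add-one smoothing. It is not the authors' implementation: the choice of Naive Bayes, the whitespace tokenizer, the function names `train_nb` and `predict_nb`, and the four toy sentences with their two labels are all assumptions made up for illustration and are not taken from the dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for real Uzbek tokenization.
    return text.lower().split()


def train_nb(docs, labels):
    """Collect the per-class document and word counts a bag-of-words model needs."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch (not from the paper's dataset).
train_docs = [
    "futbol terma jamoasi yutdi",
    "futbol oyini stadionda boshlandi",
    "bank kredit foizi oshdi",
    "bank valyuta kursi ozgardi",
]
train_labels = ["sport", "sport", "economy", "economy"]

model = train_nb(train_docs, train_labels)
print(predict_nb(model, "futbol oyini"))
print(predict_nb(model, "bank kredit"))
```

Word order is discarded entirely here, which is the defining property of a bag-of-words representation and one reason sequence-aware models such as RNNs, CNNs, and BERT-style transformers can do better on the same data.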

arXiv information

Authors	Elmurod Kuriyozov, Ulugbek Salaev, Sanatbek Matlatipov, Gayrat Matlatipov
Published	2023-02-28 11:21:24+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
