AlleNoise: large-scale text classification benchmark dataset with real-world label noise

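Summary

Label noise remains a challenge when training robust classification models.
Most methods for mitigating label noise have been benchmarked primarily on datasets with synthetic noise.
The need for datasets with realistic noise distributions has been partially addressed by web-scraped benchmarks such as WebVision and Clothing1M, but those benchmarks are limited to the computer vision domain.
Given the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels.
This paper presents AlleNoise, a new curated text classification benchmark dataset with real-world, instance-dependent label noise. It contains over 500,000 examples across approximately 5,600 classes and is complemented by a meaningful, hierarchical taxonomy of categories.
The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes.
In addition to the noisy labels, the authors provide human-verified clean labels. Unlike the web-scraped datasets typically used in the field, these give deeper insight into the noise distribution.
The authors show that a representative selection of established methods for learning with noisy labels is inadequate for handling such real-world noise.
They also present evidence that these algorithms do not alleviate excessive memorization.
With AlleNoise, they therefore set a high bar for the development of label noise methods that can handle real-world label noise in text classification tasks.
The code and dataset can be downloaded from https://github.com/allegro/AlleNoise.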

Abstract (original)

Label noise remains a challenge for training robust classification models. Most methods for mitigating label noise have been benchmarked using primarily datasets with synthetic noise. While the need for datasets with realistic noise distribution has partially been addressed by web-scraped benchmarks such as WebVision and Clothing1M, those benchmarks are restricted to the computer vision domain. With the growing importance of Transformer-based models, it is crucial to establish text classification benchmarks for learning with noisy labels. In this paper, we present AlleNoise, a new curated text classification benchmark dataset with real-world instance-dependent label noise, containing over 500,000 examples across approximately 5,600 classes, complemented with a meaningful, hierarchical taxonomy of categories. The noise distribution comes from actual users of a major e-commerce marketplace, so it realistically reflects the semantics of human mistakes. In addition to the noisy labels, we provide human-verified clean labels, which help to get a deeper insight into the noise distribution, unlike web-scraped datasets typically used in the field. We demonstrate that a representative selection of established methods for learning with noisy labels is inadequate to handle such real-world noise. In addition, we show evidence that these algorithms do not alleviate excessive memorization. As such, with AlleNoise, we set the bar high for the development of label noise methods that can handle real-world label noise in text classification tasks. The code and dataset are available for download at https://github.com/allegro/AlleNoise.
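Because each example carries both a noisy label and a human-verified clean label, the noise distribution can be inspected directly. The following is a minimal sketch of that idea, not the authors' code: the function name `noise_summary`, the toy category names, and the plain-list input format are all assumptions made for illustration and do not reflect the actual AlleNoise file schema.

```python
from collections import Counter


def noise_summary(noisy_labels, clean_labels, top_k=3):
    """Return the overall noise rate and the most frequent
    (clean label, noisy label) confusion pairs.

    Hypothetical helper: the input format (two parallel lists) is an
    assumption, not the layout of the published dataset.
    """
    if len(noisy_labels) != len(clean_labels):
        raise ValueError("label lists must have the same length")
    if not noisy_labels:
        raise ValueError("label lists must not be empty")

    # Count only the examples whose noisy label disagrees with the clean one.
    confusions = Counter(
        (clean, noisy)
        for noisy, clean in zip(noisy_labels, clean_labels)
        if noisy != clean
    )
    noise_rate = sum(confusions.values()) / len(noisy_labels)
    return noise_rate, confusions.most_common(top_k)


# Toy data with made-up category names (not taken from AlleNoise).
noisy = ["phone cases", "laptops", "phone cases", "tablets", "laptops"]
clean = ["phone cases", "laptops", "phones", "tablets", "tablets"]

rate, top_confusions = noise_summary(noisy, clean)
print(f"noise rate: {rate:.2f}")
for (clean_label, noisy_label), count in top_confusions:
    print(f"{clean_label} mislabeled as {noisy_label}: {count}")
```

Counting (clean, noisy) pairs rather than only the overall rate is what makes the instance-dependent, semantically driven character of the noise visible: confusions tend to cluster between related categories.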

arXiv information

Authors	Alicja Rączkowska, Aleksandra Osowska-Kurczab, Jacek Szczerbiński, Kalina Jasinska-Kobus, Klaudia Nazarko
Published	2024-10-23 16:19:06+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
