A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

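Summary

Class imbalance poses new challenges for data stream classification.
Many algorithms recently proposed in the literature address this problem using a variety of data-level, algorithm-level, and ensemble approaches.
However, standardized and agreed-upon procedures and benchmarks for evaluating these algorithms are lacking.
This work proposes a standardized, exhaustive, and comprehensive experimental framework for evaluating algorithms on a collection of diverse and challenging imbalanced data stream scenarios.
The experimental study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, and real-world and semi-synthetic datasets in binary and multi-class scenarios.
The result is a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain.
The authors discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and give end users general recommendations for selecting the best algorithms for imbalanced data streams.
They also formulate open challenges and future directions for this domain.
The experimental framework is fully reproducible and easy to extend with new methods.
In this way, the work proposes a standardized approach to conducting experiments on imbalanced data streams that other researchers can use to produce complete, trustworthy, and fair evaluations of newly proposed methods.
The experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.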

Abstract (original)

Class imbalance poses new challenges when it comes to classifying data streams. Many algorithms recently proposed in the literature tackle this problem using a variety of data-level, algorithm-level, and ensemble approaches. However, there is a lack of standardized and agreed-upon procedures and benchmarks on how to evaluate these algorithms. This work proposes a standardized, exhaustive, and comprehensive experimental framework to evaluate algorithms in a collection of diverse and challenging imbalanced data stream scenarios. The experimental study evaluates 24 state-of-the-art data streams algorithms on 515 imbalanced data streams that combine static and dynamic class imbalance ratios, instance-level difficulties, concept drift, real-world and semi-synthetic datasets in binary and multi-class scenarios. This leads to a large-scale experimental study comparing state-of-the-art classifiers in the data stream mining domain. We discuss the advantages and disadvantages of state-of-the-art classifiers in each of these scenarios and we provide general recommendations to end-users for selecting the best algorithms for imbalanced data streams. Additionally, we formulate open challenges and future directions for this domain. Our experimental framework is fully reproducible and easy to extend with new methods. This way, we propose a standardized approach to conducting experiments in imbalanced data streams that can be used by other researchers to create complete, trustworthy, and fair evaluation of newly proposed methods. Our experimental framework can be downloaded from https://github.com/canoalberto/imbalanced-streams.

arXiv information

Authors	Gabriel Aguiar, Bartosz Krawczyk, Alberto Cano
Published	2023-07-18 15:28:39+00:00

