Efficient Hybrid Oversampling and Intelligent Undersampling for Imbalanced Big Data Classification

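Summary

Imbalanced classification is a well-known challenge in many real-world applications.
It arises when the distribution of the target variable is skewed, which biases predictions toward the majority class.
With the arrival of the Big Data era, efficient solutions to this problem are urgently needed.
This work introduces SMOTENN, a new resampling method that combines intelligent undersampling with oversampling using a MapReduce framework.
Both procedures run in the same pass over the data, which makes the technique efficient.
The SMOTENN method is complemented by an efficient implementation of the neighborhoods associated with the minority samples.
The experimental results show the strengths of this approach: it outperforms other resampling techniques on small and medium-sized datasets, and it achieves good results on large datasets with reduced running times.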
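
The idea of oversampling and undersampling in one pass can be illustrated with a small sketch. The abstract does not spell out the algorithm, so everything here is an assumption: the sketch reads SMOTENN as SMOTE-style interpolation between minority neighbors combined with Edited-Nearest-Neighbours-style removal of majority points, and it imitates MapReduce with plain Python functions over in-memory partitions. The names `k_nearest`, `resample_partition`, and `smote_enn_sketch` are hypothetical and are not taken from the paper.

```python
import math
import random
from functools import reduce

MINORITY, MAJORITY = 1, 0  # assumed label encoding for this sketch


def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx] (brute force, Euclidean)."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def resample_partition(partition, k=3, seed=0):
    """Map step (assumed): one loop over a partition performs both actions.

    partition is a list of (features, label) pairs; features is a tuple of floats.
    """
    rng = random.Random(seed)
    points = [x for x, _ in partition]
    labels = [y for _, y in partition]
    out = []
    for i, (x, y) in enumerate(partition):
        neigh = k_nearest(i, points, k)
        minority_neigh = [j for j in neigh if labels[j] == MINORITY]
        if y == MAJORITY:
            # ENN-style cleaning: keep a majority point only if minority
            # points are not the strict majority of its neighborhood.
            if len(minority_neigh) * 2 <= len(neigh):
                out.append((x, y))
        else:
            out.append((x, y))
            # SMOTE-style synthesis: interpolate toward one minority neighbor.
            if minority_neigh:
                j = rng.choice(minority_neigh)
                gap = rng.random()
                synthetic = tuple(a + gap * (b - a) for a, b in zip(x, points[j]))
                out.append((synthetic, MINORITY))
    return out


def smote_enn_sketch(partitions, k=3, seed=0):
    """Reduce step (assumed): concatenate the resampled partitions."""
    mapped = [resample_partition(p, k, seed + n) for n, p in enumerate(partitions)]
    return reduce(lambda a, b: a + b, mapped, [])


# Toy data: six majority points, three minority points, and one majority
# point sitting inside the minority cluster.
data = [
    ((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.1), 0), ((0.1, 0.1), 0),
    ((0.2, 0.0), 0), ((0.0, 0.2), 0),
    ((1.0, 1.0), 1), ((1.1, 1.0), 1), ((1.0, 1.1), 1),
    ((1.05, 1.05), 0),
]
balanced = smote_enn_sketch([data], k=3)
```

On this toy input the majority point inside the minority cluster is dropped and each minority point gains one synthetic neighbor, so `balanced` holds six samples of each class. Neighborhoods are computed per partition here, which is only a local approximation; how the paper handles neighborhoods across partitions is not stated in the abstract.
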

Abstract (original)

Imbalanced classification is a well-known challenge faced by many real-world applications. This issue occurs when the distribution of the target variable is skewed, leading to a prediction bias toward the majority class. With the arrival of the Big Data era, there is a pressing need for efficient solutions to solve this problem. In this work, we present a novel resampling method called SMOTENN that combines intelligent undersampling and oversampling using a MapReduce framework. Both procedures are performed on the same pass over the data, conferring efficiency to the technique. The SMOTENN method is complemented with an efficient implementation of the neighborhoods related to the minority samples. Our experimental results show the virtues of this approach, outperforming alternative resampling techniques for small- and medium-sized datasets while achieving positive results on large datasets with reduced running times.

arXiv information

Authors	Carla Vairetti, José Luis Assadi, Sebastián Maldonado
Published	2023-10-09 15:22:13+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
