Synthetic Tabular Data Generation for Imbalanced Classification: The Surprising Effectiveness of an Overlap Class

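Summary

Handling class imbalance when building classifiers over tabular data has long been a problem of interest.
One popular approach is to augment the training dataset with synthetically generated data.
Classical augmentation techniques were limited to linear interpolation between existing minority-class examples, whereas higher-capacity deep generative models have recently shown greater promise.
However, handling class imbalance while training a deep generative model is itself a challenging problem, and it has not been studied as extensively as imbalanced classifier training.
We show that state-of-the-art deep generative models yield minority examples of significantly lower quality than majority examples.
We propose a novel technique that converts the binary class labels into ternary class labels by introducing a class for the region where the minority and majority distributions overlap.
We show that this pre-processing of the training set alone significantly improves the quality of the generated data across several state-of-the-art diffusion and GAN-based models.
When training the classifier on synthetic data, we remove the overlap class from the training data, and we justify the reasons behind the improved accuracy.
We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that the method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance.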

Summary (original)

Handling imbalance in class distribution when building a classifier over tabular data has been a problem of long-standing interest. One popular approach is augmenting the training dataset with synthetically generated data. While classical augmentation techniques were limited to linear interpolation of existing minority class examples, recently higher capacity deep generative models are providing greater promise. However, handling of imbalance in class distribution when building a deep generative model is also a challenging problem that has not been studied as extensively as imbalanced classifier model training. We show that state-of-the-art deep generative models yield significantly lower-quality minority examples than majority examples. We propose a novel technique of converting the binary class labels to ternary class labels by introducing a class for the region where minority and majority distributions overlap. We show that just this pre-processing of the training set significantly improves the quality of data generated across several state-of-the-art diffusion and GAN-based models. While training the classifier using synthetic data, we remove the overlap class from the training data and justify the reasons behind the enhanced accuracy. We perform extensive experiments on four real-life datasets, five different classifiers, and five generative models, demonstrating that our method enhances not only the synthesizer performance of state-of-the-art models but also the classifier performance.
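
The binary-to-ternary relabeling described above can be sketched as follows. The abstract does not say how the overlap region is identified, so this sketch assumes a simple k-nearest-neighbour rule: a point is treated as lying in the overlap region when a sufficient fraction of its nearest neighbours carry the opposite label. The names `to_ternary`, `drop_overlap`, `OVERLAP`, and the parameters `k` and `threshold` are hypothetical and are not taken from the paper. The generative model that would be trained on the ternary labels is omitted.

```python
import math

# Hypothetical label for the overlap class (0 = majority, 1 = minority).
OVERLAP = 2


def to_ternary(X, y, k=3, threshold=0.3):
    """Relabel points in mixed-class neighbourhoods as OVERLAP.

    Assumption (not from the paper): a point lies in the overlap region
    when at least `threshold` of its k nearest neighbours carry the
    opposite binary label.
    """
    ternary = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Brute-force distances to every other point; fine for a sketch.
        dists = sorted(
            (math.dist(xi, xj), yj)
            for j, (xj, yj) in enumerate(zip(X, y))
            if j != i
        )
        neighbours = [label for _, label in dists[:k]]
        if not neighbours:
            # A single-point dataset has no neighbours; keep its label.
            ternary.append(yi)
            continue
        opposite = sum(1 for label in neighbours if label != yi) / len(neighbours)
        ternary.append(OVERLAP if opposite >= threshold else yi)
    return ternary


def drop_overlap(X, y3):
    """Remove overlap-class rows before training the downstream classifier."""
    kept = [(x, label) for x, label in zip(X, y3) if label != OVERLAP]
    return [x for x, _ in kept], [label for _, label in kept]


# Toy data: a majority cluster, a minority cluster, and a mixed region.
X_toy = [
    (0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),     # majority cluster
    (10, 10), (10, 11), (11, 10), (11, 11),         # minority cluster
    (5, 5), (5.2, 5), (5, 5.2), (5.1, 5.1),         # mixed region
]
y_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1]

y3_toy = to_ternary(X_toy, y_toy)
X_clean, y_clean = drop_overlap(X_toy, y3_toy)
print(y3_toy)
print(len(X_clean), "rows kept for classifier training")
```

In the workflow the abstract outlines, a generative model would be trained on the ternary-labeled data, and a filter like `drop_overlap` would then be applied to the synthetic samples so that the classifier sees only majority and minority rows.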

arXiv information

Authors	Annie D’souza, Swetha M, Sunita Sarawagi
Published	2025-02-19 15:36:45+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
