Generating and Imputing Tabular Data via Diffusion and Flow-based Gradient-Boosted Trees

Abstract

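Tabular data is hard to acquire and may contain missing values.
This paper proposes a new approach for generating and imputing mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching.
In contrast to previous work that relies on neural networks as function approximators, we instead use XGBoost, a popular gradient-boosted tree (GBT) method.
Besides being elegant, our method is shown empirically, on a variety of datasets, to i) generate highly realistic synthetic data whether the training dataset is clean or tainted by missing data and ii) produce diverse, plausible imputations.
Our method often outperforms deep-learning generation methods and can be trained in parallel on CPUs without a GPU.
To make it easily accessible, we release our code as a Python library on PyPI and an R package on CRAN.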

Abstract (original)

Tabular data is hard to acquire and is subject to missing values. This paper proposes a novel approach to generate and impute mixed-type (continuous and categorical) tabular data using score-based diffusion and conditional flow matching. Contrary to previous work that relies on neural networks as function approximators, we instead utilize XGBoost, a popular Gradient-Boosted Tree (GBT) method. In addition to being elegant, we empirically show on various datasets that our method i) generates highly realistic synthetic data when the training dataset is either clean or tainted by missing data and ii) generates diverse plausible data imputations. Our method often outperforms deep-learning generation methods and can be trained in parallel using CPUs without the need for a GPU. To make it easily accessible, we release our code through a Python library on PyPI and an R package on CRAN.
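
The abstract's core idea, conditional flow matching with a tree ensemble in place of a neural network, can be illustrated with a small sketch. This is not the authors' code and does not use their released library. It assumes a one-dimensional continuous feature, a linear interpolation path from Gaussian noise (t = 0) to data (t = 1), one regressor per time level, and a hand-rolled boosted-stump regressor standing in for XGBoost so that the example runs with the standard library only. All function names and hyperparameters (`fit_boosted_stumps`, `train_flow`, `generate`, `n_levels`, `n_noise`) are hypothetical.

```python
import math
import random

def fit_boosted_stumps(xs, ys, n_rounds=30, learning_rate=0.3):
    """Least-squares gradient boosting with depth-1 trees on one scalar feature.
    A stand-in for XGBoost, used here only to keep the sketch dependency-free."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])  # xs never change, so sort once
    base = sum(ys) / n
    preds = [base] * n
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        total = sum(residuals)
        # Fallback when no split is possible: both leaves predict the mean residual.
        best_gain, best = -math.inf, (math.inf, total / n, total / n)
        left_sum = 0.0
        for rank in range(n - 1):
            i, j = order[rank], order[rank + 1]
            left_sum += residuals[i]
            if xs[i] == xs[j]:
                continue  # cannot split between identical feature values
            n_left = rank + 1
            right_sum = total - left_sum
            # Maximizing this quantity is equivalent to minimizing squared error.
            gain = left_sum ** 2 / n_left + right_sum ** 2 / (n - n_left)
            if gain > best_gain:
                best_gain = gain
                best = ((xs[i] + xs[j]) / 2, left_sum / n_left, right_sum / (n - n_left))
        thr, left, right = best
        stumps.append(best)
        preds = [p + learning_rate * (left if x <= thr else right)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(l if x <= thr else r for thr, l, r in stumps)

def train_flow(data, n_levels=10, n_noise=10, rng=random):
    """Fit one regressor per time level t_k = k / n_levels (assumed design)."""
    models = []
    for k in range(n_levels):
        t = k / n_levels
        xs, ys = [], []
        for x1 in data:
            for _ in range(n_noise):          # pair each data point with several noise draws
                x0 = rng.gauss(0, 1)
                xs.append((1 - t) * x0 + t * x1)  # point on the noise-to-data path
                ys.append(x1 - x0)                # flow-matching regression target
        models.append(fit_boosted_stumps(xs, ys))
    return models

def generate(models, n_samples, rng=random):
    """Euler-integrate the learned vector field from noise (t=0) to data (t=1)."""
    step = 1 / len(models)
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0, 1)
        for model in models:
            x += step * predict(model, x)
        samples.append(x)
    return samples

rng = random.Random(0)
# Toy bimodal "dataset": values clustered around -2 and +2.
data = [rng.choice((-2.0, 2.0)) + rng.gauss(0, 0.1) for _ in range(100)]
models = train_flow(data, rng=rng)
samples = generate(models, 200, rng=rng)
```

Because each time level has its own independent regressor, the per-level fits could be run in parallel on CPU cores, which is consistent with the abstract's claim about CPU-only parallel training. Mixed-type data and imputation would need additional handling that this sketch does not attempt.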

arXiv information

Authors	Alexia Jolicoeur-Martineau, Kilian Fatras, Tal Kachman
Published	2023-09-18 17:49:09+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
