A Comparison of Machine Learning Methods for Data with High-Cardinality Categorical Variables

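Summary

High-cardinality categorical variables are variables with a large number of distinct levels relative to the sample size of the data set, that is, variables with few data points per level.
Machine learning methods can run into difficulties when such high-cardinality variables are used.
This article empirically compares several versions of tree-boosting and deep neural networks, two of the most successful machine learning methods, with linear mixed effects models on multiple tabular data sets that contain high-cardinality categorical variables.
It finds that, first, machine learning models with random effects achieve higher prediction accuracy than their classical counterparts without random effects and, second, tree-boosting with random effects outperforms deep neural networks with random effects.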

Summary (original)

High-cardinality categorical variables are variables for which the number of different levels is large relative to the sample size of a data set, or in other words, there are few data points per level. Machine learning methods can have difficulties with high-cardinality variables. In this article, we empirically compare several versions of two of the most successful machine learning methods, tree-boosting and deep neural networks, and linear mixed effects models using multiple tabular data sets with high-cardinality categorical variables. We find that, first, machine learning models with random effects have higher prediction accuracy than their classical counterparts without random effects, and, second, tree-boosting with random effects outperforms deep neural networks with random effects.
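To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the paper's code or any library's API. The first function (the hypothetical `cardinality_summary`) measures how thinly the samples are spread over the levels of a categorical variable. The second (the hypothetical `shrunken_level_effects`) shows the textbook random-intercept estimate, which pulls each level's mean toward the global mean, and does so more strongly for levels with few data points. It assumes the ratio of error variance to random-effect variance (`var_ratio`) is known and uses the sample mean as a plug-in for the intercept. Both are simplifications chosen for illustration.

```python
from collections import Counter


def cardinality_summary(levels):
    """Summarise how thinly the samples are spread across the levels."""
    counts = Counter(levels)
    n_samples = len(levels)
    n_levels = len(counts)
    return {
        "n_samples": n_samples,
        "n_levels": n_levels,
        # Few samples per level is what "high cardinality" means here.
        "samples_per_level": n_samples / n_levels,
        # Share of levels that were observed exactly once.
        "singleton_share": sum(1 for c in counts.values() if c == 1) / n_levels,
    }


def shrunken_level_effects(levels, y, var_ratio=1.0):
    """Random-intercept estimate per level, shrunk toward the global mean.

    var_ratio is (error variance) / (random-effect variance); it is
    assumed known here, whereas a real mixed model would estimate it.
    """
    global_mean = sum(y) / len(y)  # plug-in estimate of the intercept
    counts = Counter(levels)
    sums = {}
    for lvl, val in zip(levels, y):
        sums[lvl] = sums.get(lvl, 0.0) + val

    effects = {}
    for lvl, n in counts.items():
        level_mean = sums[lvl] / n
        # Weight is near 0 for rare levels and near 1 for frequent ones.
        weight = n / (n + var_ratio)
        effects[lvl] = weight * (level_mean - global_mean)
    return effects


if __name__ == "__main__":
    levels = ["a", "a", "a", "a", "b"]
    y = [1.0, 1.0, 1.0, 1.0, 6.0]
    print(cardinality_summary(levels))
    print(shrunken_level_effects(levels, y, var_ratio=1.0))
```

In the toy data, level `b` is seen only once, so its raw deviation from the global mean is halved, while level `a`, seen four times, keeps 80% of its deviation. A one-hot or per-level-mean encoding would trust both equally, which is one intuition for why random effects can help when there are few data points per level.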

arXiv information

Author	Fabio Sigrist
Published	2023-07-05 07:26:27+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
