Distributionally robust self-supervised learning for tabular data

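Summary

Machine learning (ML) models trained with Empirical Risk Minimization (ERM) often make systematic errors on specific subpopulations of tabular data, known as error slices.
Learning robust representations in the presence of error slices is difficult, particularly in self-supervised settings during the feature reconstruction phase, because of high-cardinality features and the complexity of constructing error sets.
Traditional robust representation learning methods focus mainly on improving worst-group performance in supervised computer vision settings, which leaves a gap in approaches tailored to tabular data.
We address this gap by developing a framework that learns robust representations of tabular data during self-supervised pre-training.
Our approach uses an encoder-decoder model trained with a Masked Language Modeling (MLM) loss to learn robust latent representations.
This paper applies the Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during the pre-training phase for tabular data.
These methods fine-tune the ERM pre-trained model by up-weighting error-prone samples or by creating balanced datasets for specific categorical features.
This yields a specialized model for each feature; these models are then combined in an ensemble to improve downstream classification performance.
The methodology improves robustness across slices and thereby overall generalization performance.
Extensive experiments across a range of datasets demonstrate the effectiveness of the approach.
The code is available at https://github.com/amazon-science/distributionally-robust-self-supervised-learning-for-tabular-data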
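
The up-weighting idea behind JTT can be illustrated with a small sketch. This is not the authors' implementation: the function names, the toy feature values, and the factor `lambda_up` are all assumptions made for illustration. The sketch assumes a first-stage ERM model has already tried to reconstruct one masked categorical feature, and it gives a larger weight to every sample that model got wrong (the error set).

```python
from typing import List, Sequence


def jtt_sample_weights(
    true_values: Sequence[str],
    stage1_predictions: Sequence[str],
    lambda_up: float = 5.0,  # hypothetical up-weighting factor, not from the paper
) -> List[float]:
    """Return one weight per sample: lambda_up if stage 1 was wrong, else 1.0."""
    if len(true_values) != len(stage1_predictions):
        raise ValueError("inputs must have the same length")
    return [
        lambda_up if truth != pred else 1.0
        for truth, pred in zip(true_values, stage1_predictions)
    ]


def weighted_mean_loss(losses: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of per-sample losses (one possible second-stage objective)."""
    total_weight = sum(weights)
    return sum(loss * w for loss, w in zip(losses, weights)) / total_weight


# Toy data: reconstructing a masked categorical feature (values are made up).
truth = ["red", "blue", "blue", "green"]
stage1 = ["red", "red", "blue", "green"]  # the second sample lands in the error set
weights = jtt_sample_weights(truth, stage1, lambda_up=5.0)
loss = weighted_mean_loss([0.0, 1.0, 0.0, 0.0], weights)
```

In a real pipeline the weights would feed the reconstruction loss of the second training pass; here they are applied to made-up per-sample losses only to show the effect of the up-weighting.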

Summary (original)

Machine learning (ML) models trained using Empirical Risk Minimization (ERM) often exhibit systematic errors on specific subpopulations of tabular data, known as error slices. Learning robust representation in presence of error slices is challenging, especially in self-supervised settings during the feature reconstruction phase, due to high cardinality features and the complexity of constructing error sets. Traditional robust representation learning methods are largely focused on improving worst group performance in supervised setting in computer vision, leaving a gap in approaches tailored for tabular data. We address this gap by developing a framework to learn robust representation in tabular data during self-supervised pre-training. Our approach utilizes an encoder-decoder model trained with Masked Language Modeling (MLM) loss to learn robust latent representations. This paper applies the Just Train Twice (JTT) and Deep Feature Reweighting (DFR) methods during the pre-training phase for tabular data. These methods fine-tune the ERM pre-trained model by up-weighting error-prone samples or creating balanced datasets for specific categorical features. This results in specialized models for each feature, which are then used in an ensemble approach to enhance downstream classification performance. This methodology improves robustness across slices, thus enhancing overall generalization performance. Extensive experiments across various datasets demonstrate the efficacy of our approach. The code is available: \url{https://github.com/amazon-science/distributionally-robust-self-supervised-learning-for-tabular-data}.
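
The phrase "creating balanced datasets for specific categorical features" can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's DFR procedure: the helper `balanced_subset`, the row format (a list of dictionaries), and the choice to subsample every category down to the size of the rarest one are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List


def balanced_subset(rows: List[dict], feature: str, seed: int = 0) -> List[dict]:
    """Subsample rows so that every value of `feature` appears equally often."""
    rng = random.Random(seed)
    groups: Dict[Hashable, List[dict]] = defaultdict(list)
    for row in rows:
        groups[row[feature]].append(row)
    # Every category is cut down to the size of the rarest category.
    smallest = min(len(group_rows) for group_rows in groups.values())
    balanced: List[dict] = []
    for group_rows in groups.values():
        balanced.extend(rng.sample(group_rows, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: six "red" rows and two "blue" rows (values are made up).
rows = [{"color": "red", "size": i} for i in range(6)]
rows += [{"color": "blue", "size": i} for i in range(2)]
subset = balanced_subset(rows, "color")
```

A model fine-tuned on such a subset sees each category of the chosen feature equally often, which is one plausible reading of how a feature-specific model could be obtained before ensembling.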

arXiv information

Authors	Shantanu Ghosh, Tiankang Xie, Mikhail Kuznetsov
Published	2024-12-04 17:10:06+00:00
arXiv site	arxiv_id(pdf)

Sources and services used

arxiv.jp, Google
