Selecting Robust Features for Machine Learning Applications using Multidata Causal Discovery

Summary

Title: Selecting Robust Features for Machine Learning Applications using Multidata Causal Discovery

Summary:

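– Robust feature selection is vital for reliable and interpretable machine learning models, yet choosing the optimal set of features is difficult when domain knowledge is limited and the underlying drivers are unknown.
– This study introduces a Multidata (M) causal feature selection approach that simultaneously processes an ensemble of time series datasets and produces a single set of causal drivers.
– The approach uses the causal discovery algorithms PC1 or PCMCI, which rely on conditional independence tests to infer parts of the causal graph.
– Causally spurious links are filtered out, and the remaining causal features are passed as inputs to machine learning models such as multiple linear regression and Random Forest.
– Applied to the statistical intensity prediction of Western Pacific tropical cyclones, the approach outperforms other feature selection methods and yields causal drivers.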

Summary (original)

Robust feature selection is vital for creating reliable and interpretable Machine Learning (ML) models. When designing statistical prediction models in cases where domain knowledge is limited and underlying interactions are unknown, choosing the optimal set of features is often difficult. To mitigate this issue, we introduce a Multidata (M) causal feature selection approach that simultaneously processes an ensemble of time series datasets and produces a single set of causal drivers. This approach uses the causal discovery algorithms PC1 or PCMCI that are implemented in the Tigramite Python package. These algorithms utilize conditional independence tests to infer parts of the causal graph. Our causal feature selection approach filters out causally-spurious links before passing the remaining causal features as inputs to ML models (Multiple linear regression, Random Forest) that predict the targets. We apply our framework to the statistical intensity prediction of Western Pacific Tropical Cyclones (TC), for which it is often difficult to accurately choose drivers and their dimensionality reduction (time lags, vertical levels, and area-averaging). Using more stringent significance thresholds in the conditional independence tests helps eliminate spurious causal relationships, thus helping the ML model generalize better to unseen TC cases. M-PC1 with a reduced number of features outperforms M-PCMCI, non-causal ML, and other feature selection methods (lagged correlation, random), even slightly outperforming feature selection based on eXplainable Artificial Intelligence. The optimal causal drivers obtained from our causal feature selection help improve our understanding of underlying relationships and suggest new potential drivers of TC intensification.

arXiv information

Authors: Saranya Ganesh S., Tom Beucler, Frederick Iat-Hin Tam, Milton S. Gomez, Jakob Runge, Andreas Gerhardus
Published: 2023-04-12 10:24:40+00:00
arXiv page: arxiv_id(pdf)

Source and services used

arxiv.jp, OpenAI

Categories: cs.LG, physics.ao-ph, physics.comp-ph, stat.ML