Why You Should Not Trust Interpretations in Machine Learning: Adversarial Attacks on Partial Dependence Plots

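Summary

The adoption of artificial intelligence (AI) across industries has made complex black-box models, and the interpretation tools used alongside them, a common part of decision making.
This paper proposes an adversarial framework that exposes the vulnerability of permutation-based interpretation methods for machine learning tasks, focusing on partial dependence (PD) plots.
The framework modifies the original black-box model so that its predictions are manipulated for instances in the extrapolation domain.
The result is a deceptive PD plot that can conceal discriminatory behavior while preserving most of the original model's predictions.
A single modified model can produce multiple fooled PD plots.
Using real-world datasets, including an auto insurance claims dataset and the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, the results show that a predictor's discriminatory behavior can be intentionally hidden, so that the black-box model appears neutral under interpretation tools such as PD plots while retaining almost all of the original model's predictions.
Based on these findings, the paper provides managerial insights for regulators and practitioners.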

Abstract (original)

The adoption of artificial intelligence (AI) across industries has led to the widespread use of complex black-box models and interpretation tools for decision making. This paper proposes an adversarial framework to uncover the vulnerability of permutation-based interpretation methods for machine learning tasks, with a particular focus on partial dependence (PD) plots. This adversarial framework modifies the original black box model to manipulate its predictions for instances in the extrapolation domain. As a result, it produces deceptive PD plots that can conceal discriminatory behaviors while preserving most of the original model’s predictions. This framework can produce multiple fooled PD plots via a single model. By using real-world datasets including an auto insurance claims dataset and COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) dataset, our results show that it is possible to intentionally hide the discriminatory behavior of a predictor and make the black-box model appear neutral through interpretation tools like PD plots while retaining almost all the predictions of the original black-box model. Managerial insights for regulators and practitioners are provided based on the findings.
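The mechanism described in the abstract can be illustrated with a toy sketch. This is not the authors' framework: the data, the `original_model`, the `in_data_domain` detector, and the `neutral` value are all hypothetical and chosen only to show why a PD plot, which averages predictions over synthetic points that often fall outside the observed data, can be flattened without changing predictions on the observed data.

```python
import random

random.seed(0)

# Toy data (assumption): x2 is almost a copy of x1, so every observed
# point lies close to the diagonal x1 == x2.
data = []
for _ in range(500):
    x1 = random.random()
    x2 = x1 + random.uniform(-0.05, 0.05)
    data.append((x1, x2))


def original_model(x1, x2):
    """Hypothetical black box whose output depends directly on x1."""
    return x1


def in_data_domain(x1, x2, tol=0.1):
    """Crude stand-in for a detector of realistic (non-extrapolated) inputs."""
    return abs(x1 - x2) <= tol


def adversarial_model(x1, x2, neutral=0.5):
    """Matches the original on realistic inputs, returns a constant elsewhere."""
    if in_data_domain(x1, x2):
        return original_model(x1, x2)
    return neutral


def partial_dependence(model, grid):
    """PD of x1: fix x1 at each grid value and average over the observed x2."""
    return [sum(model(v, x2) for _, x2 in data) / len(data) for v in grid]


grid = [i / 10 for i in range(11)]
pd_original = partial_dependence(original_model, grid)
pd_adversarial = partial_dependence(adversarial_model, grid)

# Share of observed points on which the two models give the same prediction.
agreement = sum(
    original_model(x1, x2) == adversarial_model(x1, x2) for x1, x2 in data
) / len(data)

range_original = max(pd_original) - min(pd_original)
range_adversarial = max(pd_adversarial) - min(pd_adversarial)

print(f"agreement on observed data:  {agreement:.2f}")
print(f"PD range, original model:    {range_original:.2f}")
print(f"PD range, adversarial model: {range_adversarial:.2f}")
```

In this sketch the two models agree on every observed point, yet the PD curve of the modified model is much flatter, because most of the points the PD procedure evaluates pair a fixed `x1` with an `x2` that never occurs alongside it in the data.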

arXiv information

Authors	Xi Xin, Fei Huang, Giles Hooker
Published	2024-04-29 13:51:41+00:00
arXiv page	arxiv_id(pdf)

Source and services used

arxiv.jp, Google
