Expectation Distance-based Distributional Clustering for Noise-Robustness

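Summary

This paper introduces a clustering method that is less susceptible to data noise: it learns the distribution underlying each data item, clusters those distributions, and then assigns each item to the cluster of its distribution.
Doing so reduces the impact of noise on the clustering result.
The method relies on a new distance between distributions, the expectation distance (denoted ED), which goes beyond the state-of-the-art optimal mass transport distance (the $2$-Wasserstein distance, denoted $W_2$).
$W_2$ essentially depends only on the marginal distributions, whereas ED also uses information about the joint distribution.
Using ED, the paper extends classical $K$-means and $K$-medoids clustering from raw data to data distributions, and it also introduces $K$-medoids using $W_2$.
It also gives closed-form expressions for the $W_2$ and ED distance measures.
The paper presents implementation results for the proposed ED and for $W_2$ on clustering real-world weather data and stock data, which involves efficiently extracting and using the underlying data distributions (Gaussian for the weather data, lognormal for the stock data).
The results show a marked performance improvement over classical clustering of raw data, with ED achieving the higher accuracy.
Distribution-based clustering is not only more accurate but also faster, because its time complexity is lower.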
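
The contrast between the two distances can be made concrete for one-dimensional Gaussians. The sketch below uses the standard closed form of $W_2$ between two Gaussians. The `expectation_distance_1d` function is only an assumed illustration: it takes ED to be the root mean squared difference $\sqrt{E[(X-Y)^2]}$ under a joint Gaussian with correlation `rho`. The summary does not state the paper's definition or closed form, so that form, the function names, and the `rho` parameter are hypothetical.

```python
import math


def w2_gaussian_1d(m1, s1, m2, s2):
    """2-Wasserstein distance between N(m1, s1^2) and N(m2, s2^2).

    Standard closed form in one dimension: it uses only the two
    marginals (means and standard deviations).
    """
    return math.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)


def expectation_distance_1d(m1, s1, m2, s2, rho=0.0):
    """ASSUMED illustrative form, not taken from the paper.

    Returns sqrt(E[(X - Y)^2]) for jointly Gaussian X and Y with
    correlation rho, so the value depends on the joint distribution.
    """
    second_moment = (m1 - m2) ** 2 + s1 ** 2 + s2 ** 2 - 2.0 * rho * s1 * s2
    # Guard against tiny negative values caused by rounding.
    return math.sqrt(max(second_moment, 0.0))


w2_value = w2_gaussian_1d(0.0, 1.0, 3.0, 5.0)
ed_independent = expectation_distance_1d(0.0, 1.0, 3.0, 5.0, rho=0.0)
ed_comonotone = expectation_distance_1d(0.0, 1.0, 3.0, 5.0, rho=1.0)
```

Under this assumed form, `rho = 1` reproduces the $W_2$ value, and any weaker dependence gives a larger distance, which is one way information about the joint distribution can enter.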

Abstract (original)

This paper presents a clustering technique that reduces the susceptibility to data noise by learning and clustering the data-distribution and then assigning the data to the cluster of its distribution. In the process, it reduces the impact of noise on clustering results. This method involves introducing a new distance among distributions, namely the expectation distance (denoted, ED), that goes beyond the state-of-art distribution distance of optimal mass transport (denoted, $W_2$ for $2$-Wasserstein): The latter essentially depends only on the marginal distributions while the former also employs the information about the joint distributions. Using the ED, the paper extends the classical $K$-means and $K$-medoids clustering to those over data-distributions (rather than raw-data) and introduces $K$-medoids using $W_2$. The paper also presents the closed-form expressions of the $W_2$ and ED distance measures. The implementation results of the proposed ED and the $W_2$ distance measures to cluster real-world weather data as well as stock data are also presented, which involves efficiently extracting and using the underlying data distributions — Gaussians for weather data versus lognormals for stock data. The results show striking performance improvement over classical clustering of raw-data, with higher accuracy realized for ED. Also, not only does the distribution-based clustering offer higher accuracy, but it also lowers the computation time due to reduced time-complexity.
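
As a rough illustration of clustering distributions rather than raw data, the sketch below fits a Gaussian to each series, measures pairwise distances with the one-dimensional Gaussian $W_2$ closed form, and runs a plain alternating $K$-medoids loop. The helper names, the toy series, and the first-$k$ initialisation are assumptions made for this example and are not taken from the paper.

```python
import math
import statistics


def fit_gaussian(samples):
    """Summarise a series by its (mean, population std) Gaussian fit."""
    return statistics.fmean(samples), statistics.pstdev(samples)


def w2(p, q):
    """W_2 between two 1-D Gaussians given as (mean, std) pairs."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def k_medoids(items, k, dist, max_iter=100):
    """Alternating K-medoids on a precomputed pairwise distance matrix."""
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))  # hypothetical choice: first k items as seeds

    def assign():
        # Each item joins the cluster of its nearest medoid.
        return [min(range(k), key=lambda c: d[i][medoids[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign()
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep an empty cluster's medoid
                continue
            # New medoid: the member with the smallest total distance to the rest.
            new_medoids.append(
                min(members, key=lambda m: sum(d[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign()


# Toy series (made up): two groups with clearly different means.
series = [
    [0.0, 1.0, 2.0],
    [10.0, 11.0, 12.0],
    [0.5, 1.5, 2.5],
    [10.5, 11.5, 12.5],
]
gaussians = [fit_gaussian(s) for s in series]
medoids, labels = k_medoids(gaussians, k=2, dist=w2)
```

Because each series is reduced to two parameters before clustering, the distance matrix is built from these summaries instead of from all raw samples.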

arXiv information

Authors	Rahmat Adesunkanmi, Ratnesh Kumar
Published	2023-03-14 17:50:47+00:00
arXiv site	arxiv_id(pdf)

Source and services used

arxiv.jp, Google