Distributed Silhouette Algorithm: Evaluating Clustering on Big Data

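Abstract

In the big data era, a key capability that every algorithm needs is the ability to run efficiently in parallel in a distributed environment.
Unfortunately, the popular Silhouette metric for evaluating clustering quality lacks this property and has quadratic computational complexity with respect to the size of the input dataset.
This has hindered its use in big data scenarios, where clusterings have had to be evaluated by other means.
To fill this gap, this paper introduces the first algorithm that computes the Silhouette metric with linear complexity and can easily be executed in parallel in a distributed environment.
Its implementation is freely available in the Apache Spark ML library.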

Abstract (original)

In the big data era, the key feature that each algorithm needs to have is the possibility of efficiently running in parallel in a distributed environment. The popular Silhouette metric to evaluate the quality of a clustering, unfortunately, does not have this property and has a quadratic computational complexity with respect to the size of the input dataset. For this reason, its execution has been hindered in big data scenarios, where clustering had to be evaluated otherwise. To fill this gap, in this paper we introduce the first algorithm that computes the Silhouette metric with linear complexity and can easily execute in parallel in a distributed environment. Its implementation is freely available in the Apache Spark ML library.
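The abstract does not spell out how linear complexity is reached, so the following is only a minimal sketch of one way to do it, under the assumption that the distance is the squared Euclidean distance. With that distance, the sum of distances from a point x to a cluster with N points, vector sum Y, and sum of squared norms Ψ equals N·‖x‖² + Ψ − 2·x·Y, so each point only needs per-cluster aggregates rather than every other point. The function names (`cluster_stats`, `silhouette_linear`, `silhouette_bruteforce`) are hypothetical and are not taken from the paper or from Spark.

```python
from collections import defaultdict


def cluster_stats(points, labels):
    """One pass over the data: per-cluster count, vector sum, sum of squared norms."""
    dim = len(points[0])
    stats = defaultdict(lambda: [0, [0.0] * dim, 0.0])
    for x, c in zip(points, labels):
        s = stats[c]
        s[0] += 1                              # N: number of points in the cluster
        for j, v in enumerate(x):
            s[1][j] += v                       # Y: component-wise sum of the points
        s[2] += sum(v * v for v in x)          # Psi: sum of squared norms
    return dict(stats)


def _dist_sum(x, sq_norm, stat):
    """Sum of squared Euclidean distances from x to all points of one cluster."""
    n, y, psi = stat
    return n * sq_norm + psi - 2.0 * sum(a * b for a, b in zip(x, y))


def silhouette_linear(points, labels):
    """Mean silhouette in O(N * K * D) time (assumption: squared Euclidean distance)."""
    stats = cluster_stats(points, labels)
    if len(stats) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for x, c in zip(points, labels):
        n_own = stats[c][0]
        if n_own == 1:
            continue                           # singleton cluster: score defined as 0
        sq_norm = sum(v * v for v in x)
        # a: mean distance to the other points of the own cluster (self-distance is 0)
        a = _dist_sum(x, sq_norm, stats[c]) / (n_own - 1)
        # b: smallest mean distance to any other cluster
        b = min(_dist_sum(x, sq_norm, s) / s[0]
                for k, s in stats.items() if k != c)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)


def silhouette_bruteforce(points, labels):
    """Quadratic reference implementation, used only to check the sketch above."""
    def d(p, q):
        return sum((u - v) ** 2 for u, v in zip(p, q))
    total = 0.0
    for i, (x, c) in enumerate(zip(points, labels)):
        sums, counts = defaultdict(float), defaultdict(int)
        for j, (y, k) in enumerate(zip(points, labels)):
            if i != j:
                sums[k] += d(x, y)
                counts[k] += 1
        if counts[c] == 0:
            continue                           # singleton cluster: score defined as 0
        a = sums[c] / counts[c]
        b = min(sums[k] / counts[k] for k in counts if k != c)
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(points)
```

Because the per-cluster aggregates are plain sums, they can be computed independently on each data partition and then merged, which is what makes this formulation a natural fit for a distributed engine. In Spark itself, the metric is exposed through `ClusteringEvaluator` in the ML library rather than through functions like the ones sketched here.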

arXiv information

Author	Marco Gaido
Published	2023-03-24 16:10:43+00:00
arXiv site	arxiv_id(pdf)

Source, services used

arxiv.jp, Google
