Evaluation of stability of k-means cluster ensembles with respect to random initialization

IEEE Transactions on Pattern Analysis and Machine Intelligence 28(11): 1798-1808

Many clustering algorithms, including cluster ensembles, rely on a random component. Stability of the results across different runs is considered to be an asset of the algorithm. The cluster ensembles considered here are based on k-means clusterers. Each clusterer is assigned a random target number of clusters, k, and is started from a random initialization. Here, we use 10 artificial and 10 real data sets to study ensemble stability with respect to random k and random initialization. The data sets were chosen to have a small number of clusters (two to seven) and a moderate number of data points (up to a few hundred). Pairwise stability is defined as the adjusted Rand index between pairs of clusterers in the ensemble, averaged across all pairs. Nonpairwise stability is defined as the entropy of the consensus matrix of the ensemble. An experimental comparison with the stability of the standard k-means algorithm was carried out for k from 2 to 20. The results revealed that ensembles are generally more stable, markedly so for larger k. To establish whether stability can serve as a cluster validity index, we first looked at the relationship between stability and accuracy with respect to the number of clusters, k. We found that such a relationship strongly depends on the data set, varying from almost perfect positive correlation (0.97, for the glass data) to almost perfect negative correlation (-0.93, for the crabs data). We propose a new combined stability index, defined as the sum of the pairwise individual and ensemble stabilities. This index was found to correlate better with the ensemble accuracy. Following the hypothesis that a point of stability of a clustering algorithm corresponds to a structure found in the data, we used the stability measures to pick the number of clusters. The combined stability index gave the best results.
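The two stability measures named in the abstract can be illustrated with a minimal sketch in plain Python. The adjusted Rand index follows its standard definition. The entropy of the consensus matrix is an assumption here: the abstract does not give the exact formula, so the sketch uses the mean binary entropy of the off-diagonal co-association entries, which may differ from the paper's normalization. All function names are hypothetical, and each partition is assumed to be a list of cluster labels over the same points (at least two points, at least two partitions).

```python
from collections import Counter
from itertools import combinations
from math import comb, log2


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label sequences of equal length."""
    total = comb(len(a), 2)
    # Pairs of points placed together in both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / total
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def pairwise_stability(partitions):
    """Adjusted Rand index averaged across all pairs of partitions."""
    pairs = list(combinations(partitions, 2))
    return sum(adjusted_rand_index(p, q) for p, q in pairs) / len(pairs)


def consensus_matrix(partitions):
    """Entry (i, j) is the fraction of partitions that co-cluster i and j."""
    n = len(partitions[0])
    runs = len(partitions)
    return [[sum(p[i] == p[j] for p in partitions) / runs for j in range(n)]
            for i in range(n)]


def consensus_entropy(matrix):
    """ASSUMED form: mean binary entropy of the off-diagonal entries."""
    n = len(matrix)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            m = matrix[i][j]
            if 0.0 < m < 1.0:  # entries of exactly 0 or 1 contribute nothing
                total += -m * log2(m) - (1 - m) * log2(1 - m)
            count += 1
    return total / count
```

Under this assumed form, an entropy of zero means every pair of points is either always or never clustered together, so lower entropy indicates a more stable ensemble, while a higher averaged adjusted Rand index indicates higher pairwise stability.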

Accession: 048999513

PMID: 17063684

DOI: 10.1109/TPAMI.2006.226
