A New Approach to Define the Number of Clusters for Partitional Clustering Algorithms
محورهای موضوعی : Transactions on Fuzzy Sets and SystemsHuliane Silva 1 , Benjamın Ren Callejas Bedregal 2 , Anne Canuto 3 , Thiago Batista 4 , Ronildo Moura 5
1 - Departamento de Engenharias e Tecnologias, Universidade Federal Rural do Semi-rido, Pau dos Ferros, Brasil.
2 - Departamento de Inform´atica e Matem´atica Aplicada, Universidade Federal do Rio Grande do Norte, Natal, Brasil.
3 - Departamento de Inform´atica e Matem´atica Aplicada, Universidade Federal do Rio Grande do Norte, Natal, Brasil.
4 - Departamento de Inform´atica e Matem´atica Aplicada, Universidade Federal do Rio Grande do Norte, Natal, Brasil.
5 - Departamento de Inform´atica e Matem´atica Aplicada, Universidade Federal do Rio Grande do Norte, Natal, Brasil.
کلید واژه: Partitional clustering algorithms, Clustering fuzzy, Number of cluster,
چکیده مقاله :
Data clustering consists of grouping similar objects according to some characteristic. In the literature, there are several clustering algorithms, among which stands out the Fuzzy C-Means (FCM), one of the most discussed algorithms, being used in different applications. Although it is a simple and easy to manipulate clustering method, the FCM requires as its initial parameter the number of clusters. Usually, this information is unknown, beforehand and this becomes a relevant problem in the data cluster analysis process. In this context, this work proposes a new methodology to determine the number of clusters of partitional algorithms, using subsets of the original data in order to define the number of clusters. This new methodology, is intended to reduce the side effects of the cluster definition phase, possibly making the processing time faster and decreasing the computational cost. To evaluate the proposed methodology, different cluster validation indices will be used to evaluate the quality of the clusters obtained by the FCM algorithms and some of its variants, when applied to different databases. Through the empirical analysis, we can conclude that the results obtained in this article are promising, both from an experimental point of view and from a statistical point of view.
Data clustering consists of grouping similar objects according to some characteristic. In the literature, there are several clustering algorithms, among which stands out the Fuzzy C-Means (FCM), one of the most discussed algorithms, being used in different applications. Although it is a simple and easy to manipulate clustering method, the FCM requires as its initial parameter the number of clusters. Usually, this information is unknown, beforehand and this becomes a relevant problem in the data cluster analysis process. In this context, this work proposes a new methodology to determine the number of clusters of partitional algorithms, using subsets of the original data in order to define the number of clusters. This new methodology, is intended to reduce the side effects of the cluster definition phase, possibly making the processing time faster and decreasing the computational cost. To evaluate the proposed methodology, different cluster validation indices will be used to evaluate the quality of the clusters obtained by the FCM algorithms and some of its variants, when applied to different databases. Through the empirical analysis, we can conclude that the results obtained in this article are promising, both from an experimental point of view and from a statistical point of view.
[1] Carvalho A, Faceli K, Lorena A, Gama J. Inteligência Artificial–uma Abordagem de Aprendizado de Máquina. Rio de Janeiro: LTC; 2011.
[2] Hair JF, Black WC, Babin BJ, Anderson RE, Tatham RL. Análise Multivariada de Dados. Porto Alegre: Bookman; 2009.
[3] Bong CW, Rajeswari M. Multi-objective nature-inspired clustering and classification techniques for image segmentation. Applied soft computing. 2011; 11(4): 3271-3282. DOI: https://doi.org/10.1016/j.asoc.2011.01.014
[4] Sun H, Wang S, Jiang Q. FCM-based model selection algorithms for determining the number of clusters. Pattern recognition. 2004; 37(10): 2027-2037. DOI: https://doi.org/10.1016/j.patcog.2004.03.012
[5] Sentelle C, Hong SL, Anagnostopoulos GC, Georgiopoulos M. A Fuzzy GAP statistic for Fuzzy c-means. In: International Conference Artificial Intelligence and Soft Computing. 2007.
[6] Kodinariya TM, Makwana PR. Review on determining number of cluster in K-means clustering. International Journal. 2013; 1(6): 90-95. DOI:10.18576/amis/100428
[7] Xu S, Hu L, Yang X, Liu X. A cluster number adaptive Fuzzy c-means algorithm for image segmentation. International Journal of Signal Processing, Image Processing and Pattern Recognition. 2013; 6(5): 191- 204. DOI: http://dx.doi.org/10.14257/ijsip.2013.6.5.17
[8] Ren M, Liu P, Wang Z, Yi J. A self-adaptive Fuzzy c-means algorithm for determining the optimal number of clusters. Computational intelligence and neuroscience. 2016; 2016(3): 1-12. DOI: https://doi.org/10.1155/2016/2647389
[9] Ünlü R, Xanthopoulos P. Estimating the number of clusters in a dataset via consensus clustering. Expert Systems with Applications. 2019; 125: 33-39. DOI: https://doi.org/10.1016/j.eswa.2019.01.074
[10] Jain AK, Murty MN, Flynn PJ. Data clustering: A review.ACM computing surveys (CSUR). 1999; 31(3): 264-323. DOI: https://doi.org/10.1145/331499.331504
[11] Wang W, Zhang Y. On Fuzzy cluster validity indices. Fuzzy sets and systems. 2007; 158(19): 2095-2117. DOI: https://doi.org/10.1016/j.fss.2007.03.004
[12] Xing R, Li C. Fuzzy c-means algorithm automatically determining optimal number of clusters. Computers, Materials and Continua. 2019; 60(2): 767-780. DOI: https://doi.org/10.32604/cmc.2019.04500
[13] Yejun X. Optimization of the clusters number of an improved fuzzy C-means clustering algorithm. In: 2015 10th International Conference on Computer Science & Education (ICCSE). 2015. p.931-935. DOI: 10.1109/ICCSE.2015.7250383
[14] Dinh DT, Fujinami T, Huynh VN. Estimating the optimal number of clusters in categorical data clustering by silhouette coefficient. In: Knowledge and Systems Sciences. 2019. p.1-17. DOI: https://doi.org/10.1007/978-981-15-1209-4_1
[15] Liao HY, Ng MK. Categorical data clustering with automatic selection of cluster number. Fuzzy Information and Engineering. 2009; 1(1): 5-25. DOI: https://doi.org/10.1007/s12543-009-0001-5
[16] Havens TC, Bezdek JC, Palaniswami M. Cluster validity for kernel Fuzzy clustering. In: 2012 IEEE International Conference on Fuzzy Systems. 2012. p.1-8. DOI: https://doi.org/10.1109/FUZZIEEE.2012.6250820
[17] Markelle K, Rachel L, Kolby N. The UCI Machine Learning Repository, https://archive.ics.uci.edu
[18] Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS. Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In: Proceedings of the annual symposium on computer application in medical care. 1988. p.261.
[19] Bezdek JC. Pattern Recognition with Fuzzy Objective Function Algorithms. Norwell, MA, USA: Kluwer Academic Publishers; 1981.
[20] Dunn JC. A Fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters. Journal of Cybernetics. 1973; 3(3): 32-57. DOI: https://doi.org/10.1080/01969727308546046
[21] Bedregal BR. A comparative study between fuzzy c-means and ckmeans algorithms. In: 2010 Annual Meeting of the North American Fuzzy Information Processing Society. 2010. p.1-6. DOI: https://www.doi.org/10.1109/NAFIPS.2010.5548194
[22] Tsai DM, Lin CC. Fuzzy C-means based clustering for linearly and nonlinearly separable data. Pattern recognition. 2011; 44(8): 1750-1760. DOI: https://doi.org/10.1016/j.patcog.2011.02.009
[23] Bezdek JC, Ehrlich R, Full W. FCM: The fuzzy c-means clustering algorithm. Computers & geosciences. 1984; 10(2-3): 191-203. DOI: https://doi.org/10.1016/0098-3004(84)90020-7
[24] Vargas RR, Bedregal BR, Palmeira ES. A comparison between k-means, fcm and ckmeans algorithms. In: 2011 Workshop-School on Theoretical Computer Science. 2011. p.32-38. DOI: https://doi.org/10.1109/WEIT.2011.28
[25] Vargas RR, Freddo R, Galafassi C, Gass SL, Russini A, Bedregal B. Identifying pixels classified uncertainties ckmeansimage algorithm. In: Information Processing and Management of Uncertainty in Knowledge-Based Systems. 2018. p.429-440. DOI:10.1007/978-3-319-91479-4_36
[26] Jain AK, Dubes RC. Algorithms for Clustering Data. Prentice-Hall; 1988.
[27] Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques. Journal of intelligent information systems. 2001; 17(2-3): 107-145. DOI: https://doi.org/10.1023/A:1012801612483
[28] Peres SM, Rocha T, Bíscaro HH, Madeo RC, Boscarioli C. Tutorial sobre Fuzzy-c-means e Fuzzy learning vector Quantization: Abordagens híbridas para tarefas de agrupamento e classificação. Revista de Informática Teórica e Aplicada. 2012; 19(1): 120-163. DOI: https://doi.org/10.22456/2175-2745.13764
[29] Dave RN. Validating Fuzzy partitions obtained through c-shells clustering. Pattern recognition letters. 1996; 17(6): 613-623. DOI: https://doi.org/10.1016/0167-8655(96)00026-8
[30] Fukuyama Y, Sugeno M. A new method of choosing the number of clusters for the Fuzzy c-mean method. The 5th Fuzzy Syst Symp. 1989; 247: 247-250.
[31] Bezdek JC, Pal NR. Some new indexes of cluster validity. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics). 1998; 28(3): 301-315. DOI: https://doi.org/10.1109/3477.678624
[32] da Silva LR, Arnaldo HA, da Silva H, Moura R, Bedregal B, de P Canuto AM. Two deterministic selection methods for the initial centers in Fuzzy c-means based algorithms. Intelligent Data Analysis. 2020; 24(4): 779-798. DOI: https://doi.org/10.3233/IDA-194588
[33] Silva L, Moura R, Canuto AM, Santiago RH, Bedregal B. An interval-based framework for ffuzzy clustering applications. IEEE Transactions on Fuzzy Systems. 2015; 23(6): 2174-2187. DOI: https://doi.org/10.1109/TFUZZ.2015.2407901
[34] Silva HM, Canuto AM, Medeiros IG, Xavier-Júnior JC. Cluster ensembles optimization using coral reefs optimization algorithm. In: Artificial Neural Networks and Machine Learning–ICANN 2016: 25th International Conference on Artificial Neural Networks. 2016; p.275-282. DOI: https://doi.org/10.1007/978- 3-319-44781-0_33