Scalable Fuzzy Decision Tree Induction Using Fast Data Partitioning and Incremental Approach for Large Dataset
Subject Areas: Data Mining
Somayeh Lotfi 1, Mohammad Ghasemzadeh 2, Mehran Mohsenzadeh 3, Mitra Mirzarezaee 4
1 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
2 - Computer Engineering Department, Yazd University, Yazd, Iran
3 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
4 - Department of Computer Engineering, Science and Research Branch, Islamic Azad University, Daneshgah Blvd., Simon Bolivar Blvd., P.O. Box 14515/775, Tehran, Iran
Keywords: Large dataset, Fuzzy entropy, Fuzzy decision tree, Fuzzy partitioning
Abstract:
The decision tree is a popular method for learning and reasoning through recursive partitioning of the data space. To choose the best attribute in the case of numerical features, the partitioning criterion must either be calculated for individual values, or the value range of each attribute must be divided into two or more intervals using a set of cut points. When partitioning the range of an attribute, fuzzy partitioning can be used to reduce sensitivity to noise in the data and to increase the stability of the decision tree. Since tree-building algorithms need to keep the whole training dataset in main memory, they face memory restrictions. In this paper, we present an algorithm that builds a fuzzy decision tree on large datasets. To avoid storing the entire training dataset in main memory and thus overcome this limitation, the algorithm builds the tree incrementally. In the discretization stage, a fuzzy partition is generated on each continuous attribute based on fuzzy entropy. Then, to select the best attribute for branching, two criteria are used: fuzzy information gain and an occurrence matrix. Real datasets are used to evaluate the behavior of the algorithm in terms of classification accuracy, decision tree complexity, and execution time. The results show that the proposed algorithm overcomes the memory limitation without needing to store the entire dataset in memory, reduces the complexity of the tree, and strikes a balance between accuracy and complexity.
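To make the attribute-selection step concrete, below is a minimal Python sketch of fuzzy entropy and fuzzy information gain computed over a fuzzy partition of one numerical attribute. It assumes the standard fuzzy-ID3 definitions (class proportions weighted by membership degrees, fuzzy cardinalities as split weights) and a toy two-set triangular partition; the paper's occurrence-matrix criterion and the incremental tree construction are not reproduced here.

```python
import numpy as np

def fuzzy_entropy(memberships, labels, classes):
    """Fuzzy entropy of one fuzzy set: class proportions are
    weighted by membership degrees instead of crisp counts."""
    total = memberships.sum()
    if total == 0:
        return 0.0
    h = 0.0
    for c in classes:
        p = memberships[labels == c].sum() / total
        if p > 0:
            h -= p * np.log2(p)
    return h

def fuzzy_information_gain(partition, labels):
    """Fuzzy information gain of an attribute whose value range is
    covered by a fuzzy partition (rows = fuzzy sets, cols = examples)."""
    classes = np.unique(labels)
    # Entropy of the whole node: every example has full membership.
    base = fuzzy_entropy(np.ones(len(labels)), labels, classes)
    cards = partition.sum(axis=1)        # fuzzy cardinality of each set
    weights = cards / cards.sum()
    cond = sum(w * fuzzy_entropy(mu, labels, classes)
               for w, mu in zip(weights, partition))
    return base - cond

# Toy usage: 6 examples, a two-set triangular partition on one attribute.
labels = np.array([0, 0, 0, 1, 1, 1])
partition = np.array([[1.0, 0.8, 0.6, 0.4, 0.2, 0.0],   # "low"
                      [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]])  # "high"
print(fuzzy_information_gain(partition, labels))        # ~0.278
```

Because the class-membership sums needed here are simple accumulators, they can be updated batch by batch, which is what makes criteria of this form compatible with the incremental, memory-bounded construction the abstract describes.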