Improve Spam Detection in the Internet Using Feature Selection based on the Metaheuristic Algorithms

Ghaderzadeh, Abdulbaghi; Hosseinpanahi, sahar; Taher kareem, Sarkhel

Manuscript ID: JACET-2105-1469 (R1) Page: 115 - 125

Article Type: Original Research

Subject Areas : Evolutionary Computing

Abdulbaghi Ghaderzadeh ¹ , sahar Hosseinpanahi ² , Sarkhel Taher kareem ³

1 - Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran
2 - Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran
3 - Computer Department, College of Science, University of Halabja, Halabja, Iraq. Sulaimani Polytechnic University, Technical College of Informatics,Computer Networks Department, Sulaimani, Iraq.

Received: 2021-05-04 Accepted : 2021-12-28 Published : 2021-05-01

Sahar Hosseinpanahi ᵃ, Abdulbaghi Ghaderzadeh ᵃ *¹, Sarkhel H. Taher Karim ᵇ ᶜ

a Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran.

b Computer Department, College of Science, University of Halabja, Halabja, Iraq.

c Sulaimani Polytechnic University, Technical College of Informatics,Computer Networks Department, Sulaimani, Iraq.

Abstract:

Nowadays, spam is a major challenge regarding emails. Spam is a specific type of email that is sent to the network for malicious purposes. Spam plays an important role in stealing information and can include fake links to trick users. Machine learning and data mining techniques such as artificial neural networks are among the most applicable methods for detecting spam. A multi-layer artificial neural network needs the most important features as inputs in order to reduce its output error and detect spam accurately. The proposed method uses a swarm intelligence algorithm for feature selection: a binary version of the Emperor Penguin Optimizer (EPO) selects the most appropriate features, which are then used for learning and classification in the spam detection process. Experiments in the MATLAB environment on the Spambase dataset show that the spam detection error decreases by about 14.61% as the population increases and by about 43.85%, in the best situation, as the feature space increases. The proposed method has a lower spam detection error than the multilayer artificial neural network, recursive neural network, support vector machine, Bayesian network, and whale optimization algorithm; its error is about 23.57% lower than that of the whale optimization algorithm. Empirical results obtained through simulations on the Spambase dataset also show that our approach outperforms the other existing methods in precision.

Keywords: Spam detection, Feature selection, Metaheuristic algorithms, Emperor Penguin Optimizer (EPO)

1. Introduction

Much of the communication between users in cyberspace and on the Internet is done through email. Email services let users send text and messages to other network users, and emails can also contain links, images, or file attachments. Nowadays, spam is a major challenge regarding emails [1]. Spam is a specific type of email that is sent to the network for various purposes. One of these malicious purposes is to intimidate users. In some cases, the only purpose is to advertise a product, in which case a large number of emails are sent in bulk to different people. Spam can also be used to steal information: fake sites are created on the Internet, and links in the emails redirect users to these fake websites to steal their information [2]. Spam can carry various types of malware such as viruses, worms, and Trojans; the victim system is then infected, which can expose its users to attacks. Today, in most cyber-attacks a virus is sent through spam and fake emails to a large number of users on the network, and each infected user and system can become part of the attack [3].

Since spam is a key challenge, various methods have been proposed, including blacklisting techniques [4], heuristic approaches [5], and knowledge discovery [6]. Each of these methods can be used to detect spam, but each relies on a specific mechanism. The blacklist method can filter out spam senders, but it is not efficient because the list grows over time and spam senders' addresses change constantly. As opposed to traditional hypothesis testing, which is designed to verify a priori hypotheses about relations between variables, exploration methods are used to identify systematic relations between variables when there are no (or incomplete) a priori expectations as to the nature of those relations. In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques, in the search for systematic patterns. Computational exploration methods include both simple basic statistics and more advanced multivariate exploratory techniques designed to identify patterns in multivariate data sets. Exploration methods rely on speculation and are not very accurate, whereas knowledge discovery, as a knowledge extraction process, can uncover hidden patterns of spam; it is greatly enhanced by data-based learning, and machine learning can separate spam from regular email. So far, methods based on knowledge discovery have been developed to detect spam, such as regression [7], artificial neural network [8], decision tree [9], random forest [10], and support vector machine [11].

In general, regular emails and spam each have a set of features that can help to identify them. One of the important and effective ways to distinguish spam from regular email is to use knowledge discovery techniques such as data mining and machine learning, which classify messages intelligently. Machine learning methods are very effective in identifying spam because they search for the hidden patterns of spam in received emails. Therefore, in this study, we attempt to provide a suitable method for separating spam from ordinary emails using knowledge-based methods such as data mining and machine learning as suitable techniques for pattern recognition and classification [12].

One of the major challenges in methods that classify email as spam or regular is the output error of the models. Meta-heuristic algorithms can be used to minimize this error. This paper presents a hybrid approach for separating spam from regular email using artificial neural networks and meta-heuristic algorithms.

The outline of the paper is as follows. Section 2 presents a summary of related works. Section 3 describes the proposed approach and the core algorithmic contributions of the paper. Section 4 contains the experimental analysis, including the details of implementing the proposed approach and a discussion of the experimental evaluations. Section 5 presents conclusions and future suggestions.

2. Related Works

The Internet is expanding with a growing number of users, and smartphones play a significant role in connecting to it. The Internet offers a variety of communication services, and email is one of the most useful and attractive of them. The major advantage of email is that it is free and easy to use, so most online users rely on it for their daily communication and correspondence. Like many web services, email faces the challenge of spam. Spam consists of emails sent to a large number of people without their consent, usually in bulk, for purposes such as marketing, advertising, entertainment, information theft and harassment. The most dangerous type of spam is used for information theft: a hacker or phisher sends deceptive emails based on social engineering techniques and lures users to fake sites to steal valuable information. The disadvantages of spam are not limited to stealing information, as these emails also waste a lot of time and consume processing resources in data centers and Internet traffic. Detecting spam is the most important step in filtering out these annoying emails. Spam typically has a set of features that distinguish it from regular email. For example, an email in which the dollar character ($) is repeated over and over may be deceptive spam. Using such a set of email features effectively makes it possible to separate spam from regular emails. Methods such as machine learning and data mining can identify hidden patterns in spam and extract knowledge from them. Spam patterns can be identified by various data mining techniques such as artificial neural networks, and based on these patterns incoming emails can be classified into spam and regular emails.

Singh et al., 2018 [13], presented a new method for detecting spam in the Internet of Things. Their research examines how IoT smart objects can publish their data to social networks such as Twitter and Facebook, and how to detect spam in this communication space. Their results show that the flow of data generated by smart objects and disseminated on social networks constitutes big data, whose analysis requires an appropriate processing platform. They developed a semi-supervised technique for spam detection on Twitter using a framework based on classification techniques such as nearest neighbor, regression and Bayesian network. In their method, learning is based on information about malicious URLs, spam users and spam text. Implementation and simulation results with synthetic data show that the proposed system can accurately detect spam in the context of the IoT and social networks.

Yuancheng et al., 2018 [14], used deep belief networks to detect spam on the web. They combined deep belief networks with the SMOTE and DAE algorithms to increase the accuracy of spam detection. Their implementation results show that this method is more accurate in detecting spam than techniques such as support vector machine and random forest.

Ruano et al., 2018 [15], presented a practical mechanism for detecting spam using evolutionary methods. They used a genetic programming algorithm to create regular expressions for spam detection. Their results show that the rate of misclassified spam in their method is lower than that of the support vector machine and Bayesian network.

Chen et al., 2018 [16], addressed the spam challenge on Weibo, a popular Chinese social network, with a new semi-supervised learning method. Their approach integrates information within a semi-supervised learning framework that considers a number of users known to send spam and a number of other users for whom it is unclear whether they send spam. Their results show that the proposed method detects spam on the Weibo social network better than learning techniques such as random forest.

Kumal and Yadav, 2019 [17], used distributed processing and learning methods based on Hadoop technology to detect large-scale spam. Online reviews are the easiest sources of free information used by organizations and customers to make decisions. Nowadays, many organizations employ skilled spammers with two goals: promoting their own products, and posting positive comments about their products and negative comments about competing products; for this reason they constantly send spam. Spamming in the comments section has become an important challenge on the web. So far, spam detection in the comments section of a site has been treated as a discrete problem with two classes, spam and non-spam; in this research, fuzzy logic replaces the discrete state with a fuzzy state between the two classes. Because fuzzy logic handles real-world uncertainty well, the study presents a new fuzzy model for the problem of spam detection. Four fuzzy linguistic input variables are proposed, and the suspicion level of a spam group is mapped to one of infinite, super, mega, safe, very, medium, small and weak. The study uses 81 fuzzy rules and a fuzzy ranking evaluation algorithm to determine the suspiciousness of a set of opinions. Since the spam dataset contains a large volume of user comments in practice, distributed Hadoop processing is used to accelerate learning. Their implementation results on the Amazon dataset show that the method is about 80.77% accurate and, unlike other approaches, can handle a large number of groups and users, so it can be used reliably and effectively to detect spam in the comments section.

Wonka et al., 2019 [18], presented a study on the classification of spam emails for the IoT environment using a semantic similarity approach. Today, messages promoting unauthorized services or products sent through email are regarded as spam, and identifying spam email is a challenging and difficult process. Methods such as statistical keyword countermeasures, conceptual address lists, and IPs are ineffective because it is difficult to find the new attack patterns generated by malicious devices in the IoT, which is sufficiently complex to require new methods. Other spam detection methods combine conceptual knowledge engineering with machine learning techniques, but spammers still evade them by exploiting the ambiguity of words through multi-word and ambiguous expressions and combinatorial techniques. This research presents a hybrid Bayesian classification method with a conceptual and semantic similarity technique to counteract ambiguity in spam detection. Experimental results show that the proposed system has high accuracy in detecting spam compared to existing approaches.

Zubar et al., 2019 [19], presented a spam detection framework using a hybrid classification scheme. The study uses different combinations of concepts, features and user sentiments to detect spam in social networks, and presents a spam-based weighting scheme for this purpose. Experimental results show that the spam feature selection method improves spam detection on the reviewed sites, and that adding the weighting scheme selects more refined and optimized features, increasing the accuracy of the proposed method from about 93% to 96%. Previous studies used fewer spam-related features and feature weighting schemes, whereas this study presents a weighted feature selection to increase the accuracy of spam detection.

Zulfikar Alom et al. [20] argued that the available ML-based methods cannot efficiently detect spammers on Twitter because spam users manipulate data to avoid detection mechanisms. As an alternative to ML-based detection, they presented a new approach based on deep learning (DL) techniques. Their approach leverages both the tweet text and users' meta-data (e.g., age of an account, number of followings/followers, and so on) to detect spammers. They compared the performance of the proposed approach with five ML-based and two DL-based state-of-the-art approaches on two different real-world datasets, showing a gain in performance when using their approach. Table 1 summarizes these studies:

Table 1. Summary of Studies in Spam Detection

Research	Suggested Method	Results
Singh et al., 2018 [13]	A new method for detecting spam in the Internet of Things was introduced.	Implementation results suggest that the proposed system can accurately detect spam in the IoT.
Yuancheng et al., 2018 [14]	Deep belief networks were used to detect spam on the web.	The method is more accurate than techniques such as support vector machine and random forest.
Ruano et al., 2018 [15]	A practical mechanism for detecting spam was developed using evolutionary methods.	The rate of incorrectly detected spam is lower than that of the support vector machine and the Bayesian network.
Chen et al., 2018 [16]	A new approach based on semi-supervised learning was developed to detect spam on the Weibo social network.	The approach detects spam on Weibo better than learning techniques such as random forest.
Kumal & Yadav, 2019 [17]	Hadoop technology was used to detect large-scale spam.	The method detects spam effectively and at high speed.
Wonka et al., 2019 [18]	Spam emails in the IoT environment were classified using a semantic similarity approach.	The proposed system is more accurate in detecting spam than existing approaches.
Zubar et al., 2019 [19]	A framework for spam detection using a hybrid classification scheme was presented.	Important spam detection methods were explored on Twitter.
Zulfikar Alom et al., 2020 [20]	A new approach based on deep learning techniques was presented and compared with five ML-based and two DL-based state-of-the-art approaches on two different real-world datasets.	The proposed approach shows a gain in performance over the compared approaches.

3. The proposed method

Distinguishing spam from regular email can be formulated as a classification problem in which each incoming message must be assigned to its appropriate category, spam or regular email. To increase efficiency, the classification error must be reduced, which turns the task into an optimization problem. Therefore, a multi-layer artificial neural network is used to separate spam from regular email, and the penguin optimization algorithm is used for feature selection. The spam detection method has two main layers:

§ Learning using multilayer artificial neural network

§ Feature Selection Using Penguin Optimization Algorithm

The Penguin Optimization Algorithm is chosen because its precise behaviors and calculations make it more accurate than conventional metaheuristic algorithms such as the genetic algorithm and particle swarm optimization. Emperor Penguin Optimizer (EPO) mimics the huddling behavior of emperor penguins (Aptenodytes forsteri). The main steps of EPO are to generate the huddle boundary, compute the temperature around the huddle, calculate the distance, and find the effective mover. The huddle is assumed to be situated on a two-dimensional L-shaped polygon plane. First, the emperor penguins generate the huddle boundary randomly. Thereafter, the temperature profile around the huddle is computed. The distance between emperor penguins is also calculated, which helps exploration and exploitation. Finally, the effective mover, i.e., the best solution, is obtained and the boundary of the huddle is recomputed with the updated positions of the emperor penguins (or search agents). This algorithm is used here for feature selection. In this paper, a binary version of the Penguin Optimization Algorithm is presented for feature selection, and this version is combined with an artificial neural network as the classification method for spam detection. The proposed framework for spam detection is shown in Figure 3, with phases such as preprocessing, feature selection and learning using a multi-layer artificial neural network. In this figure, samples from the dataset are used for feature selection, and the optimal feature vector discovered by the penguin algorithm is then used to detect spam. According to the framework of the proposed method, the main steps are as follows:

§ Email and spam samples from the dataset are used to train and evaluate the proposed method.

§ The samples in the dataset are pre-processed and normalized so that the data are ready for learning and evaluation.

§ A feature vector is defined with components of zero and one; zero means the corresponding feature is not selected and one means it is selected. Here, a feature vector is considered as a penguin, that is, a member of the population of the penguin optimization algorithm.

§ Several feature vectors are created randomly as the initial members of the penguin optimization algorithm.

§ Each feature vector is applied to the dataset, and the features whose components are equal to one are used as inputs for learning in the multilayer artificial neural network.

§ Each feature vector is evaluated with two criteria: the error in separating spam from regular email and the number of features selected.

§ The feature vector with the minimum value of the objective function in each iteration is taken as the optimal feature vector and is used to evaluate the proposed method.

§ The optimal feature vector is used to update the other feature vectors, and each feature vector is updated according to the rules of the penguin algorithm.

§ The updated feature vectors are converted back to binary form using transfer and mapping functions, and the steps of the algorithm are repeated.

§ In the last iteration, the optimal feature vector is used to evaluate the proposed method.

Figure 3: The proposed method framework for detecting spam from email
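The random initialization step in the list above can be sketched as follows. This is a minimal illustration in Python rather than the MATLAB implementation used in the study; the function names, the population size of 10, the fixed seed, and the guard against an all-zero vector are assumptions made for the example.

```python
import random

NUM_FEATURES = 57  # number of Spambase input attributes, as stated in the text


def random_feature_vector(num_features, rng):
    """Return a random binary vector; 1 selects a feature, 0 drops it."""
    vector = [rng.randint(0, 1) for _ in range(num_features)]
    # Assumption (not from the paper): avoid an all-zero vector, which
    # would leave the classifier without any input feature.
    if sum(vector) == 0:
        vector[rng.randrange(num_features)] = 1
    return vector


def init_population(pop_size, num_features=NUM_FEATURES, seed=0):
    """Create pop_size random binary feature vectors (the 'penguins')."""
    rng = random.Random(seed)
    return [random_feature_vector(num_features, rng) for _ in range(pop_size)]


population = init_population(pop_size=10)
print(len(population), "vectors of length", len(population[0]))
```

Each inner list plays the role of one penguin; the later steps evaluate, update and re-binarize these vectors.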

In the proposed method for feature selection, a feature vector is a candidate solution to the problem, and the goal is to choose the feature vector that minimizes the error of the artificial neural network. The Spambase dataset has 57 input attributes, so in this study each feature vector has 57 components, each of which is zero or one, indicating that the attribute is not selected or selected. A feature vector can be written as in Eq. 1, and the initial population of feature vectors, created randomly in the first step, as in Eq. 2.

$X_i = (x_{i1}, x_{i2}, \dots, x_{i,57}), \quad x_{ij} \in \{0, 1\}$ (1)

$P = \{X_1, X_2, \dots, X_n\}$ (2)

In these relations, $X_i$ is the i-th feature vector for spam detection and $x_{ij}$ is the j-th feature of the i-th feature vector. The value of $x_{ij}$ can be zero or one, so the vector is binary. In the second relation, n is the number of feature vectors in the initial population, and $X^*$ denotes the optimal feature vector. The cost function for detecting spam from regular email can be modeled as in Eq. 3:

$cost = \alpha \cdot Er + \beta \cdot \frac{F}{A}$ (3)

In this relation, Er is the average spam detection error, F is the number of attributes selected, A is the number of all possible attributes for spam detection, and cost is the final cost of the feature selection. The two parameters $\alpha$ and $\beta$ weight the two terms; $\alpha$ is chosen randomly and $\beta = 1 - \alpha$.
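A cost of this kind can be sketched in a few lines of Python. The sketch assumes a weighted sum of the error Er and the fraction of selected attributes F/A, with the second weight equal to one minus the first; the function name `selection_cost`, the argument names and the example values are hypothetical.

```python
def selection_cost(error, num_selected, num_total, alpha):
    """Weighted cost: alpha * error + (1 - alpha) * fraction of features kept.

    Assumed form for illustration; error plays the role of Er,
    num_selected of F and num_total of A.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    beta = 1.0 - alpha
    return alpha * error + beta * (num_selected / num_total)


# Two candidate vectors with the same error: the one keeping fewer of the
# 57 attributes gets the lower (better) cost. alpha=0.9 is illustrative.
cost_small = selection_cost(error=0.05, num_selected=20, num_total=57, alpha=0.9)
cost_large = selection_cost(error=0.05, num_selected=40, num_total=57, alpha=0.9)
print(cost_small < cost_large)
```

With equal error, the vector that keeps fewer attributes is preferred, which is the behavior the two-part cost is meant to encourage.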

Any feature vector that minimizes this function is considered the best feature vector, or the best penguin. The role of the cost function is to evaluate the feature vectors during feature selection. To update a feature vector (penguin) in the penguin optimization algorithm, its distance from the optimal feature vector is calculated first, as in Eq. 4:

$D_i = \left| S(\vec{A}) \cdot X^* - \vec{C} \cdot X_i \right|$ (4)

In this relation, $X_i$ is a feature vector, $X^*$ is the optimal feature vector, $\vec{A}$ and $\vec{C}$ are two parameters of the Penguin Optimization algorithm (distinct from the attribute count A of the cost function), and S is a function of the feature vector variations. Each feature vector is then updated based on this distance from the optimal penguin, as shown in Eq. 5, where t+1 denotes the next iteration:

$X_i(t+1) = X^* - \vec{A} \cdot D_i$ (5)
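As an illustration of the distance and update steps, the sketch below assumes the update form of the original Emperor Penguin Optimizer: the distance is the absolute difference between the scaled best position and the scaled current position, and the new position is the best position minus a coefficient times that distance. The function name `epo_update` and the arguments `a_coef`, `c_coef` and `s_value` are hypothetical; how these coefficients are computed in each iteration is left out, and the values used below are arbitrary.

```python
def epo_update(agent, best, a_coef, c_coef, s_value):
    """One EPO-style position update for a single search agent.

    agent, best: lists of numbers with the same length.
    a_coef, c_coef, s_value: scalar stand-ins for the EPO coefficients
    (assumed to be supplied by the caller in this sketch).
    """
    if len(agent) != len(best):
        raise ValueError("agent and best must have the same length")
    updated = []
    for x_j, best_j in zip(agent, best):
        # Distance of this component from the best penguin.
        distance = abs(s_value * best_j - c_coef * x_j)
        # Move relative to the best penguin by a multiple of the distance.
        updated.append(best_j - a_coef * distance)
    return updated


agent = [1, 0, 1, 0]
best = [1, 1, 0, 0]
new_position = epo_update(agent, best, a_coef=0.5, c_coef=0.8, s_value=1.0)
print(new_position)
```

The result is generally no longer binary, which is why a binarization step has to follow the update.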

Applying this relation to all feature vectors updates them; each feature vector is updated based on the position of the optimal feature vector. However, the update moves the vectors, or some of their components, out of the binary domain into continuous values, so they must be converted back to binary. This is done with transfer functions, which can be a Gaussian function or a V function whose range is zero to one. Their equations are shown in Eq. 6 and Eq. 7, respectively:

(6)

(7)
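The binarization step can be illustrated with two commonly used transfer functions. The exact functions of the paper are not reproduced here: the logistic (S-shaped) function and the absolute hyperbolic tangent (V-shaped) below are stand-ins chosen for the example, and the rule of rounding to the nearer of 0 and 1 (a threshold of 0.5) is an assumption based on the description in the text.

```python
import math


def s_shaped(x):
    """Logistic transfer function with range (0, 1), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def v_shaped(x):
    """V-shaped transfer function |tanh(x)| with range [0, 1)."""
    return abs(math.tanh(x))


def binarize(position, transfer=s_shaped, threshold=0.5):
    """Map each continuous component to 0 or 1 via a transfer function."""
    return [1 if transfer(x) >= threshold else 0 for x in position]


continuous = [0.9, 0.5, -0.4, 0.0]
print(binarize(continuous))
print(binarize(continuous, transfer=v_shaped))
```

The two transfer functions can select different features from the same continuous vector, so the choice of function is part of the algorithm's configuration.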

To binarize a feature vector, each non-binary component is passed through one of these functions and the result is mapped to whichever of zero or one is closest to it. The flowchart in Fig. 4 presents the proposed method for feature selection and neural network training with feature vectors in the penguin optimization algorithm. To distinguish spam from real email as a classification problem, some random vectors with 0 and 1 components are created first. Each feature vector has 57 components, and only the components equal to 1 select the corresponding Spambase features as inputs to the artificial neural network. Each of the feature selection vectors is then evaluated with the following two factors:

§ Minimum spam and email classification error.

§ Minimum number of selected components.

The optimal feature vector, the one with the least possible error, is considered the optimal penguin. In the proposed method, the feature vectors are first evaluated with the cost function, and in each iteration the optimal feature vector with minimum error and few features is determined. Next, each feature vector calculates its distance from the optimal feature vector (optimal penguin) and is updated using the optimal feature vector and that distance. Each feature vector is then made binary using functions such as the V and Gaussian functions, and the resulting binary vectors form the new population of penguins for the next iteration. The algorithm repeats until the last iteration, and the optimal feature vector obtained in the last step is used for spam detection.

Figure 4. Flowchart of the proposed spam detection algorithm

4. Analysis and Evaluation

We consider each penguin as a feature vector with 57 cells, from which we want to select some features. Each cell is randomly assigned a value of 0 or 1. A value of 0 indicates that the feature is not used for neural network training. We apply these feature vectors to the dataset and use the columns whose value is 1 to train the artificial neural network.
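Applying a feature vector to the dataset amounts to keeping only the columns whose cell is 1. A minimal Python sketch, using a hypothetical helper `select_columns` and a toy table with four columns instead of the 57 Spambase attributes:

```python
def select_columns(rows, mask):
    """Keep only the columns whose mask bit is 1."""
    if any(len(row) != len(mask) for row in rows):
        raise ValueError("every row must have one value per mask bit")
    keep = [j for j, bit in enumerate(mask) if bit == 1]
    return [[row[j] for j in keep] for row in rows]


# Toy data: two samples, four attributes (illustrative values only).
rows = [[0.1, 0.2, 0.3, 0.4],
        [1.0, 2.0, 3.0, 4.0]]
mask = [1, 0, 0, 1]  # select the first and the last attribute
reduced = select_columns(rows, mask)
print(reduced)
```

The reduced table is what the neural network would be trained on for this particular feature vector.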

To implement and evaluate the proposed method, a dataset from the UCI repository or Kaggle can be used; one of them is Spambase, which is known as a global dataset [35]. Detecting spam from regular email is a classification problem whose main index is the spam detection error, so two important indices, MAE and RMSE, can be used to evaluate the proposed spam detection algorithm. They are the mean absolute error and the root mean squared error of spam detection, and can be formulated as in Eq. 8 and Eq. 9 [21]:

$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$ (8)

$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$ (9)
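The two indices can be computed directly from the actual and estimated class labels. The sketch below uses the standard definitions of MAE and RMSE; the function names and the four-sample example are illustrative and are not taken from the experiments.

```python
import math


def _check(actual, predicted):
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")


def mae(actual, predicted):
    """Mean absolute error between actual and estimated class labels."""
    _check(actual, predicted)
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error between actual and estimated class labels."""
    _check(actual, predicted)
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


# 1 = spam, 0 = regular email; one of four samples is misclassified.
actual = [1, 0, 1, 1]
predicted = [1, 0, 0, 1]
print(mae(actual, predicted), rmse(actual, predicted))
```

For hard 0/1 labels the MAE equals the misclassification rate, and the RMSE is its square root.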

In these two relations, the actual class of a spam or email sample is represented by $y_i$ and the class estimated by the learning method is represented by $\hat{y}_i$; n is the number of samples used for evaluation. In the implementation of the proposed method, two indices are very important: the cost function of feature selection and the root mean squared error (RMSE) of spam detection from regular email. Both can be tracked over the iterations of the proposed algorithm and shown in its output. A multilayer artificial neural network with two layers of 5 neurons each is used, and the number of iterations is 30. The proposed method is implemented in the MATLAB 2018 and 2019 programming environments. According to the experiments, the RMSE of the multilayer artificial neural network, recursive neural network, support vector machine, Bayesian network, Whale optimization algorithm and the proposed algorithm is 0.223, 0.350, 0.294, 0.388, 0.0403 and 0.0308, respectively. A graphical comparison of the proposed method with these methods is shown in Fig. 5.

Figure 5: Comparison of RMSE average error of the proposed method with learning and data mining methods

According to Fig. 5, the proposed algorithm has the lowest spam detection error among the multilayer artificial neural network, recursive neural network, support vector machine, Bayesian network, and Whale optimization algorithm. The average spam detection error of the Whale optimization algorithm is also low: 0.0403, compared with 0.0308 for the proposed method. The proposed algorithm therefore has about 23.57% less error than its closest competitor, the Whale optimization algorithm. Among the machine learning methods, the Bayesian network performs worst, with an error of 0.388, followed by the recursive artificial neural network; the multi-layer artificial neural network has less error in detecting spam than the recursive artificial neural network. The precision of the proposed method can be compared with that of each of the learning methods (Weka outputs in our experiments), as shown in Fig. 6, and with similar studies such as [22], as shown in Fig. 7. In the first diagram, the proposed method is compared with the multilayer artificial neural network, recursive artificial neural network, Bayesian network, and support vector machine. Experiments show that the precision of the multilayer artificial neural network, recursive artificial neural network, support vector machine, Bayesian network and proposed method for spam detection is 94.08%, 83.91%, 91.33%, 84.75%, and 99.76%, respectively. The proposed method has the highest precision.
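The 23.57% figure quoted above is the relative reduction of the error with respect to the Whale optimization algorithm. It can be checked with a short calculation; the helper name `relative_reduction` is hypothetical, while the two error values are the ones reported in the text.

```python
def relative_reduction(baseline, improved):
    """Percentage by which improved is lower than baseline."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (baseline - improved) / baseline * 100.0


whale_error = 0.0403     # error reported for the Whale optimization algorithm
proposed_error = 0.0308  # error reported for the proposed method
reduction = relative_reduction(whale_error, proposed_error)
print(f"{reduction:.2f}%")
```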

Figure 6: Comparison of the precision of the proposed method with learning and data mining methods

Fig. 7 compares the precision of the proposed method with similar feature selection methods such as decision tree, random forest and Bayesian network. The precision of the decision tree, random forest, Bayesian network, and proposed method is 97.20%, 76%, and 99.76%, respectively.

The proposed method has less error in spam detection than the decision tree, random forest, and Bayesian network. The experiments in this section show that, in separating spam from regular email, the proposed method is less error-prone than both the basic methods (multilayer artificial neural network, recursive artificial neural network, Bayesian network, and support vector machine) and the combined feature selection methods (decision tree, random forest, and Bayesian network).

Figure 7: Comparison of the precision of the proposed method with similar feature selection methods

5. Conclusions and future suggestions

Spam is a type of annoying email that is sent to Internet users for purposes such as advertising and stealing information. Among spam detection methods, data mining and machine learning methods are widely used because they can detect spam patterns. The artificial neural network is a practical way of detecting spam because it can learn spam patterns and then use them to detect spam. An important challenge for multilayer artificial neural networks in separating spam from regular email is the output (classification) error rate. This error can be reduced by using only the important features as inputs to the network; the proposed method therefore uses a metaheuristic algorithm for feature selection. Specifically, we used the Emperor Penguin Optimizer as a feature selector for the multilayer artificial neural network to reduce its output error in spam detection. The proposed method was implemented in the MATLAB environment, with Spambase, a widely used spam detection dataset, as the benchmark. The results show that the feature selection cost function decreases over the iterations of the Emperor Penguin Optimizer. This cost function has two components, the error and the number of features, so its reduction shows that running the proposed algorithm generally reduces both, and the error in separating spam from regular email steadily decreases. Experiments show that as the population increases, the feature space decreases by 43.85% in the best situation and by 24.56% in the worst situation. Experiments also show that the proposed algorithm has about 23.57% less error in detecting spam than the competing Whale optimization algorithm.
Among the machine learning methods for separating spam from regular email, the Bayesian network performs worst, with an error of 0.388. The precision of the multilayer artificial neural network, the recursive artificial neural network, the support vector machine, the Bayesian network, and the proposed method is 94.08%, 83.91%, 91.33%, 84.75%, and 99.76%, respectively, so the proposed method has the highest precision. Among the feature selection methods for spam detection, the precision of the decision tree, random forest, Bayesian network, and the proposed method is 98.4%, 97.20%, 76%, and 99.76%, respectively, and the proposed method again has the highest precision. One limitation of the proposed method is that if spammers use images instead of text, its ability to detect spam will diminish. In future research, the proposed method will therefore be extended and optimized to detect image spam by integrating image processing and machine learning.
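As a rough illustration of a two-component feature selection cost of the kind described in this section, the following Python sketch combines a classification error with the fraction of selected features in a weighted sum. The weighted-sum form, the weight `alpha`, the function name, and the example error values are all assumptions made for illustration; the exact cost function and its MATLAB implementation may differ. The only dataset fact used is that Spambase has 57 input features.

```python
from typing import Sequence


def feature_selection_cost(mask: Sequence[int], error: float, alpha: float = 0.99) -> float:
    """Weighted sum of classifier error and fraction of selected features.

    mask  : binary vector, 1 = feature kept, 0 = feature dropped
    error : classification error obtained with the kept features
    alpha : assumed trade-off weight (not taken from the paper)
    """
    n_total = len(mask)
    if n_total == 0:
        raise ValueError("mask must not be empty")
    n_selected = sum(1 for bit in mask if bit)
    if n_selected == 0:
        # Assumption: an empty subset cannot train a classifier, so penalize it.
        return float("inf")
    return alpha * error + (1.0 - alpha) * (n_selected / n_total)


N_FEATURES = 57  # number of input features in Spambase

# Hypothetical error values, chosen only to show how the two terms interact.
full_mask = [1] * N_FEATURES
small_mask = [1] * 20 + [0] * (N_FEATURES - 20)
cost_full = feature_selection_cost(full_mask, error=0.05)
cost_small = feature_selection_cost(small_mask, error=0.04)
print(cost_small < cost_full)  # prints True
```

Under this assumed form, a candidate feature subset scores better when it lowers the error, uses fewer features, or both, which is consistent with both indices falling as the optimizer runs.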

References:

[1] Ferrara, E. (2019). The history of digital spam. Communications of the ACM, 62(8), 82-91.

[2] Nair, S. (2019, May). The Roving Proxy Framework for SMS Spam and Phishing Detection. In 2019 2nd International Conference on Computer Applications & Information Security (ICCAIS) (pp. 1-6). IEEE.

[3] Prasad, R., & Rohokale, V. (2020). Artificial Intelligence and Machine Learning in Cyber Security. In Cyber Security: The Lifeline of Information and Communication Technology (pp. 231-247). Springer, Cham.

[4] Wang, K. (2019). Blacklist filtering for security research: bridging the gap between domain blacklists and malicious web content (Doctoral dissertation).

[5] Caraffini, F., Neri, F., & Epitropakis, M. (2019). HyperSPAM: A study on hyper-heuristic coordination strategies in the continuous domain. Information Sciences, 477, 186-202.

[6] Khamis, S. A., Foozy, C. F. M., Ab Aziz, M. F., & Rahim, N. (2020, January). Header Based Email Spam Detection Framework Using Support Vector Machine (SVM) Technique. In International Conference on Soft Computing and Data Mining (pp. 57-65). Springer, Cham.

[7] Othman, N. F., & Din, W. I. S. W. (2019). YouTube spam detection framework using naïve Bayes and logistic regression. Indonesian Journal of Electrical Engineering and Computer Science, 14(3), 1508-1517.

[8] Jain, A. K., Goel, D., Agarwal, S., Singh, Y., & Bajaj, G. (2020). Predicting Spam Messages Using Back Propagation Neural Network. Wireless Personal Communications, 110(1), 403-422.

[9] Zhang, Y., Wang, S., Phillips, P., & Ji, G. (2014). Binary PSO with mutation operator for feature selection using decision tree applied to spam detection. Knowledge-Based Systems, 64, 22-31.

[10] Faris, H., Aljarah, I., & Al-Shboul, B. (2016, September). A hybrid approach based on particle swarm optimization and random forests for e-mail spam filtering. In International Conference on Computational Collective Intelligence (pp. 498-508). Springer, Cham.

[11] Al-Zoubi, A. M., Faris, H., Alqatawna, J. F., & Hassonah, M. A. (2018). Evolving support vector machines using whale optimization algorithm for spam profiles detection on online social networks in different lingual contexts. Knowledge-Based Systems, 153, 91-104.

[12] Saha, S., DasGupta, S., & Das, S. K. (2019). Spam Mail Detection Using Data Mining: A Comparative Analysis. In Smart Intelligent Computing and Applications (pp. 571-580). Springer, Singapore.

[13] Singh, A., & Batra, S. (2018). Ensemble based spam detection in social IoT using probabilistic data structures. Future Generation Computer Systems, 81, 359-371.

[14] Li, Y., Nie, X., & Huang, R. (2018). Web spam classification method based on deep belief networks. Expert Systems with Applications, 96, 261-270.