Improve Spam Detection in the Internet Using Feature Selection based on the Metaheuristic Algorithms
Subject Areas : Evolutionary Computing
Abdulbaghi Ghaderzadeh 1 , Sahar Hosseinpanahi 2 , Sarkhel Taher Kareem 3
1 - Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran
2 - Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran
3 - Computer Department, College of Science, University of Halabja, Halabja, Iraq.
Sulaimani Polytechnic University, Technical College of Informatics, Computer Networks Department, Sulaimani, Iraq.
Sahar Hosseinpanahi a, Abdulbaghi Ghaderzadeh a,*,1, Sarkhel H. Taher Karim b,c
a Department of Computer Engineering, Sanandaj Branch, Islamic Azad University, Sanandaj, Iran.
b Computer Department, College of Science, University of Halabja, Halabja, Iraq.
c Sulaimani Polytechnic University, Technical College of Informatics, Computer Networks Department, Sulaimani, Iraq.
Abstract:
Nowadays, spam is a major challenge for email. Spam is a type of email sent over the network for malicious purposes; it plays an important role in information theft and can include fake links to trick users. Machine learning and data mining techniques such as artificial neural networks are among the most widely applied methods for detecting spam. A multi-layer artificial neural network needs the most important features as inputs in order to reduce its output error and detect spam accurately. The proposed method uses a swarm intelligence algorithm for feature selection: a binary version of the Emperor Penguin Optimizer (EPO) selects the most appropriate features, which are then used for learning and classification in the spam detection process. Experiments in the MATLAB environment on the Spambase dataset show that increasing the population size reduces the spam detection error by about 14.61%, and that increasing the feature space reduces it by about 43.85% in the best case. The experiments also show that the proposed method has a lower spam detection error than the other methods compared, namely the multilayer artificial neural network, recursive neural network, support vector machine, Bayesian network, and whale optimization algorithm; in particular, its error is about 23.57% lower than that of the whale optimization algorithm. Empirical results obtained through simulations on the Spambase dataset show that our approach outperforms the other existing methods in precision.
Keywords: Spam detection, Feature selection, Metaheuristic algorithms, Emperor Penguin Optimizer (EPO)
1. Introduction
Much of the communication between users in cyberspace and on the Internet takes place through email. Email services let users send text and messages to other network users, and an email can also carry links, images, or file attachments. Nowadays, spam is a major challenge for email [1]. Spam is a type of email that is sent over the network for various purposes. One of these purposes can be to intimidate users. In some cases the only purpose is to advertise a product, in which case a large number of emails are sent in bulk to different people. Spam can also be used to steal information: fake sites are created on the Internet, and links in the emails redirect users to these fake websites to steal their information [2]. Spam can carry various types of malware such as viruses, worms, and Trojans, in which case the victim's system is infected and the user is exposed to attacks. Today, in most cyber-attacks a virus is sent through spam and fake emails to a large number of users on the network, so that each user and system can become part of the attack [3]. Since spam is an important challenge, various methods have been proposed to counter it, including blacklisting techniques [4], heuristic approaches [5], and knowledge discovery [6]. Each of these methods can be used to detect spam, but each relies on a specific mechanism. The blacklist method can filter out spam senders, but it is not efficient because the list grows over time and spam senders' addresses change constantly. As opposed to traditional hypothesis testing, which is designed to verify a priori hypotheses about relations between variables, exploration methods are used to identify systematic relations between variables when there are no (or incomplete) a priori expectations as to the nature of those relations.
In a typical exploratory data analysis process, many variables are taken into account and compared, using a variety of techniques, in the search for systematic patterns. Computational exploration methods include both simple basic statistics and more advanced multivariate exploratory techniques designed to identify patterns in multivariate data sets. Exploration methods rely on speculation and are not very accurate. Knowledge discovery, as a knowledge extraction process, can uncover hidden patterns of spam, and it can be greatly enhanced by data-driven learning, so that machine learning can separate spam from regular email. So far, knowledge discovery methods developed to detect spam include regression [7], artificial neural networks [8], decision trees [9], random forests [10], and support vector machines [11].
In general, both regular emails and spam have a set of features that can help to identify them. One important and effective way to distinguish spam from regular email is to use knowledge discovery techniques such as data mining and machine learning, which classify messages as spam or regular email automatically. Machine learning methods are very effective at identifying spam because they search for the hidden patterns of spam in received emails. Therefore, in this study we attempt to provide a suitable method for distinguishing spam from ordinary email using knowledge-based methods such as data mining and machine learning, which are well suited to pattern recognition and classification [12].
One of the major challenges in methods that classify email as spam or regular is the output error of the models. Meta-heuristic algorithms can be used to minimize this error. This paper presents a hybrid approach for distinguishing spam from regular email using artificial neural networks and meta-heuristic algorithms.
The outline of the paper is as follows. Section 2 presents a summary of related work. Section 3 describes the proposed approach and the core algorithmic contributions of the paper. Section 4 contains the experimental analysis: the implementation details of the proposed approach and a discussion of the experimental evaluations. Section 5 presents conclusions and suggestions for future work.
2. Related Works
The global Internet is expanding with a growing number of users, and smartphones play a significant role in connecting to it. The Internet offers users a variety of communication services, and email is one of the most useful and attractive of them. The major advantage of email is that it is free and easy to use, so most online users rely on it for their daily communication and correspondence. Like many web services, email also suffers from spam, which is a disadvantage for users. Spam can be defined as email that is sent to a large number of people without the users' consent, usually in bulk, for purposes such as marketing, advertising, entertainment, information theft, harassment and more. The most dangerous type of spam is used for information theft: a hacker or phisher sends deceptive emails to users, using social engineering techniques, and lures them to fake sites in order to steal valuable information. The disadvantages of spam are not limited to information theft, as these emails also waste a lot of time and consume data center processing resources and Internet traffic. Identifying and detecting spam is the most important step in filtering out these annoying emails. Spam typically has a set of features that make it possible to distinguish it from regular email. For example, an email in which the dollar character ($) is repeated over and over may be deceptive spam. Using such a set of email features effectively can therefore help to separate spam from regular email. Methods such as machine learning and data mining make it possible to identify hidden patterns in spam and extract knowledge from them. Spam patterns can be identified by various data mining techniques such as artificial neural networks, and based on these patterns incoming emails can be classified as spam or regular email.
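As a rough illustration of the kind of feature mentioned above, the sketch below computes the percentage of characters in a message that equal a given character, which is how the character-frequency attributes of the Spambase dataset are defined. The function name and the two sample messages are illustrative assumptions, not part of the original study.

```python
def char_frequency(text: str, char: str) -> float:
    """Percentage of characters in `text` that are equal to `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


# Hypothetical sample messages, for illustration only.
regular = "Meeting moved to 3 pm, see you there."
suspicious = "WIN $$$ NOW! Claim $1000 $$$ today $$$"

print(char_frequency(regular, "$"))
print(char_frequency(suspicious, "$"))
```

A single feature like this is only a weak signal on its own; in the approach described in this paper, many such attributes are combined and a subset of them is selected before classification.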
Singh et al., 2018 [13], presented a new method for detecting spam in the Internet of Things. This research examines how Internet of Things smart objects can publish their data to social networks such as Twitter and Facebook, and how to detect spam in this communication space. The results of their studies show that the flow of data generated by smart objects and disseminated on social networks creates big data, whose analysis requires an appropriate platform for processing this volume of data. In this study, a semi-supervised technique for spam detection on Twitter was developed using a framework based on classification techniques such as nearest neighbor, regression and Bayesian networks. In their proposed method, learning is based on information about malicious URLs, information about spam users, and spam text. Implementation and simulation results with synthetic data show that the proposed system can accurately detect spam in the context of the IoT and social networks.
Yuancheng et al., 2018 [14], used deep belief networks to detect spam on the web. In this study, they also combined deep belief networks with methods such as the SMOTE and DAE algorithms to increase the accuracy of spam detection. Their implementation results show that this method is more accurate in detecting spam than techniques such as the support vector machine and random forest.
Ruano et al., 2018 [15], presented a practical mechanism for detecting spam using evolutionary methods. In their research, they used a genetic programming algorithm to create regular expressions for spam detection. Their results show that the rate of misclassified spam in their proposed method is lower than that of the support vector machine and the Bayesian network.
Motivated by the spam problem on Weibo, a popular Chinese social network, Chen et al., 2018 [16], presented a new semi-supervised learning method to detect spam on that network. Their approach uses a semi-supervised learning framework that integrates information from a number of users known to send spam and a number of other users for whom it is unclear whether or not they send spam. The results of their research show that the proposed method performs better in spam detection on Weibo than learning techniques such as random forest.
Kumal and Yadav, 2019 [17], used distributed processing and learning methods based on Hadoop technology to detect spam at large scale. Online surveys are the easiest sources of free information used by organizations and customers to make decisions, and they can be used to counter spam. Nowadays, many organizations employ skilled spam specialists with two goals: to promote their own products, and to post positive comments about their own products and negative opinions about competing products. For this reason they constantly send spam. Spamming in the comments section has therefore become an important challenge on the web. So far, spam detection in the comments section of a site has been treated as a discrete problem, with each item regarded as either spam or non-spam. In this research, fuzzy logic is used to replace the discrete state with a fuzzy state between the two classes. Because fuzzy logic handles real-world uncertainty well, the study presents a solution based on a new fuzzy model for the spam detection problem. Four fuzzy linguistic input variables are proposed, and the suspicion level of a spam group is mapped to one of infinite, super, mega, safe, very, medium, small and weak. The study uses 81 fuzzy rules and a fuzzy ranking evaluation algorithm to determine the suspiciousness of a set of opinions. Since the spam dataset contains a large volume of user comments in practice, distributed Hadoop processing is used to accelerate learning. Their implementation results on the Amazon dataset show that the method is about 80.77% accurate and that, unlike other approaches, it can be applied to a large number of groups and users and can detect spam in the comments section reliably and effectively.
Wonka et al., 2019 [18], presented a study on the classification of spam emails for the IoT environment using a semantic similarity approach. Today, unsolicited messages promoting services or products and sent by email are regarded as spam. Identifying spam in email is a challenging and difficult process. Methods such as statistical keyword filtering and lists of addresses and IPs are ineffective because it is difficult to find the new attack patterns generated by malicious devices in the IoT, which is sufficiently complex to require new methods. Other spam detection methods rely on a combination of conceptual knowledge engineering and machine learning techniques, but spammers still evade them with sophisticated methods that exploit the sensitivity of these techniques to individual words, using multi-word expressions, ambiguous wording and word combinations to deceive spam filters. In this research, a hybrid Bayesian classification method combined with a conceptual and semantic similarity technique is presented to counteract ambiguity in spam detection. Experimental results show that the proposed system has high accuracy in detecting spam compared to existing approaches.
Zubar et al., 2019 [19], presented a spam detection framework using a hybrid classification scheme. This study uses different combinations of concepts, features and user sentiments to detect spam in social networks. For this purpose, a spam-based weighting scheme is presented. Experimental results show that the feature selection method improves spam detection on the sites reviewed, and that adding a weighting scheme allows a more refined and optimized set of features to be selected: the accuracy of the proposed method increased from about 93% to 96%. Previous studies used fewer spam-related features and feature weighting schemes, whereas this study presents a weighted feature selection to increase the accuracy of spam detection.
Zulfikar Alom et al. [20] argue that the available ML-based methods cannot efficiently detect spammers on Twitter because spam users can manipulate data to avoid detection mechanisms. As an alternative to ML-based detection, they present a new approach based on deep learning (DL) techniques. Their approach leverages both the tweet text and the users' meta-data (e.g., age of an account, number of followings/followers, and so on) to detect spammers. They compare the performance of the proposed approach with five ML-based and two DL-based state-of-the-art approaches on two different real-world datasets, showing a gain in performance when using their approach. Table 1 summarizes these studies:
Table 1. Summary of Studies in Spam Detection
Results | Suggested Method | Research |
Their implementation results suggest that the proposed system can accurately detect spam in the IoT. | A new way to detect spam in the Internet of Things was introduced. | Singh et al., 2018 [13] |
This method is more accurate than techniques such as the support vector machine and random forest. | They used the concepts of deep belief networks to detect spam on the web. | Yuancheng et al., 2018 [14] |
The rate of spam that is incorrectly detected in their proposed method is lower than that of the support vector machine and the Bayesian network. | Using evolutionary methods, they developed a functional mechanism for detecting spam. | Ruano et al., 2018 [15] |
The proposed approach is better at detecting spam on the Weibo social network than learning techniques such as random forest. | They developed a new approach based on semi-supervised learning to detect spam on the social network. | Chen et al., 2018 [16] |
Spam detection is one of the advantages of this method and its speed is high. | Hadoop technology was used to detect large-scale spam. | Kumal & Yadav, 2019 [17] |
The proposed system is more accurate in detecting spam than existing approaches. | They presented a study on the classification of spam emails for the IoT environment using a semantic similarity approach. | Wonka et al., 2019 [18] |
Important spam detection methods have been explored on Twitter. | They presented a framework for spam detection using a combination classification scheme. | Zubar et al., 2019 [19] |
The proposed approach shows a gain in performance over the compared approaches. | They present a new approach based on deep learning techniques and compare its performance with five ML-based and two DL-based state-of-the-art approaches on two different real-world datasets. | Zulfikar Alom et al., 2020 [20] |
3. The Proposed Method
To distinguish spam from regular email, the problem can be formulated as a classification task in which each incoming message must be assigned to its appropriate category, spam or regular email. To increase efficiency, the classification error must be reduced, so the problem is also treated as an optimization problem. Therefore, a multi-layer artificial neural network is used to classify messages as spam or regular email, and the penguin optimization algorithm is used for feature selection. The spam detection method has two main layers:
§ Learning using a multilayer artificial neural network
§ Feature selection using the Penguin Optimization Algorithm
The Penguin Optimization Algorithm was chosen because it relies on a set of precise behaviors and calculations that make it more accurate than conventional meta-heuristic algorithms such as the genetic algorithm and particle swarm optimization. The Emperor Penguin Optimizer (EPO) mimics the huddling behavior of emperor penguins (Aptenodytes forsteri). The main steps of EPO are to generate the huddle boundary, compute the temperature around the huddle, calculate the distance, and find the effective mover. The huddle is assumed to be situated on a two-dimensional L-shaped polygon plane. First, the emperor penguins generate the huddle boundary randomly. Then the temperature profile around the huddle is computed. The distance between emperor penguins is also calculated, which supports both exploration and exploitation. Finally, the effective mover, i.e., the best solution, is obtained and the huddle boundary is recomputed with the updated positions of the emperor penguins (or search agents). This algorithm is used here for feature selection. In this paper, a binary version of the Penguin Optimization Algorithm is presented for feature selection in spam detection, and this version is then combined with an artificial neural network that serves as the classifier. The proposed framework for spam detection is shown in Figure 3, with phases such as preprocessing, feature selection, and learning using a multi-layer artificial neural network. The figure shows that samples from the dataset are used for feature selection, and that the optimal feature vector is then discovered by the penguin algorithm and used to detect spam. According to the framework of the proposed method, the main steps are as follows:
§ Regular email and spam samples from the dataset are used to train and evaluate the proposed method.
§ The samples in the dataset are pre-processed and normalized so that the data are ready for learning and evaluation.
§ A feature vector is defined with components equal to zero or one; zero means the corresponding feature is not selected and one means it is selected. Each feature vector is treated as a penguin, i.e., a member of the population of the penguin optimization algorithm.
§ Several feature vectors are created randomly as the members of the penguin optimization algorithm.
§ Each feature vector is applied to the dataset, and the attributes whose components are equal to one are used as inputs for learning in the multilayer artificial neural network.
§ Each feature vector is evaluated on two criteria: the error in distinguishing spam from regular email, and the number of features selected.
§ The feature vector that has the minimum value of the objective function in each iteration is considered the optimal feature vector and is used to evaluate the proposed method.
§ The optimal feature vector is used to update the other feature vectors, and each feature vector is updated according to the rules of the penguin algorithm.
§ The updated feature vectors are converted to binary form using conversion and mapping functions, and the steps of the algorithm are repeated iteratively.
§ In the last iteration, the optimal feature vector is used to evaluate the proposed method.
Figure 3: Framework of the proposed method for detecting spam email
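The steps above can be sketched as a generic binary wrapper loop. This is only a minimal sketch under stated assumptions: the update rule is a simple placeholder that drifts toward the best vector found so far (it is not the EPO update), the sigmoid transfer function is one common choice for converting continuous positions to bits, and all names (`random_mask`, `binarize`, `select_features`) and default parameters are hypothetical. In the proposed method, the `cost` callable would train and evaluate the multilayer neural network on the selected attributes; a toy cost is used here so the example runs on its own.

```python
import math
import random

N_FEATURES = 57  # number of input attributes in Spambase


def random_mask(rng, n=N_FEATURES):
    """Random binary feature vector: 1 = selected, 0 = not selected."""
    return [rng.randint(0, 1) for _ in range(n)]


def binarize(position, rng):
    """Map a continuous position to bits with a sigmoid transfer function."""
    return [1 if rng.random() < 1.0 / (1.0 + math.exp(-p)) else 0
            for p in position]


def select_features(cost, pop_size=10, iterations=30, seed=0):
    """Return (best_mask, best_cost) for a user-supplied cost(mask)."""
    rng = random.Random(seed)
    masks = [random_mask(rng) for _ in range(pop_size)]
    # Continuous positions behind each mask: bit 0 -> -2.0, bit 1 -> +2.0.
    positions = [[4.0 * b - 2.0 for b in m] for m in masks]
    best = min(masks, key=cost)
    best_cost = cost(best)
    for _ in range(iterations):
        for i in range(pop_size):
            # Placeholder update (NOT the EPO equations): drift toward the
            # best mask found so far, plus a little Gaussian noise.
            positions[i] = [
                p + rng.random() * ((4.0 * b - 2.0) - p) + rng.gauss(0.0, 0.3)
                for p, b in zip(positions[i], best)
            ]
            masks[i] = binarize(positions[i], rng)
            c = cost(masks[i])
            if c < best_cost:
                best, best_cost = list(masks[i]), c
    return best, best_cost


# Toy cost for demonstration only: prefer fewer selected features.
best_mask, best_cost = select_features(lambda mask: sum(mask))
print(best_cost, best_mask)
```

Keeping the search loop independent of the cost function makes it straightforward to swap the toy cost for one that trains a classifier on the masked dataset.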
In the proposed method for feature selection, a feature vector is a candidate solution to the problem, and the goal is to choose the optimal feature vector so as to reduce the error of the artificial neural network. The Spambase dataset has 57 input attributes, so in this study each feature vector has 57 components, each of which is zero or one, indicating that the attribute is not selected or selected. A feature vector can be expressed as in Eq. 1, and a set of such vectors, created randomly in the first step, forms the initial population as in Eq. 2.
$$X_i = (x_{i1}, x_{i2}, \ldots, x_{i57}), \qquad x_{ij} \in \{0, 1\} \tag{1}$$

$$P = \{X_1, X_2, \ldots, X_n\} \tag{2}$$

In these relations, $X_i$ is the i-th feature vector for spam detection and $x_{ij}$ is the j-th feature of the i-th feature vector. The value of $x_{ij}$ can be zero or one, so $X_i$ is a binary vector. In the second relation, n is the number of feature vectors created as the initial population, and the best vector found is taken as the optimal feature vector. The cost function used to distinguish spam from regular email can be modeled as in Eq. 3:
[Equations (3)-(10)]
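The text describes the cost of a feature vector as depending on two criteria, the spam detection error and the number of selected features. One common way to combine two such criteria is a weighted sum, and the sketch below assumes that form purely for illustration; the weight `alpha`, its default value, and the function names are hypothetical and are not taken from the paper.

```python
def apply_mask(row, mask):
    """Keep only the attributes whose mask bit is 1."""
    return [value for value, bit in zip(row, mask) if bit == 1]


def error_rate(predicted, actual):
    """Fraction of messages whose predicted label differs from the true one."""
    wrong = sum(1 for p, a in zip(predicted, actual) if p != a)
    return wrong / len(actual)


def cost(mask, predicted, actual, alpha=0.99):
    """Assumed weighted sum of detection error and selected-feature ratio."""
    feature_ratio = sum(mask) / len(mask)
    return alpha * error_rate(predicted, actual) + (1.0 - alpha) * feature_ratio


# Hypothetical 4-attribute example (Spambase itself has 57 attributes).
mask = [1, 0, 1, 0]
row = [0.5, 0.0, 1.2, 3.4]
print(apply_mask(row, mask))
print(cost(mask, predicted=[1, 0, 1, 1], actual=[1, 0, 0, 1], alpha=0.9))
```

With a weight close to one, the detection error dominates and the feature count acts mainly as a tie-breaker between vectors with similar error.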