
A Systematic Literature Review on SMS Spam Detection Techniques

Authors: Lutfun Nahar Lota, B M Mainul Hossain

Journal: International Journal of Information Technology and Computer Science (IJITCS)

Issue: Vol. 9, No. 7, 2017

Free access

Spam SMSes are unsolicited messages to users that are disturbing and sometimes harmful. Many survey papers are available on email spam detection techniques, but SMS spam detection is a comparatively new area, and systematic literature reviews of it are insufficient. In this paper, we perform a systematic literature review of SMS spam detection techniques, considering the published research available from 2006 to 2016. We chose 17 papers for our study and reviewed their techniques, approaches and algorithms, their advantages and disadvantages, and their evaluation measures and datasets, and finally compared their results. Although SMS spam detection is more challenging than email spam detection because of regional content and the use of abbreviated words, none of the existing research addresses these challenges. There is considerable scope for future research in this area, and this survey can act as a reference point for the future direction of research.


SMS Spam Filtering, SMS Spam Detection, Systematic Literature Review, Machine Learning

Short URL: https://sciup.org/15012663

IDR: 15012663


Published Online July 2017 in MECS

Short Message Service (SMS) is the most frequently and widely used communication medium. In many parts of the world, the term “SMS” refers both to the user activity and to all types of short text messaging. SMS has become a medium for advertising and promoting products and for delivering banking updates, agricultural information, flight updates and internet offers. It is also employed in direct marketing, known as SMS marketing, which users sometimes find disturbing. Such messages are called spam SMS. Spam is one or more unsolicited messages, unwanted by the users, sent or posted as part of a larger collection of messages that all have substantially identical content. The purposes of SMS spam include advertising and marketing various products, circulating political messages, and spreading inappropriate adult content and internet offers. As a result, spam SMS flooding has become a serious problem all over the world. SMS spamming has gained popularity over other spamming channels such as email and Twitter because of the increasing popularity of SMS communication. Moreover, the opening rate of SMS is higher than 90%, with messages opened within 15 minutes of receipt, whereas the opening rate of email is only 20-25% within 24 hours of receipt [28]. A proper SMS spam detection technique is therefore necessary. Several studies address spam detection for email, Twitter, the web and social tagging, but very few address SMS spam detection. SMS spam detection is more challenging than email spam detection because of the restricted length of an SMS, the use of regional content and shortcut words, and the fact that an SMS carries less header information than an email.

We cannot use email spam detection techniques as-is for SMS spam detection; a proper SMS spam detection technique needs to be identified. This is an open and comparatively new research field with considerable scope for research. A Systematic Literature Review (SLR) is a necessary starting point for research in any field, and there is no SLR on this topic. For this reason, we set out to write an SLR on spam SMS detection. The purpose of this study is to review the current status of SMS spam detection: to identify the approaches and techniques used, their advantages and disadvantages, and their performance and performance measurement processes, using the available resources published within the period 2006-2016. This research summarizes the studies on SMS spam detection, establishes a baseline for future research, and gives researchers an overview of the area at a glance.

II. Background and Related Work

SMS spam detection is a comparatively new research area compared with email, social tagging, Twitter and web spam detection. Studies of spam detection include [1], [2], [3] and others; most of them were conducted after 2011. There are several established email spam detection techniques. Compared with email spam detection, SMS spam detection faces additional challenges such as restricted message size, the use of regional and shortcut words, and limited header information. These challenges need to be solved. There is scope for research in this field, and some research has been conducted on it. SMS spam filtering falls into several categories, such as white listing and black listing, content-based, non-content-based, collaborative approaches and challenge-response techniques [4], [5], [12], [29]. The techniques are used on the client side, on the server side or on both [4]. Several machine learning algorithms, such as Naïve Bayes, Support Vector Machine (SVM), Logistic Regression, Decision Trees and K-Nearest Neighbor, are used to classify SMSes as spam or as legitimate messages, known as ham. The machine learning algorithms, the filtering process and the filtering techniques are discussed in the following subsections.

A. Machine Learning Algorithms

The Bayesian approach is probabilistic: it starts with a prior belief, observes some data and then updates that belief. The probability that a word indicates spam or ham can be calculated from the frequency of that word in spam and ham messages using the Bayesian algorithm [30]. The algorithm also requires a prior probability to be assumed, which is a shortcoming of this approach.
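As a minimal sketch of the word-level calculation described above, the following Python snippet estimates P(spam | word) from word frequencies with Bayes' rule. The toy messages, the prior of 0.5, the add-one smoothing and the function names are illustrative assumptions and are not taken from [30].

```python
from collections import Counter

def word_counts(messages):
    """Count in how many messages each lowercase word appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def word_spam_probability(word, spam_msgs, ham_msgs, prior_spam=0.5):
    """Estimate P(spam | word) with Bayes' rule and add-one smoothing."""
    spam_counts = word_counts(spam_msgs)
    ham_counts = word_counts(ham_msgs)
    # Likelihoods P(word | class), smoothed so unseen words never give 0.
    p_word_spam = (spam_counts[word] + 1) / (len(spam_msgs) + 2)
    p_word_ham = (ham_counts[word] + 1) / (len(ham_msgs) + 2)
    numerator = p_word_spam * prior_spam
    return numerator / (numerator + p_word_ham * (1 - prior_spam))

# Toy, hand-made messages (assumed for illustration only).
spam = ["win a free prize now", "free entry call now"]
ham = ["are you free for lunch", "see you at home", "call me later"]

p_free = word_spam_probability("free", spam, ham)
p_home = word_spam_probability("home", spam, ham)
```

With these toy messages, "free" receives a spam probability above 0.5 and "home" one below it. Changing `prior_spam` shifts both values, which illustrates the dependence on the assumed prior noted above.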

Support Vector Machines are supervised learning models with associated learning algorithms that analyse data for classification and regression analysis. Given a set of training examples containing spam and legitimate SMSes, an SVM training algorithm builds a model that assigns new examples to the spam or the legitimate category. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on [31].

The binary logistic model is used to estimate the probability of a binary response based on one or more predictor (or independent) variables (features). Logistic regression can be used in SMS spam detection on the basis of different feature variables [32].
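The probability estimate of a binary logistic model can be sketched in a few lines. The three features, the weights and the bias below are hypothetical values chosen for illustration; in practice they would be fitted to a labelled SMS dataset.

```python
import math

def logistic_spam_probability(features, weights, bias):
    """Binary logistic model: P(spam) = sigmoid(bias + sum(w_i * x_i))."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical features: [contains "free", digit count / 10, contains a URL].
weights = [1.5, 2.0, 1.0]  # assumed, not fitted to any real dataset
bias = -2.0

p_promo = logistic_spam_probability([1, 1.1, 1], weights, bias)
p_chat = logistic_spam_probability([0, 0.0, 0], weights, bias)
```

A message is then labelled spam when the estimated probability exceeds a chosen threshold, commonly 0.5.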

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes. A decision tree can be used to decide whether a new message is spam or ham [33].

The k-nearest neighbors algorithm (k-NN) is a nonparametric method used for classification and regression. The input consists of the k closest training examples in the feature space. The output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors [34].
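The majority vote can be sketched as follows. The two-dimensional features (normalized message length and fraction of digits) and the training points are invented for illustration only.

```python
import math
from collections import Counter

def knn_classify(query, training, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    by_distance = sorted(training, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical features: (message length / 160, fraction of digit characters).
training = [
    ((0.95, 0.20), "spam"),
    ((0.90, 0.15), "spam"),
    ((0.85, 0.25), "spam"),
    ((0.20, 0.00), "ham"),
    ((0.30, 0.02), "ham"),
    ((0.25, 0.01), "ham"),
]

label_long = knn_classify((0.88, 0.18), training)
label_short = knn_classify((0.22, 0.00), training)
```

An odd value of k avoids ties in a two-class problem such as spam versus ham.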

A random forest grows many classification trees. To classify a new SMS from an input vector, the algorithm passes the input vector down each of the trees in the forest. Each tree gives a classification, which counts as a "vote" for that class. The forest chooses the classification that has the most votes [35].
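The voting step can be illustrated without any training. In the sketch below, three hand-written rules stand in for trained classification trees; a real random forest learns each tree from a random sample of the training data, which is omitted here. The rules and thresholds are assumptions made for illustration.

```python
from collections import Counter

# Each "tree" is a hand-written stand-in for a trained classification tree.
def tree_keywords(msg):
    # Votes spam if any keyword occurs as a substring of the message.
    return "spam" if any(w in msg.lower() for w in ("free", "win", "prize")) else "ham"

def tree_digits(msg):
    # Votes spam if the message contains at least five digit characters.
    return "spam" if sum(ch.isdigit() for ch in msg) >= 5 else "ham"

def tree_exclaim(msg):
    # Votes spam if the message contains at least two exclamation marks.
    return "spam" if msg.count("!") >= 2 else "ham"

FOREST = [tree_keywords, tree_digits, tree_exclaim]

def forest_classify(msg, forest=FOREST):
    """Return the class that receives the most votes from the trees."""
    votes = Counter(tree(msg) for tree in forest)
    return votes.most_common(1)[0][0]

verdict_promo = forest_classify("WIN a FREE prize!! Call 0800123456")
verdict_chat = forest_classify("Running late, see you at 7")
```

Each individual rule is also a very small decision tree in the sense of the preceding paragraph on decision trees: a single test whose outcome selects the class.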

B. Spam Filtering Process

A set of manually classified spam and ham messages is the input, or training set, for a spam filtering algorithm. The algorithm consists of the following steps [12].

Preprocessing: Removing irrelevant content, such as stop words, is part of data preprocessing.

Tokenization: Segmenting the message into words, characters or symbols, called tokens. There are different tokenization approaches, such as word tokenization, sentence tokenization, word or character n-grams and orthogonal sparse bigrams.

Representation: Converting the tokens to attribute-value pairs.

Selection: Selecting the attribute-value pairs that have an impact on classification rather than using all pairs.

Training: Training the algorithm with the selected attribute values.

Testing: Testing newly arrived data with the trained model.
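The six steps above can be sketched end to end in plain Python. This is an illustrative toy and not the procedure of [12]: the stop-word list, the selection criterion (difference in document frequency between classes), the multinomial Naive Bayes model and the four training messages are all assumptions.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "the", "to", "you", "for", "at", "is", "are", "me"}  # assumed

def preprocess(text):
    """Preprocessing: lowercase and replace non-alphanumeric runs by spaces."""
    return re.sub(r"[^a-z0-9 ]+", " ", text.lower())

def tokenize(text):
    """Tokenization: split on whitespace and drop stop words."""
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def represent(tokens):
    """Representation: attribute-value pairs (token -> count)."""
    return Counter(tokens)

def select_features(vectors, labels, top_n=5):
    """Selection: keep tokens whose document frequency differs most by class."""
    spam_df, ham_df = Counter(), Counter()
    for vec, label in zip(vectors, labels):
        (spam_df if label == "spam" else ham_df).update(vec.keys())
    vocab = set(spam_df) | set(ham_df)
    ranked = sorted(vocab, key=lambda t: (-abs(spam_df[t] - ham_df[t]), t))
    return set(ranked[:top_n])

def train(vectors, labels, features):
    """Training: per-class token counts restricted to the selected features."""
    model = {"spam": Counter(), "ham": Counter(),
             "docs": Counter(labels), "features": features}
    for vec, label in zip(vectors, labels):
        for tok, n in vec.items():
            if tok in features:
                model[label][tok] += n
    return model

def classify(model, vector):
    """Testing: pick the class with the higher smoothed log-probability."""
    total_docs = sum(model["docs"].values())
    n_features = len(model["features"])
    scores = {}
    for label in ("spam", "ham"):
        total = sum(model[label].values())
        score = math.log(model["docs"][label] / total_docs)
        for tok, n in vector.items():
            if tok in model["features"]:
                score += n * math.log((model[label][tok] + 1) / (total + n_features))
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training set (assumed for illustration).
train_msgs = [
    ("Free prize! Call now to claim", "spam"),
    ("Claim your free ringtone now", "spam"),
    ("Are you coming home for dinner", "ham"),
    ("Call me when you get home", "ham"),
]
labels = [label for _, label in train_msgs]
vectors = [represent(tokenize(preprocess(text))) for text, _ in train_msgs]
features = select_features(vectors, labels)
model = train(vectors, labels, features)

def predict(text):
    """Run a new message through the same steps and classify it."""
    return classify(model, represent(tokenize(preprocess(text))))

pred_spam = predict("Claim a free prize now")
pred_ham = predict("I will be home late")
```

Keeping every step as a separate function makes it easy to swap in a different tokenizer, for example character n-grams, or a different selection criterion without touching the rest of the pipeline.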

C. Content Based Filtering

Most of the work on SMS spam detection is content based [1], [3], [11], [12]. Content based filtering relies on the contents of an SMS, such as spam words, unusual distribution of punctuation and message length. Yadav et al. [1] proposed a user-centric approach that applies content based filtering with a Bayesian machine learning algorithm, together with user generated features such as blacklisting, white listing and preferred keywords, to filter unwanted SMSes and reduce the burden of notifications for a mobile user.

Narayan et al. [3] developed a two level stacked classifier to distinguish spam from legitimate SMSes. The first-level classifier records a subset of words whose individual probability is higher than a threshold. The second-level classifier is then invoked and takes the words chosen at the first level as input. They tried different combinations of machine learning classification algorithms at the two levels: Bayesian and SVM, SVM and Bayesian, Bayesian and Bayesian, and SVM and SVM.

Ishtiaq et al. [11] proposed an SMS spam classification algorithm that combines the Naive Bayes classifier with the Apriori algorithm, integrating association rule mining into Bayesian classification. Apriori retrieves the words that most frequently occur together; the Bayesian algorithm then calculates the probability of a word occurring independently and together with other words in spam or ham messages.
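The Apriori part of this idea can be sketched as two passes over the messages: first find frequent single words, then count only the pairs built from those words. This is a simplified illustration and not the implementation of [11]; the minimum support of 2 and the toy messages are assumptions.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(messages, min_support=2):
    """Two Apriori passes: frequent single words, then frequent word pairs."""
    token_sets = [set(m.lower().split()) for m in messages]
    word_support = Counter(w for toks in token_sets for w in toks)
    frequent_words = {w for w, n in word_support.items() if n >= min_support}
    pair_support = Counter()
    for toks in token_sets:
        # Apriori pruning: only pairs of frequent words can be frequent.
        candidates = sorted(toks & frequent_words)
        pair_support.update(combinations(candidates, 2))
    frequent_pairs = {p: n for p, n in pair_support.items() if n >= min_support}
    return frequent_words, frequent_pairs

# Toy spam messages (assumed for illustration).
spam_msgs = [
    "free prize call now",
    "call now for free entry",
    "win free prize today",
]
words, pairs = frequent_itemsets(spam_msgs)
```

The frequent pairs, such as ("call", "now") here, could then be treated as additional features whose spam and ham frequencies feed the Bayesian probability calculation alongside the single words.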

Gomez et al. [12] analysed to what extent the Bayesian filtering techniques used to block email spam can be applied to the problem of detecting and stopping mobile spam. They pre-processed the messages with different tokenization approaches, selected features and compared the effectiveness of different machine learning algorithms. They demonstrated that Bayesian filtering techniques can be effectively transferred from email to SMS spam with appropriate feature extraction.

D. Non-Content based filtering

Several proposed techniques use non-content based filtering [2], [7]. Warade et al. [2] detected spam messages by checking the mutual relation between the sender and receiver together with the content of the messages. If no mutual relation is found between sender and receiver and the message contains spam content, the system tags the message as spam and sends it to the spam box. If a mutual relation exists and the message has no spam content, the message goes directly to the inbox of the receiver's mobile. This solves the problems of balance deduction and wasted SMS memory. However, calculating the mutual relation alone is not a proper solution: a spam detection algorithm needs both a classification algorithm and this kind of feature extraction from contents.
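The routing rule described for [2] can be written as a short decision function. The definition of a mutual relation used here (both parties have messaged each other before), the keyword list and the "undecided" outcome for the two cases the text does not specify are all assumptions made for this sketch.

```python
SPAM_WORDS = {"free", "prize", "win", "offer"}  # assumed keyword list

def has_mutual_relation(sender, receiver, sent_log):
    """Assumed definition: both parties have messaged each other before."""
    return (sender, receiver) in sent_log and (receiver, sender) in sent_log

def has_spam_content(text):
    """True if any whitespace-separated word is in the keyword list."""
    return any(word in SPAM_WORDS for word in text.lower().split())

def route(sender, receiver, text, sent_log):
    """Apply the two rules stated in the text; flag the unspecified cases."""
    related = has_mutual_relation(sender, receiver, sent_log)
    spammy = has_spam_content(text)
    if not related and spammy:
        return "spam box"
    if related and not spammy:
        return "inbox"
    # The two mixed cases are not specified in the text.
    return "undecided"

# Hypothetical log of (sender, receiver) pairs seen so far.
sent_log = {("alice", "bob"), ("bob", "alice"), ("promo", "bob")}

r_promo = route("promo", "bob", "Win a free prize", sent_log)
r_friend = route("alice", "bob", "Lunch at noon?", sent_log)
r_mixed = route("alice", "bob", "Got a free ticket for you", sent_log)
```

The "undecided" branch makes the limitation noted above concrete: without a classification algorithm, the rule has nothing to say when the relation and the content disagree.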

Qian Xu et al. [7] investigated ways to detect spam message senders based on non-content features that include temporal and graph-topology information but exclude contents because of user-privacy issues. They focused on the problem of identifying professional spammers based on the overall message sending patterns. Furthermore, they only concentrated on finding SMS spam on the server side, as the client-side detection is mostly content based.

E. Feature Engineering

The success of machine learning depends mostly on appropriate feature selection [6]. Features can be both content based and non-content based. Refs. [2] and [8] focused only on non-content based features such as the mutual relation of sender and receiver, user black listing and white listing, and user preferred keywords, whereas some researchers considered only content based features [6]. A proper spam detection algorithm needs both content and non-content based features. Non-content based features include static, temporal and network features [7]. Content based features include word frequencies [11], keyword based features such as the presence of spam words, and stylistic features such as the count of exclamation marks, the count of alphanumeric words, the average word length and many others [15].
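A few of the keyword based and stylistic features named above can be extracted as follows. The keyword list and the interpretation of an "alphanumeric word" as a word containing both letters and digits are assumptions for this sketch and are not definitions taken from [15].

```python
import re

SPAM_WORDS = {"free", "win", "prize", "urgent"}  # assumed keyword list

def extract_features(text):
    """Return keyword and stylistic features for one SMS."""
    cleaned = [re.sub(r"\W+", "", w).lower() for w in text.split()]
    cleaned = [w for w in cleaned if w]
    # Assumed meaning of "alphanumeric word": letters and digits mixed.
    alnum = [w for w in cleaned if re.search(r"[a-z]", w) and re.search(r"\d", w)]
    return {
        "length": len(text),
        "exclamations": text.count("!"),
        "spam_words": sum(w in SPAM_WORDS for w in cleaned),
        "alphanumeric_words": len(alnum),
        "avg_word_length": sum(map(len, cleaned)) / len(cleaned) if cleaned else 0.0,
    }

sms_features = extract_features("URGENT! Txt2win a FREE prize!!")
```

The resulting dictionary is one attribute-value representation of the message and can be passed to any of the classifiers discussed in Section II.A.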

III. Methodology

A systematic review collects and critically analyses multiple research studies and summarizes the existing literature on a specific research domain [9]. A review of existing studies is often quicker and cheaper than embarking on a new study. Conducting an SLR requires following the steps described in [9]: formulating research questions, finding and analysing studies that relate to the questions, answering those questions, and presenting a summarized result of the literature survey. These steps are detailed in the following subsections.

A. The Need for a Systematic Literature Survey

Email spam detection is an established research field. Many studies and literature surveys have been carried out on email spam detection as well as on Twitter, web and social tag spam detection. Systematic literature surveys on SMS spam detection are insufficient because it is a comparatively new research area. Although SMS communication largely started around 2000, it gained popularity in 2006 and became even more popular with the rise of Android phones [19]. As the number of people using SMS as a communication medium increased, SMS spamming also became more popular among spammers. As a result, research on SMS spam detection emerged out of necessity, mainly after 2007. Our goals with this SLR are to collect proper background knowledge on the SMS spam detection field, to learn which algorithms are currently used for SMS spam detection and what their advantages and disadvantages are, to identify the evaluation measures for spam detection algorithms, to compare the accuracy of the algorithms, and to identify gaps in current research in order to suggest areas for further investigation. The motivation for this work is to establish a basis for any research on SMS spam detection, since any kind of research starts from a systematic literature review. This is the main rationale of this SLR.

B. Research Questions

Identifying the research questions is one of the important steps in an SLR. We have identified three research questions for this SLR. The questions and their motivations are presented in Table 1.

Table 1. Research Questions

Research Question	Motivation
RQ1. What are the current approaches of SMS spam detection?	To identify the algorithms used for SMS spam detection.
RQ2. What are the advantages and disadvantages of the algorithms?	To understand the convenience and drawbacks of the algorithms.
RQ3. What are the measurement policies of SMS Spam detection algorithms?	To identify existing measurement policies and metrics to evaluate the algorithms.

C. Searches for Studies

We first searched for the term “SMS spam detection” on Google Scholar. We then identified the keywords noted in the relevant papers, followed by alternative spellings and synonyms for the search terms. Some examples of the resulting search strings are: “SMS Spam”, “SMS Spam Filtering”, “Machine Learning”, “Security and Protection”, “Text Analysis”, “Security in Mobile Communication”, “Short Message Service”, “Naive Bayesian Algorithm”, and “Anti-Spam Filtering”.

D. Study Selection Procedure

To select relevant studies we primarily searched Google Scholar and collected some papers from it. Other conference and journal sources, such as IEEE Xplore, ACM, IJCSI and ITJ, were found through Google Scholar. The list of journals and conferences in which we found our relevant papers is presented in Table 2. We also performed manual Google searches. The selected papers contain many references; we searched for the referenced papers and took some of them as relevant papers. We also used Google Scholar's “related articles” and “cited by” features in our search procedure.

Table 2. Sources Searched

No.	Abbreviation	Source
1	IEEE	IEEE Xplore
2	ACM	Association for Computing Machinery
3	IJCSI	International Journal of Computer Science Issues
4	IJISS	International Journal of Information Security Science
5	ITJ	Information Technology Journal
6	IJRAT	International Journal of Research in Advent Technology
7	IJITCS	International Journal of Information Technology and Computer Science
8	CAE	International CAE Conference
9		Google Scholar
10		Google
11		Computers and Security
12		Expert Systems With Applications

E. Study Selection Criteria

There are inclusion and exclusion criteria for a systematic literature review. SMS spam detection is a new research area, and there are not many relevant studies in this field. That is why we chose most of the available articles.

F. Data Extraction

Table 3 contains the extraction form used to gather information from the selected studies. It records, for each chosen paper, items such as the paper type, the venue in which it was published, the publication year, and the motivation and methodology of the paper.

Table 3. Data Extraction Table

Data item	Value
Study identifier	S#
Paper type	Conference/Journal
Name of e-library	e.g. ACM
Year of publication	2006-2016
Name of journal	e.g. ITJ
Which RQ was answered	RQ1/RQ2/RQ3
Outcomes of the paper	Summarized literature survey on SMS spam detection
Motivation of paper	Create a baseline for SMS spam detection
Method of paper	Techniques/Approaches/Algorithms
Validation of paper	Analysis model

IV. Validation of the Study

Our SLR was conducted to investigate all the approaches and techniques used in SMS spam detection. The threats to the validity of our review are possible selection bias and a lack of sufficient resources. We tried to reach all possible and relevant information resources; however, some resources might not have been published directly, and some are not available for public use.

V. Result Analysis

At first we manually searched Google using the topic “Spam Detection” to gain an overview of the spam detection field. This returned many papers on email, Twitter, web and SMS spam detection. We then narrowed the search to SMS spam detection only, which returned a few papers. Although SLRs exist for other spam detection techniques, none of the search strings produced an SLR for SMS spam detection. Through our study selection procedure we chose 17 papers (S1-S17), published in different conferences and journals, relating only to SMS spam detection. Among the 17 studies, S1 and S11 are by the same authors, and S11 is an extension of S1. Ref. [20] is a journal article that extends the conference paper S10, and S12 is an extension of [8]. As a result, we studied 19 papers in total. Table 4 lists, for each reviewed paper, the study ID with its reference number from the reference section, the publication year, the conference or journal in which the paper was published, and the research questions it answered.

Table 4. Summary of the Reviewed Literature

Study ID	Year	Conference/Journal	Research Questions Answered
S1 [1]	2012	IEEE	RQ1, RQ2, RQ3
S2 [2]	2014	IJRAT	RQ1
S3 [3]	2013	ACM	RQ1, RQ2, RQ3
S4 [5]	2010	Computers and Security	RQ1, RQ2, RQ3
S5 [10]	2012	IJCSI	RQ1, RQ2, RQ3
S6 [11]	2014	IJMLC	RQ1, RQ2, RQ3
S7 [12]	2006	ACM	RQ1, RQ2, RQ3
S8 [7]	2012	IEEE	RQ1, RQ3
S9 [13]	2015	CAE	RQ1, RQ2, RQ3
S10 [14]	2011	ACM	RQ1, RQ3
S11 [15]	2011	ACM	RQ1, RQ3
S12 [16]	2007	ACM	RQ1, RQ3
S13 [17]	2008	ITJ	RQ1, RQ3
S14 [18]	2014	ASTL	RQ1
S15 [4]	2015	Information Security Journal	RQ1, RQ2, RQ3
S16 [21]	2014	JBASR	RQ1, RQ2, RQ3
S17 [22]	2013		RQ1, RQ3

Table 5. Summary of the Techniques Used by the Literature

Study ID	Techniques/ Algorithms/ Approaches	Description
S1 [1]	Content Based (Bayesian)	SMSAssassin: an Android application that uses content based filtering with user generated features to automatically filter spam SMSes into different tabs.
S2 [2]	Mutual Relation	Based on the previous relation between sender and receiver.
S3 [3]	Two level stacked classifier	The first level records words whose probability exceeds a threshold and passes them to the second level; evaluated with Bayesian at both levels and with Bayesian at the first level and SVM at the second.
S4 [5]	Hybrid Approach (Content Based and Challenge-Response)	Uses upper and lower threshold bounds to introduce an uncertain region for Bayesian filtering; messages that fall into the uncertain region are sent to the challenge-response technique, which is user query based.
S5 [10]	Artificial Immune System	The phases of the algorithm are dataset building, message matching and affinity calculation.
S6 [11]	Bayesian and Apriori Algorithm	Apriori retrieves the words that most frequently occur together; Bayesian then calculates the probability of a word occurring independently and together with other words in spam or ham messages.
S7 [12]	Bayesian	Message pre-processing and encoding, feature selection and then application of the classification algorithm.
S8 [7]	Non-Content Based	Non-content based features such as static, temporal and network features, followed by classification with SVM and KNN.
S9 [13]	Bayesian with modified formula	The total number of spam SMSes is divided by the total occurrences of a word in spam/ham messages, instead of dividing the occurrences of the word by the total number of spam and ham messages. The two formulas are also combined.
S10 [14]	Tokenization with various classifiers	Two kinds of tokenization, separated by blanks and separated by special characters, are used for classification with various machine learning algorithms.
S11 [15]	Bayesian and SVM	Tested the feasibility of applying both algorithms in the mobile application domain.
S12 [16]	Content Based filtering	Machine learning algorithms with lexical feature expansion such as words, orthogonal sparse word bigrams, character bigrams and trigrams.
S13 [17]	Feature Updating Protocol	Features are updated at regular intervals, based on newly arrived SMSes, using methods such as document frequency, term frequency, information gain and mutual information.
S14 [18]	Virtual Ratio on Naïve Bayes, J-48 and logistic regression	VR is the relative ratio of the average frequency of a keyword in spam and ham messages.
S15 [4]	Artificial Immune System	Consists of five modules: Innate mechanism, User feedback, Quarantine, Tokenizer and Immune Engine.
S16 [21]	Bayesian, Multilayer Perceptron Algorithm, Decision Tree	Selected four features and applied classification algorithms to them; Bayesian performed best.
S17 [22]	Content based Filtering	Feature extraction and classification algorithms such as Bayesian, SVM, K-Nearest Neighbour, Random Forest and AdaBoost. Their results conclude that SVM outperforms the other algorithms.
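The hybrid approach listed for S4 in Table 5 can be sketched as a three-way decision on a Bayesian spam score. The threshold values 0.3 and 0.7, the score scale of 0 to 1 and the `challenge` callback are hypothetical; Table 5 states only that an upper and a lower bound define an uncertain region whose messages go to a challenge-response step.

```python
def hybrid_filter(spam_score, lower=0.3, upper=0.7, challenge=None):
    """Route a message using a spam score in [0, 1].

    Scores at or below `lower` are accepted as ham, scores at or above
    `upper` are rejected as spam, and scores in between fall into the
    uncertain region and are passed to a challenge-response step.
    """
    if spam_score <= lower:
        return "ham"
    if spam_score >= upper:
        return "spam"
    if challenge is None:
        return "uncertain"
    # Assumed convention: the callback returns True if the sender
    # answered the challenge correctly.
    return "ham" if challenge() else "spam"

d_low = hybrid_filter(0.05)
d_high = hybrid_filter(0.95)
d_mid = hybrid_filter(0.50)
d_answered = hybrid_filter(0.50, challenge=lambda: True)
d_failed = hybrid_filter(0.50, challenge=lambda: False)
```

Widening the gap between `lower` and `upper` sends more messages to the challenge-response step, which relates to the traffic and user interaction cost listed for S4 in Table 6.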

A. RQ1: What are the current approaches of SMS spam detection?

The techniques, approaches and algorithms used in spam detection are described briefly, with their study IDs, in Table 5. The table shows that most of the approaches use content based filtering and classify with machine learning algorithms, mostly Bayesian and SVM. Studies S1, S3, S6-S7, S9-S12, S14 and S16-S17 used content based filtering. S4 is a hybrid approach, S8 is non-content based, and S5 and S15 are based on artificial immune systems. Most of the content based filters used Bayesian as the classification algorithm.

B. RQ2: What are the advantages and disadvantages of the algorithms?

Table 6 presents the results for RQ2, listing the advantages and drawbacks of the approaches. The table suggests that content based filtering is more convenient than non-content based and server side algorithms. Server side algorithms suffer from implementation complexity. Feature selection is also an important task for machine learning algorithms to work correctly. One important drawback is that some approaches do not use a classification algorithm and focus only on user generated features. A classification algorithm is necessary for better accuracy.

Table 6. Advantages and Disadvantages of Used Techniques

Study ID	Advantages	Disadvantages
S1 [1]	Combination of machine learning algorithms with user generated features	Users need to select features manually
S2 [2]		No classification algorithm is used
S3 [3]	Classification based on two algorithms	Threshold selection
S4 [5]	Combination of client and server side algorithms	Challenge-response technique suffers from server side traffic and user interaction problems
S5 [10]	As accurate as Naïve Bayesian with necessary feature extraction	Complex implementation
S6 [11]	Incorporates the Apriori algorithm
S7 [12]	Uses a weighting mechanism to reduce false negatives
S8 [7]		Suffers from implementation complexity
S9 [13]	Combination of two formulas gives better results in terms of false positives
S10 [14]	Concludes that SVM outperforms other algorithms and creates a baseline for further comparison
S11 [15]	Although SVM gives better results in spam identification, Bayesian is more feasible for mobile applications	Extensive feature engineering is needed for better accuracy
S12 [16]	Demonstrates the need for spam filtering despite established email spam filtering
S14 [18]	Lightweight and focuses on runtime
S15 [4]		Server side, complex and suffers from updating issues
S16 [21]		Implementation complexity

C. RQ3: What are the measurement policies of SMS Spam detection algorithms?

The accuracy of SMS spam detection needs to be measured. Table 7 lists the method or metric used to evaluate the algorithm in each study. Calculating accuracy from the confusion matrix is one of the most commonly used measurement methods for classification algorithms. S3-S9, S11, S15 and S17 used accuracy to measure their algorithms. Receiver operating characteristics (ROC) and area under the curve (AUC) were also used to demonstrate algorithm accuracy; S7, S8, S11 and S12 used ROC and AUC. True positive rate, false positive rate, F-measure, precision and recall, which can be calculated from the confusion matrix, are further measures for classification algorithms, and some of the studies used them. S2 and S14 do not use any evaluation measure.
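The confusion-matrix measures named above follow from four counts, with spam treated as the positive class. The sketch below uses the standard definitions; the example counts are invented and do not come from any of the reviewed studies.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Compute common measures from confusion-matrix counts (spam = positive).

    Assumes every denominator is non-zero.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called the true positive rate
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "false_positive_rate": fp / (fp + tn),
        "f_measure": 2 * precision * recall / (precision + recall),
        "mcc": (tp * tn - fp * fn) / mcc_den,
    }

# Invented counts for illustration: 100 spam and 900 ham test messages.
scores = evaluate(tp=90, fp=9, tn=891, fn=10)
```

In the terminology of Table 7, Spam Caught corresponds to the true positive rate and Blocked Ham to the false positive rate.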

D. Dataset Description

A training dataset is needed for any kind of machine learning classification algorithm, and the results of the algorithm depend on that dataset. Spam detection algorithms therefore cannot run without a dataset. Table 8 lists the publicly available datasets used in different studies, with a reference to where each dataset is available and statistics such as the total number of SMSes and the numbers of spam and ham messages.

E. Performance Comparison

Most of the results of the reviewed studies indicate that Bayesian filtering is more suitable for spam detection. S3 showed that a two level stacked classifier, using the dataset referenced in [25], gives a better accuracy of 99% with thresholds 0.4 and 0.6 than the single classifiers. The hybrid approach of S4 demonstrates an accuracy of 95%. S15 and S5, based on artificial immune systems, show accuracies of 99% and 98% respectively. S6 gives 98%-100% on the dataset [26], but the authors did not consider all the data; they chose small portions of the dataset, and this accuracy is achievable only for some specific parts of it. S9 shows 89% accuracy on a Farsi SMS dataset that is not publicly available, using the modified Bayesian formula. An accuracy of 97% is achieved by SVM, with a Spam Caught rate of 83.10% and a Blocked Ham rate of 0.18%, on the dataset [26], whereas S17 achieves 98% accuracy with SVM on the same dataset [26], with a Spam Caught rate of 92% and a Blocked Ham rate of 0.31%. This observation shows that results depend not only on classification algorithms and datasets but also on data preprocessing and the feature selection process. S17 also demonstrates 98% accuracy with Bayesian, with a Spam Caught rate of 94% and a Blocked Ham rate of 0.51%. S11 shows 97% ham accuracy and 72.5% spam accuracy with Bayesian, and 93% ham accuracy and 86% spam accuracy with SVM, on a dataset that is not publicly available. S16 showed 92% correctly classified instances and 8% incorrectly classified instances with Bayesian, which is better than Multilayer Perceptron and Decision Tree.
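The relation between overall accuracy, Spam Caught rate and Blocked Ham rate can be checked with a short calculation. The sketch assumes, purely for illustration, that each evaluation set has the same class proportions as the full dataset [26] in Table 8 (747 spam and 4827 ham messages); the studies may have used different splits.

```python
def accuracy_from_rates(n_spam, n_ham, spam_caught, blocked_ham):
    """Overall accuracy implied by Spam Caught and Blocked Ham rates (fractions)."""
    correct_spam = n_spam * spam_caught
    correct_ham = n_ham * (1.0 - blocked_ham)
    return (correct_spam + correct_ham) / (n_spam + n_ham)

# Class counts of dataset [26] from Table 8; rates as quoted in the text.
acc_svm_first = accuracy_from_rates(747, 4827, 0.8310, 0.0018)
acc_svm_s17 = accuracy_from_rates(747, 4827, 0.92, 0.0031)
acc_bayes_s17 = accuracy_from_rates(747, 4827, 0.94, 0.0051)
```

Under this assumption the implied accuracies fall between 97% and 99%, in line with the reported figures. Because ham messages dominate the dataset, a difference of several points in the Spam Caught rate moves overall accuracy by only about one point, which is why accuracy alone can hide differences between filters.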

Table 7. Evaluation Measures of the Algorithms

Study ID	Evaluation Measure
S1 [1]	No evaluation measure; only demonstrates the application
S2 [2]
S3 [3]	Precision, Recall, F-measure and Accuracy
S4 [5]	Traffic amount, Accuracy
S5 [10]	Accuracy, False Positive Rate
S6 [11]	Accuracy
S7 [12]	ROCCH
S8 [7]	False Positive Rate, AUC
S9 [13]	Confusion matrix, Precision, Accuracy, F-measure
S10 [14]	Spam Caught/True Positive, Blocked ham/False positive, Accuracy and Matthews Correlation Coefficient (MCC)
S11 [15]	Ham, Spam identification Accuracy and Area Under the Curve (AUC)
S12 [16]	ROC, AUC
S15 [4]	Confusion Matrix, Accuracy, AUC
S16 [21]	Correctly and Incorrectly Classified Instances
S17 [22]	Spam Caught (SC), Blocked Ham (BH), Accuracy (ACC)

Table 8. Dataset Description

Study ID	Available At	Total No. of Messages	Hams	Spams
S1 [1]	[24]	2000	1000	1000
S3 [3]	[25]	1450	730	721
S4 [5]			85.32%	14.75%
S6 [11]	[26]	5574	4827	747
S7 [12]	[27]
S10 [14]	[26]	5574	4827	747
S11 [15]		4318	2195	2123
S12 [16]
S14 [18]	[26]	5574	4827	747
S15 [4]		5240	2890	2350
S17 [22]	[26]	5574	4827	747

VI. Discussion

In light of the above, we can say that most of the studies answered RQ1. They mostly used content based filtering with various machine learning algorithms. Eleven studies used content based filtering, two used artificial immune systems, one used a hybrid approach, two focused on feature engineering and two focused on real-world datasets. Content based filtering suffers from challenges such as short content, abbreviated words and the safety of user content. All of the studies tried to solve some of the challenges of SMS spam detection: some addressed the extraction of real-world data, some proposed a hybrid approach to improve accuracy, and some tried to overcome the additional challenges relative to email spam detection. None of the techniques solved the challenge posed by regional content and shortcut words. These open challenges invite future researchers to investigate the existing approaches and techniques further. Most of the studies used Bayesian filtering as the classification algorithm. The Bayesian algorithm suffers from the traditional threshold selection problem, dataset dependency and the need to assume a prior probability. Despite these shortcomings, Bayesian is reported as the most suitable algorithm for spam filtering. Solving these problems is a possible research direction and could improve the performance of the Bayesian algorithm. SVM also gives good accuracy but suffers from implementation complexity. Other algorithms are less suitable for SMS spam filtering.

VII. Conclusion

This paper presents the results of a systematic literature review on SMS spam detection techniques. We chose a total of 17 research papers in this field and reviewed their proposed techniques, their advantages and disadvantages, and the challenges they addressed. We also examined their evaluation procedures. We presented information on the publicly available datasets, which are a prerequisite for a spam filtering algorithm, and discussed the background of the topic. In our systematic literature review, we described the search and selection procedure, the publication years of the studies and the journals and conferences in which they were published. Our results summarize the techniques used and the advantages and disadvantages of the approaches, and we compared the performance reported in the studied literature. In addition, we found that none of the studies solves the challenges posed by regional content and shortcut words. We also discussed the problems of traditional machine learning algorithms. There is scope for further research in this field, and our systematic literature review can serve as a reference point for future research.

References

K. Yadav, S. K. Saha, P. Kumaraguru, and R. Kumra, “Take control of your smses: Designing an usable spam sms filtering system,” in 2012 IEEE 13th International Conference on Mobile Data Management. IEEE, 2012, pp. 352–355.
S. J. Warade, P. A. Tijare, and S. N. Sawalkar, “An approach for sms spam detection.”
A. Narayan and P. Saxena, “The curse of 140 characters: evaluating the efficacy of sms spam detection on android,” in Proceedings of the Third ACM workshop on Security and privacy in smartphones & mobile devices. ACM, 2013, pp. 33–42.
A. S. Onashoga, O. O. Abayomi-Alli, A. S. Sodiya, and D. A. Ojo, “An adaptive and collaborative server side sms spam filtering scheme using artificial immune system,” Information Security Journal: A Global Perspective, vol. 24, no. 4-6, pp. 133–145, 2015.
J. W. Yoon, H. Kim, and J. H. Huh, “Hybrid spam filtering for mobile communication,” Computers & Security, vol. 29, no. 4, pp. 446–459, 2010.
S. J. Delany, M. Buckley, and D. Greene, “Sms spam filtering: methods and data,” Expert Systems with Applications, vol. 39, no. 10, pp. 9899–9908, 2012.
Q. Xu, E. W. Xiang, Q. Yang, J. Du, and J. Zhong, “Sms spam detection using noncontent features,” IEEE Intelligent Systems, vol. 27, no. 6, pp. 44–51, 2012.
G. V. Cormack, J. M. G. Hidalgo, and E. P. Sánz, “Feature engineering for mobile (sms) spam filtering,” in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2007, pp. 871–872.
S. Keele, “Guidelines for performing systematic literature reviews in software engineering,” in Technical report, Ver. 2.3 EBSE Technical Report. EBSE, 2007.
T. M. Mahmoud and A. M. Mahfouz, “Sms spam filtering technique based on artificial immune system,” IJCSI International Journal of Computer Science Issues, vol. 9, no. 1, pp. 589–597, 2012.
I. Ahmed, D. Guan, and T. C. Chung, “Sms classification based on naïve bayes classifier and apriori algorithm frequent itemset,” International Journal of Machine Learning and Computing, vol. 4, no. 2, p. 183, 2014.
J. M. Gómez Hidalgo, G. C. Bringas, E. P. Sánz, and F. C. García, “Content based sms spam filtering,” in Proceedings of the 2006 ACM symposium on Document engineering. ACM, 2006, pp. 107–114.
M. Poorshahsavari and O. Pourgalehdari, “Enhancing the rate of accuracy and precision in spam filtering in farsi sms.”
T. A. Almeida, J. M. G. Hidalgo, and A. Yamakami, “Contributions to the study of sms spam filtering: new collection and results,” in Proceedings of the 11th ACM symposium on Document engineering. ACM, 2011, pp. 259–262.
K. Yadav, P. Kumaraguru, A. Goyal, A. Gupta, and V. Naik, “Smsassassin: crowdsourcing driven mobile-based system for sms spam filtering,” in Proceedings of the 12th Workshop on Mobile Computing Systems and Applications. ACM, 2011, pp. 1–6.
G. V. Cormack, J. M. Gómez Hidalgo, and E. P. Sánz, “Spam filtering for short messages,” in Proceedings of the sixteenth ACM conference on Conference on information and knowledge management. ACM, 2007, pp. 313–320.
Q. Sun, H. Qiao, and Z. Luo, “The feature updating algorithm for short message content filtering,” Information Technology Journal, vol. 7, no. 5, pp. 790–795, 2008.
S.-E. Kim, J.-T. Jo, and S.-H. Choi, “A spam message filtering method: focus on run time,” 2014.
A Brief History of Text Messaging, http://mashable.com/2012/09/21/text-messaging-history/#F4V9_15QGkqx. [Last Accessed: 05-11-2016]
Almeida, Tiago, José María Gómez Hidalgo, and Tiago Pasqualini Silva. "Towards sms spam filtering: Results under a new dataset." (2013): 1-18.
Mujtaba, G., and M. Yasin. "SMS spam detection using simple message content features." J. Basic Appl. Sci. Res 4 (2014): 275-279.
Shirani-Mehr, Houshmand. "SMS spam detection using machine learning approach." (2013): 1-4.
Ahmed, Ishtiaq, et al. "Semi-supervised learning using frequent itemset and ensemble learning for SMS classification." Expert Systems with Applications 42.3 (2015): 1065-1073.
http://precog.iiitd.edu.in/resources.html [Last Accessed: 05-11-2016]
https://github.com/okkhoy/SpamSMSData. [Last Accessed: 05-11-2016]
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/ [Last Accessed: 05-11-2016]
http://www.esp.uem.es/jmgomez/smsspamcorpus/ [Last Accessed: 05-11-2016]
https://www.cloudmark.com/en/s/resources/whitepapers/sms-spam-overview [Last Accessed: 05-11-2016]
Iqbal, Muhammad, et al. "Study on the Effectiveness of Spam Detection Technologies." (2016).
http://fastml.com/bayesian-machine-learning/ [Last Accessed: 05-11-2016]
https://en.wikipedia.org/wiki/Support_vector_machine [Last Accessed: 05-11-2016]
https://en.wikipedia.org/wiki/Logistic_regression [Last Accessed: 05-11-2016]
https://en.wikipedia.org/wiki/Decision_tree [Last Accessed: 05-11-2016]
https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm [Last Accessed: 05-11-2016]
https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm [Last Accessed: 05-11-2016]
