Cost-Sensitive Classifier for Spam Detection on News Media Twitter Accounts (revised April 2017)

Tur, Georvic; Homsi, Masun Nabhan

dc.creator	Tur, Georvic
dc.creator	Homsi, Masun Nabhan
dc.date	2017-09
dc.date	2017
dc.date	2017-10-26T15:21:19Z
dc.identifier	http://sedici.unlp.edu.ar/handle/10915/63208
dc.identifier	http://www.clei2017-46jaiio.sadio.org.ar/sites/default/files/Mem/SLMDI/SLMDI-07.pdf
dc.description	Abstract: Social media are increasingly being used as sources in mainstream news coverage. However, because news updates so rapidly, it is very easy to fall into the trap of accepting everything as true. Spam content usually refers to information that goes viral and skews users' views on a subject. Despite recent advances in spam analysis methods, extracting accurate and useful information from tweets remains a challenging task. This paper introduces a new approach for classifying spam and non-spam tweets using a Cost-Sensitive Classifier that includes Random Forest. The approach consisted of three phases: preprocessing, classification and evaluation. In the preprocessing phase, tweets were first annotated manually and then four different sets of features were extracted from them. In the classification phase, four machine learning algorithms were first cross-validated to determine the best base classifier for spam detection. Then, the class imbalance problem was addressed by resampling and by incorporating arbitrary misclassification costs into the learning process. In the evaluation phase, the trained algorithm was tested with unseen tweets. Experimental results showed that the proposed approach helped mitigate overfitting and reduced classification error, achieving an overall accuracy of 89.14% in training and 76.82% in testing.
dc.description	Sociedad Argentina de Informática e Investigación Operativa (SADIO)
dc.format	application/pdf
dc.language	en
dc.rights	http://creativecommons.org/licenses/by-sa/3.0/
dc.rights	Creative Commons Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0)
dc.subject	Computer Science
dc.subject	spam classification
dc.subject	twitter
dc.subject	topic discovering
dc.subject	cost-sensitive classifier
dc.subject	random forest
dc.title	Cost-Sensitive Classifier for Spam Detection on News Media Twitter Accounts (revised April 2017)
dc.type	Conference object
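
The abstract mentions incorporating misclassification costs into the learning process but this record gives no further detail. As an illustration only, the following minimal sketch shows one common way of doing so, the minimum-expected-cost decision rule applied to a base classifier's class probabilities. The cost values, the probabilities, and the function name are hypothetical and are not taken from the paper, whose actual method may differ.

```python
def min_expected_cost_label(probs, cost):
    """Pick the label with the lowest expected misclassification cost.

    probs: dict mapping each true label to its estimated probability
    cost:  dict mapping (true label, predicted label) to a cost
    """
    labels = list(probs)
    # Expected cost of predicting `pred`, averaged over the possible true labels.
    expected = {
        pred: sum(probs[true] * cost[(true, pred)] for true in labels)
        for pred in labels
    }
    return min(expected, key=expected.get), expected


# Hypothetical cost matrix (assumption): missing a spam tweet costs
# three times as much as flagging a legitimate tweet.
COST = {
    ("spam", "spam"): 0.0,
    ("spam", "non_spam"): 3.0,
    ("non_spam", "spam"): 1.0,
    ("non_spam", "non_spam"): 0.0,
}

# Hypothetical base-classifier output for a single tweet.
probs = {"spam": 0.35, "non_spam": 0.65}

label, expected = min_expected_cost_label(probs, COST)
print(label, expected)
```

With these assumed numbers, a plain most-probable-class rule would label the tweet as non-spam, while the cost-weighted rule labels it as spam because a missed spam tweet is penalised more heavily.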

This item belongs to the following institution

Universidad Nacional de La Plata (Argentina)