Spam Filtering: How The Dimensionality Reduction Affects The Accuracy Of Naive Bayes Classifiers

Almeida T.A.; Almeida J.; Yamakami A.

dc.creator	Almeida T.A.
dc.creator	Almeida J.
dc.creator	Yamakami A.
dc.date	2011
dc.date	2015-06-30T20:22:16Z
dc.date	2015-11-26T14:48:30Z
dc.date.accessioned	2018-03-28T21:59:18Z
dc.date.available	2018-03-28T21:59:18Z
dc.identifier	Journal of Internet Services and Applications, v. 1, n. 3, p. 183-200, 2011.
dc.identifier	18674828
dc.identifier	10.1007/s13174-010-0014-7
dc.identifier	http://www.scopus.com/inward/record.url?eid=2-s2.0-79952048598&partnerID=40&md5=76ae5955329f69963766249d53f94c49
dc.identifier	http://www.repositorio.unicamp.br/handle/REPOSIP/107703
dc.identifier	http://repositorio.unicamp.br/jspui/handle/REPOSIP/107703
dc.identifier	2-s2.0-79952048598
dc.identifier.uri	http://repositorioslatinoamericanos.uchile.cl/handle/2250/1253631
dc.description	E-mail spam has become an increasingly serious problem with a large economic impact on society. Fortunately, several approaches can automatically detect and remove most of these messages, and the best-known techniques are based on Bayesian decision theory. However, such probabilistic approaches often suffer from a well-known difficulty: the high dimensionality of the feature space. Many term-selection methods have been proposed to avoid the curse of dimensionality. Nevertheless, it remains unclear how the performance of Naive Bayes spam filters depends on the scheme used to reduce the dimensionality of the feature space. In this paper, we study the performance of many term-selection techniques with several different models of Naive Bayes spam filters. Our experiments were carefully designed to ensure statistically sound results. Moreover, we analyze the measures usually employed to evaluate the quality of spam filters. Finally, we also investigate the benefits of using the Matthews correlation coefficient as a measure of performance. © The Brazilian Computer Society 2010.
dc.description	1
dc.description	3
dc.description	183
dc.description	200
dc.language	en
dc.relation	Journal of Internet Services and Applications
dc.rights	open
dc.source	Scopus
dc.title	Spam Filtering: How The Dimensionality Reduction Affects The Accuracy Of Naive Bayes Classifiers
dc.type	Journal articles

This item belongs to the following institution

Universidade Estadual de Campinas (Brazil)