Detección de URLs maliciosas por medio de técnicas de aprendizaje automático

Céspedes Maestre, María Martha

dc.contributor	Salcedo Parra, Octavio José
dc.contributor	Salazar Herrera, Carlos Alberto
dc.creator	Céspedes Maestre, María Martha
dc.date.accessioned	2021-06-25T03:01:19Z
dc.date.available	2021-06-25T03:01:19Z
dc.date.created	2021-06-25T03:01:19Z
dc.date.issued	2021-05
dc.identifier	https://repositorio.unal.edu.co/handle/unal/79722
dc.identifier	Universidad Nacional de Colombia
dc.identifier	Repositorio Institucional Universidad Nacional de Colombia
dc.identifier	https://repositorio.unal.edu.co/
dc.description.abstract	En la actualidad, los ciberdelincuentes perpetran ataques web de forma sencilla, en los que aplican diferentes vectores para poner en peligro la seguridad de la información y en los que entienden al ser humano como un flanco fácil para lograr sus objetivos. Generalmente, los usuarios de internet deben realizar una acción que permita el éxito del ataque, por ejemplo, dar clic a alguna URL. Es por lo anterior, que muchos esfuerzos están dirigidos a encontrar técnicas que mitiguen esta problemática y se apuestan grandes cantidades de dinero en generar soluciones. Tomando como referencia el uso de listas negras, la clasificación heurística, y, prestando especial atención a las técnicas de aprendizaje automático capaces de detectar ataques de día cero, en el presente trabajo se despliega un diseño de detección de URLs maliciosas, haciendo uso de criterios léxicos y de ofuscación de la URL. Estas, clasificadas por medio de técnicas de aprendizaje automático como Logistic Regression, Support Vector Machine y Random Forest; demostrando que los tres clasificadores implementados mantienen una relación de eficacia y rendimiento con porcentajes de precisión del 98%, y, tiempos de respuesta satisfactorio. Es preciso aclarar que Random Forest puede estar sujeto a mejoras, ya que se pretende detectar de manera automática las URLs maliciosas y este clasificador tarda en promedio 16 segundos en hacerlo. Como resultado general del diseño, se obtiene un modelo de libre distribución que puede ser utilizado de forma masiva por diferentes usuarios en la red, capaz de detectar de forma precisa URLs maliciosas.
dc.description.abstract	Today, cybercriminals can carry out web attacks with little effort, using a variety of vectors to compromise information security and treating the human user as an easy target. In general, an attack succeeds only if the Internet user performs some action, for example clicking on a URL. For this reason, considerable effort and large sums of money are invested in techniques and solutions that mitigate the problem. Taking blacklists and heuristic classification as a reference, and paying special attention to machine learning techniques capable of detecting zero-day attacks, this work presents a design for detecting malicious URLs based on lexical and URL-obfuscation criteria. The URLs are classified with the machine learning techniques Logistic Regression, Support Vector Machine and Random Forest. The results show that the three implemented classifiers balance effectiveness and performance, with an accuracy of 98% and satisfactory response times. Random Forest may nonetheless need improvement, since the goal is to detect malicious URLs automatically and this classifier takes 16 seconds on average to do so. The overall result of the design is a freely distributable model, capable of accurately detecting malicious URLs, that can be used at scale by different users on the network.
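The abstract mentions lexical and URL-obfuscation criteria without listing them. As a minimal sketch of what such criteria could look like, the Python snippet below derives a few numeric features from a URL string. The feature names, the regular expressions and the choice of Python are assumptions made for illustration; they are not the feature set or the code of the thesis.

```python
import math
import re
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical patterns: illustrative obfuscation signals only,
# not the criteria defined in the thesis.
IPV4_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")
PERCENT_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def shannon_entropy(text: str) -> float:
    """Return the Shannon entropy of the characters in text, in bits."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def extract_features(url: str) -> dict:
    """Map a URL string to a small dictionary of numeric features."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    return {
        # Lexical features: simple counts over the raw string.
        "url_length": len(url),
        "host_length": len(host),
        "digit_count": sum(ch.isdigit() for ch in url),
        "dot_count_in_host": host.count("."),
        "entropy": shannon_entropy(url),
        # Obfuscation features: raw IP host, '@' in the authority part,
        # and percent-encoded characters.
        "host_is_ipv4": int(bool(IPV4_HOST.match(host))),
        "has_at_sign": int("@" in parts.netloc),
        "percent_escapes": len(PERCENT_ESCAPE.findall(url)),
    }


if __name__ == "__main__":
    for sample in ("http://192.168.0.1/%61dmin", "http://user@example.com/login"):
        print(sample, extract_features(sample))
```

Feature dictionaries of this kind could then be turned into numeric vectors and passed to classifiers such as the three named in the abstract; how the thesis actually encodes and trains on its features is not stated in this record.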
dc.language	spa
dc.publisher	Universidad Nacional de Colombia
dc.publisher	Bogotá - Ingeniería - Maestría en Ingeniería - Telecomunicaciones
dc.publisher	Departamento de Ingeniería de Sistemas e Industrial
dc.publisher	Facultad de Ingeniería
dc.publisher	Bogotá, Colombia
dc.publisher	Universidad Nacional de Colombia - Sede Bogotá
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	Copyright reserved by the author, 2021
dc.title	Detección de URLs maliciosas por medio de técnicas de aprendizaje automático
dc.type	Master's thesis
