Design of a strategy for cleaning and standardizing postal addresses using LSTM recurrent neural networks

Ceballos Gallego, Santiago

dc.contributor	González, Fabio Augusto
dc.contributor	Mindlab
dc.creator	Ceballos Gallego, Santiago
dc.date.accessioned	2021-01-26T15:23:07Z
dc.date.available	2021-01-26T15:23:07Z
dc.date.created	2021-01-26T15:23:07Z
dc.date.issued	2020-12-09
dc.identifier	Ceballos, S. (2020). Diseño de una estrategia de limpieza y estandarización de direcciones postales a través de redes neurales recurrentes tipo LSTM [Tesis de maestría, Universidad Nacional de Colombia]. Repositorio Institucional.
dc.identifier	https://repositorio.unal.edu.co/handle/unal/78923
dc.description.abstract	Geographic addresses are among the most common elements in the databases of many types of organizations. However, these addresses are often recorded manually and without a reference format, which leads to multiple representations of the elements that make up an address. As a result, the record is usually unusable for automatic geolocation, an area of growing relevance in the main sectors of the economy. This document proposes a methodology for cleaning and standardizing geographic addresses, based on LSTM recurrent neural networks, as a solution to this problem. The methodology includes a strategy for generating a synthetic data set for training the network, composed of unstructured addresses and their equivalents in standard format. Model performance is measured on two different data sets. The first contains 10000 dirty synthetic addresses and their clean equivalents, against which the generated address is compared using the Jaccard, Jaro, and Levenshtein indexes as similarity measures. The second contains 5000 real addresses of commercial establishments in the three main cities of Colombia, for which the exact geolocation is known; this true location is compared with the one obtained by geolocating the address produced by the standardization process. Applying this strategy yields a significant improvement both in the accuracy of the resulting standard format and in the geolocation of the resulting address, compared with the two baseline models most widely used in this field: the model based on cleaning rules and the model based on hidden Markov models.
Finally, applications of the methodology are shown for cleaning and geolocating addresses taken from a real database, in areas such as sales force optimization, customer service, and digital marketing.
dc.description.abstract	Postal addresses are one of the most common elements in organizations' databases. However, these addresses are usually recorded manually and without following any standard format, which may result in multiple representations of the items in an address (e.g., street, avenue, apartment number) and therefore hinders efforts to extract value from those records. In this document we propose a cleansing and standardization methodology for postal addresses based on Long Short-Term Memory (LSTM) neural networks. It includes a strategy for generating synthetic records used for training, composed of unstructured addresses and their equivalents in standard format. We measure model performance using two different data sets. The first data set contains up to 10000 new synthetic non-standard addresses with their clean equivalents, which are compared with the model output using the Jaro, Jaccard, and Levenshtein indexes as similarity measures. The second data set contains 5000 real (anonymized) addresses of commercial establishments located in three main cities in Colombia, together with their real locations, which are compared against the geolocation obtained from the clean address produced by the model. The proposed methodology is shown to significantly improve both the accuracy of the resulting string relative to the expected standard format and the geolocation obtained, when compared with the main strategies used for this purpose: rule-based models and hidden Markov models. We also present some real applications of the proposed strategy in diverse areas such as sales route optimization, digital marketing, and customer service.
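The evaluation described in the abstract relies on string similarity between a generated address and its clean reference, and on the gap between a true location and a geocoded one. The following is a minimal sketch of two of the named measures (Levenshtein and Jaccard) plus a great-circle distance, not the thesis implementation. The function names, the normalization of Levenshtein distance by the longer string, the whitespace tokenization for Jaccard, and the use of the haversine formula for the location comparison are all assumptions made for illustration; the sample addresses are invented.

```python
import math


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over whitespace-separated tokens (assumed tokenization)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km, assuming a spherical Earth of radius 6371 km."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


if __name__ == "__main__":
    # Hypothetical model output and reference address, for illustration only.
    predicted = "CL 10 15 21"
    reference = "CL 10 15 20"
    print(levenshtein_similarity(predicted, reference))
    print(jaccard_similarity(predicted, reference))
    print(haversine_km(4.60, -74.08, 4.61, -74.08))
```

A character-level measure such as Levenshtein tolerates small typos, while a token-level Jaccard index penalizes any changed component as a whole, so the two can disagree on the same pair of addresses.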
dc.language	spa
dc.publisher	Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación
dc.publisher	Universidad Nacional de Colombia - Sede Bogotá
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International
dc.rights	Open access
dc.rights	http://creativecommons.org/licenses/by-nc-nd/4.0/
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	All rights reserved - Universidad Nacional de Colombia
dc.title	Design of a strategy for cleaning and standardizing postal addresses using LSTM recurrent neural networks
dc.type	Other
