dc.contributor | González, Fabio Augusto | |
dc.contributor | Mindlab | |
dc.creator | Ceballos Gallego, Santiago | |
dc.date.accessioned | 2021-01-26T15:23:07Z | |
dc.date.available | 2021-01-26T15:23:07Z | |
dc.date.created | 2021-01-26T15:23:07Z | |
dc.date.issued | 2020-12-09 | |
dc.identifier | Ceballos, S. (2020). Diseño de una estrategia de limpieza y estandarización de direcciones postales a través de redes neurales recurrentes tipo LSTM [Tesis de maestría, Universidad Nacional de Colombia]. Repositorio Institucional. | |
dc.identifier | https://repositorio.unal.edu.co/handle/unal/78923 | |
dc.description.abstract | Las direcciones geográficas son uno de los elementos más comunes en las bases de datos de diferentes tipos de organizaciones. Sin embargo, el registro de dichas direcciones se realiza, a menudo, de forma manual y sin un formato de referencia, lo que da lugar a múltiples representaciones de los elementos que componen la dirección. Esto, a su vez, genera que el registro sea usualmente inutilizable para fines de geolocalización automática, un área cada vez más relevante en los principales sectores de la economía.
En el presente documento se propone una metodología para la limpieza y estandarización de direcciones geográficas, basada en redes neuronales recurrentes tipo LSTM, como solución a este problema. Dicha metodología incluye la estrategia de generación de un conjunto de datos sintético, para el entrenamiento de la red, que está compuesto por direcciones no estructuradas y las direcciones equivalentes en formato estándar. El desempeño del modelo se mide en dos conjuntos de datos diferentes: el primero contiene 10000 direcciones sintéticas sucias y su equivalente limpio, contra el cual se compara la dirección generada utilizando los índices de Jaccard, Jaro y Levenshtein como medidas de similitud; el segundo contiene 5000 direcciones reales de establecimientos comerciales en las tres principales ciudades de Colombia, para los cuales se cuenta con la geolocalización exacta. Esta ubicación real se compara con la obtenida tras geolocalizar la dirección resultante del proceso de estandarización.
Al aplicar esta estrategia, se evidencia una mejora significativa tanto en la precisión del formato estándar obtenido, como en la geolocalización de la dirección resultante, cuando se compara contra los dos modelos base más utilizados en este campo: el modelo basado en reglas de limpieza y el modelo basado en cadenas de Markov ocultas.
Por último, se muestran aplicaciones de la metodología para limpieza y geolocalización de direcciones tomadas de una base de datos real, en ámbitos como la optimización de fuerza de ventas, la atención al cliente y el mercadeo digital. | |
dc.description.abstract | Postal addresses are among the most common elements in organizations’ databases. However, these addresses are usually recorded manually and without a standard format, which results in multiple representations of the address components (e.g., street, avenue, apartment number) and hinders efforts to extract value from those records. In this document we propose a cleansing and standardization methodology for postal addresses based on Long Short-Term Memory (LSTM) neural networks. It includes a strategy for generating the synthetic records used for training, which consist of unstructured addresses and their equivalents in standard format. We measure model performance on two different data sets. The first contains up to 10000 new synthetic non-standard addresses with their clean equivalents, against which the model output is compared using the Jaro, Jaccard and Levenshtein indexes as similarity measures. The second contains 5000 real (anonymized) addresses of commercial establishments located in the three main cities of Colombia, together with their real locations, which are compared against the geolocation obtained from the clean address produced by the model. The proposed methodology yields a significant improvement both in the accuracy of the output string relative to the expected standard format and in the resulting geolocation, when compared with the main strategies used for this purpose: rule-based models and hidden Markov models. We also present real applications of the proposed strategy in areas such as sales route optimization, digital marketing and customer service. | |
dc.language | spa | |
dc.publisher | Bogotá - Ingeniería - Maestría en Ingeniería - Ingeniería de Sistemas y Computación | |
dc.publisher | Universidad Nacional de Colombia - Sede Bogotá | |
dc.relation | V. Borkar, K. Deshmukh, and S. Sarawagi, “Automatic segmentation of text into structured records,” ACM SIGMOD Record, vol. 30, no. 2, pp. 175–186, 2001. | |
dc.relation | D. Küçük Matci and U. Avdan, “Address standardization using the natural language process for improving geocoding results,” Computers, Environment and Urban Systems, vol. 70, no. February, pp. 1–8, 2018. | |
dc.relation | S. Sharma, R. Ratti, I. Arora, A. Solanki, and G. Bhatt, “Automated Parsing of Geographical Addresses: A Multilayer Feedforward Neural Network based approach,” in IEEE International Conference on Semantic Computing, pp. 123–130, 2018. | |
dc.relation | O. F. I. Pachón Quevedo and S. I. Tellez, “Propuesta de Estándar de las Direcciones Urbanas para los Equipamientos del Ministerio de Educación,” p. 42, 2009. | |
dc.relation | D. W. Goldberg, J. N. Swift, and J. P. Wilson, “Address Standardization,” Tech. Rep. 12, 2017. | |
dc.relation | K. Malik, Muhammad Noman; Abdul, “Address Standardization using Supervised Machine Learning,” in 2011 International Conference on Computer Communication and Management, no. November, 2015 | |
dc.relation | V. Borkar, K. Deshmukh, and S. Sarawagi, “Automatically extracting structure from free text addresses,” IEEE Data Eng. Bull., vol. 23, no. 4, pp. 27–32, 2000. | |
dc.relation | I. Mulasastra and A. Taplaksint, “Elementization of Thai Postal Addresses : A Hybrid Approach,” in 2015 IEEE International Conference on Electrical and Computer Engineering (WIECON-ECE), 2015. | |
dc.relation | G. Kothari, T. A. Faruquie, L. V. Subramaniam, K. H. Prasad, and M. K. Mohania, “Transfer of supervision for improved address standardization,” Proceedings - International Conference on Pattern Recognition, pp. 2178–2181, 2010. | |
dc.relation | D. Küçük Matci and U. Avdan, “Address standardization using the natural language process for improving geocoding results,” Computers, Environment and Urban Systems, vol. 70, no. January 2017, pp. 1–8, 2018. | |
dc.relation | M. N. Masrek and Z. A. Razak, “Malaysian address semantic: The process of standardization,” 2nd International Conference on Computer Research and Development, ICCRD 2010, pp. 77–80, 2010. | |
dc.relation | G. K. Tanveer, A. F. L. Venkata, S. K. Hima, and P. Mukesh, “Transfer of supervision for improved address standardization,” in 2010 International Conference on Pattern Recognition, pp. 2182–2185, 2010. | |
dc.relation | Informatica, “Address Validation Best Practices for Interpreting and Analyzing Address Data Quality Results,” 2013. | |
dc.relation | Runner enterprise Data Quality, “ADDRESS DATA CLEANSING: A BETTER APPROACH,” 2017. | |
dc.relation | R. A. Abbasi, “Information Extraction Techniques for Postal Address Standardization,” Faculty of Computing - Riphah International University, 2005. | |
dc.relation | C. Lin, K. Choy, G. Ho, S. Chung, and H. Lam, “Survey of Green Vehicle Routing Problem: Past and future trends,” Expert Systems with Applications, vol. 41, pp. 1118–1138, mar 2014. | |
dc.relation | H. Jafari, “e-Commerce Logistics – Contemporary Literature,” 2018 IEEE International Conference on Industrial Engineering and Engineering Management (IEEM), pp. 1196–1200, 2018. | |
dc.relation | P. Christen, T. Churches, and A. Willmore, “A probabilistic geocoding system based on a national address file,” Proceedings of the 3rd Australasian Data Mining Conference, Cairns, 2004. | |
dc.relation | P. Rogerson, D. Han, J. L. Freudenheim, J. E. Vena, M. R. Bonner, and J. Nie, “Positional Accuracy of Geocoded Addresses in Epidemiologic Research,” Epidemiology, vol. 14, no. 4, pp. 408–412, 2004. | |
dc.relation | S. A. Collier, L. J. Stockman, L. A. Hicks, L. E. Garrison, F. J. Zhou, and M. J. Beach, “Direct healthcare costs of selected diseases primarily or partially transmitted by water,” Epidemiology and Infection, vol. 140, pp. 2003–13, nov 2012. | |
dc.relation | M. R. Cayo and T. O. Talbot, “Positional error in automated geocoding of residential addresses,” International Journal of Health Geographics, vol. 2, pp. 1–12, 2003. | |
dc.relation | C. A. Davis and F. T. Fonseca, “Assessing the certainty of locations produced by an address geocoding system,” GeoInformatica, vol. 11, no. 1, pp. 103–129, 2007. | |
dc.relation | J. H. Ratcliffe, “Geocoding crime and a first estimate of a minimum acceptable hit rate,” International Journal of Geographical Information Science, vol. 18, pp. 61–72, jan 2004. | |
dc.relation | D. P. Johnson, A. Stanforth, V. Lulla, and G. Luber, “Developing an applied extreme heat vulnerability index utilizing socioeconomic and environmental data,” Applied Geography, vol. 35, pp. 23–31, nov 2012. | |
dc.relation | SmartyStreets, “USPS & International Address Verification - SmartyStreets.” | |
dc.relation | egon: Address Quality, “EGON - Company informations,” 2019. | |
dc.relation | EXPERIAN, “Address validation from Experian QAS,” 2018. | |
dc.relation | M. Wang, V. Haberland, A. Yeo, A. Martin, J. Howroyd, and J. M. Bishop, “A Probabilistic Address Parser Using Conditional Random Fields and Stochastic Regular Grammar,” IEEE International Conference on Data Mining Workshops, ICDMW, pp. 225–232, 2017. | |
dc.relation | R. G. Crowder, Principles of Learning and Memory: Classic Edition, vol. 2014. 2014. | |
dc.relation | N. Reimers and I. Gurevych, “Reporting Score Distributions Makes a Difference: Performance Study of LSTM-networks for Sequence Tagging,” in Ubiquitous Knowledge Processing Lab (UKP-DIPF), 2017. | |
dc.relation | X. Ma and E. Hovy, “End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF,” in Language Technologies Institute, 2016. | |
dc.relation | B. Xiang and G. Kurata, “Leveraging Sentence-level Information with Encoder LSTM for Semantic Slot Filling,” in IBM Research, 2016. | |
dc.relation | B. Liu and I. Lane, “Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling.” | |
dc.relation | J. P. C. Chiu and E. Nichols, “Named Entity Recognition with Bidirectional LSTM-CNNs,” in University of British Columbia; Honda Research Institute Japan CO, no. 2003, 2014. | |
dc.relation | F. Xu, G. Yi, W. Qi, and F. Zhen, “Research on Automatic Summary of Chinese Short Text Based on LSTM and Keywords Correction,” in Tenth International Conference on Advanced Computational Intelligence (ICACI), no. 17, pp. 467–472, 2018. | |
dc.relation | S. Pascual and A. Bonafonte, “Multi-output RNN-LSTM for multiple speaker speech synthesis and adaptation,” in European Signal Processing Conference (EUSIPCO), pp. 2325–2329, 2016. | |
dc.relation | D. Wei, B. Wang, G. Lin, D. Liu, Z. Dong, H. Liu, and Y. Liu, “Research on Unstructured Text Data Mining and Fault Classification Based on RNN-LSTM with Malfunction Inspection Report,” Energies, vol. 10, no. 406, 2017. | |
dc.relation | K. Yao, B. Peng, Y. Zhang, D. Yu, G. Zweig, and Y. Shi, “SPOKEN LANGUAGE UNDERSTANDING USING LONG SHORT-TERM MEMORY NEURAL NETWORKS,” in Microsoft, pp. 189–194, 2014. | |
dc.relation | O. Morillot, L. Likforman-Sulem, and E. Grosicki, “New baseline correction algorithm for text-line recognition with bidirectional recurrent neural networks,” Journal of Electronic Imaging, vol. 22, no. 2, p. 023028, 2013. | |
dc.relation | M.-T. Luong, H. Pham, and C. D. Manning, “Effective Approaches to Attention-based Neural Machine Translation,” 2015. | |
dc.relation | T. Chen, R. Xu, Y. He, and X. Wang, “Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN,” Expert Systems With Applications, vol. 72, pp. 221–230, 2017. | |
dc.relation | I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in Neural Information Processing Systems (NIPS), pp. 3104–3112, 2014. | |
dc.relation | G. Lewis, “Sentence Correction using Recurrent Neural Networks,” pp. 1–7, 2015. | |
dc.relation | J. Martens, “Generating Text with Recurrent Neural Networks,” Neural Networks, vol. 131, no. 1, pp. 1017–1024, 2011. | |
dc.relation | J. Li, K. Ouazzane, H. B. Kazemian, and M. S. Afzal, “Neural network approaches for noisy language modeling,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 11, pp. 1773–1784, 2013. | |
dc.relation | S. Zhu and K. Yu, “ENCODER-DECODER WITH FOCUS-MECHANISM FOR SEQUENCE LABELLING BASED SPOKEN LANGUAGE UNDERSTANDING,” Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering SpeechLab, Department of Computer Scie, pp. 5675–5679, 2017. | |
dc.relation | F. Liu, T. M. Hospedales, W. Yang, and C. Sun, “Semantic Regularisation for Recurrent Image Annotation,” in Computer Vision Foundation, 2016. | |
dc.relation | L. Liu, J. Shang, X. Ren, F. F. Xu, H. Gui, J. Peng, and J. Han, “Empower Sequence Labeling with Task-Aware Neural Language Model,” 2017. | |
dc.relation | E. Alpaydin, Introduction to Machine Learning Second Edition, vol. 1107. 2010. | |
dc.relation | D. P. Mandic and J. A. Chambers, Recurrent Neural Networks for Prediction. John Wiley & Sons, Ltd, aug 2001. | |
dc.relation | B. V. Merri, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” 2013. | |
dc.relation | S. Hochreiter and J. Schmidhuber, “LONG SHORT-TERM MEMORY,” Neural Computation, vol. 9, no. 8, pp. 1–32, 1997. | |
dc.relation | Google Inc, “Google Maps Platform.” | |
dc.relation | OpenStreetMap, “Researcher Information OpenStreetMap,” 2017. | |
dc.relation | W. Cohen, P. Ravikumar, and S. Fienberg, “A Comparison of String Distance Metrics for Name-Matching Tasks,” Software: Practice and Experience, vol. 12, no. 1, pp. 57–66, 2003. | |
dc.relation | P. Achananuparp, X. Hu, and X. Shen, “The evaluation of sentence similarity measures,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5182 LNCS, pp. 305–316, 2008. | |
dc.relation | T. Kohonen and P. Somervuo, “Self-organizing maps of symbol strings,” Neurocomputing, vol. 21, no. 1-3, pp. 19–30, 1998. | |
dc.relation | S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, “Using of Jaccard coefficient for keywords similarity,” Lecture Notes in Engineering and Computer Science, vol. 2202, no. May 2017, pp. 380–384, 2013. | |
dc.relation | Knime.org | Open for innovation, “KNIME Analytics Platform,” 2015. | |
dc.relation | R. Hughey and A. Krogh, “Hidden Markov models for sequence analysis: Extension and analysis of the basic method,” Bioinformatics, vol. 12, no. 2, pp. 95–107, 1996. | |
dc.relation | Superintendencia de Industria y Comercio, “Estudio económico del sector Retail en Colombia (2010-2012),” 2012. | |
dc.relation | Departamento Administrativo Nacional de Estadística (DANE), “Censo Nacional de Población y Vivienda 2018,” 2018. | |
dc.relation | F. Ricci, L. Rokach, B. Shapira, and P. Kantor, Recommender Systems Handbook. 2011. | |
dc.relation | Tienda Registrada| Sabemos de Tiendas, “Noticias de la Tienda. Para la industria del consumo masivo,” Tech. Rep. 48, Medellín, 2019. | |
dc.relation | P. Jariha and S. K. Jain, “A state-of-the-art Recommender Systems: An overview on Concepts, Methodology and Challenges,” Proceedings of the International Conference on Inventive Communication and Computational Technologies, ICICCT 2018, no. Icicct, pp. 1769–1774, 2018. | |
dc.relation | S. van de Sanden, K. Willems, and M. Brengman, “In-store location-based marketing with beacons: from inflated expectations to smart use in retailing,” Journal of Marketing Management, vol. 35, no. 15-16, pp. 1514–1541, 2019. | |
dc.rights | Atribución-NoComercial-SinDerivadas 4.0 Internacional | |
dc.rights | Acceso abierto | |
dc.rights | http://creativecommons.org/licenses/by-nc-nd/4.0/ | |
dc.rights | info:eu-repo/semantics/openAccess | |
dc.rights | Derechos reservados - Universidad Nacional de Colombia | |
dc.title | Diseño de una estrategia de limpieza y estandarización de direcciones postales a través de redes neurales recurrentes tipo LSTM | |
dc.type | Otro | |