Spärck: Information retrieval system of machine learning good practices for software engineering

Cabra Acela, Laura Helena

dc.contributor	Linares Vásquez, Mario
dc.contributor	Mojica Hanke, Anamaría Irmgard
dc.creator	Cabra Acela, Laura Helena
dc.date.accessioned	2023-01-31T19:12:34Z
dc.date.accessioned	2023-09-07T02:14:23Z
dc.date.available	2023-01-31T19:12:34Z
dc.date.available	2023-09-07T02:14:23Z
dc.date.created	2023-01-31T19:12:34Z
dc.date.issued	2022-12-15
dc.identifier	http://hdl.handle.net/1992/64399
dc.identifier	instname:Universidad de los Andes
dc.identifier	reponame:Repositorio Institucional Séneca
dc.identifier	repourl:https://repositorio.uniandes.edu.co/
dc.identifier.uri	https://repositorioslatinoamericanos.uchile.cl/handle/2250/8729082
dc.description.abstract	In this project, we propose a tool that lets developers search for good machine learning (ML) practices appropriate for the software engineering (SE) assignments they are working on. We expect this tool to make good ML practices easily accessible and to promote their use. To this end, we defined a structure that describes the relationships between stages of the ML pipeline, tasks, and good practices. We then implemented and validated an information retrieval (IR) model over the gathered good practices. Finally, we developed and validated a platform that allows users to search for good practices in ML for SE. This platform includes three main features: (i) a search bar that uses the implemented IR model; (ii) a tool to filter the practices by task; and (iii) an interactive tool that organizes the information by the relationships between stages, tasks, and practices.
dc.language	eng
dc.publisher	Universidad de los Andes
dc.publisher	Ingeniería de Sistemas y Computación
dc.publisher	Facultad de Ingeniería
dc.publisher	Departamento de Ingeniería Sistemas y Computación
dc.relation	M. Alshangiti, H. Sapkota, P. K. Murukannaiah, X. Liu, and Q. Yu. "Why is Developing Machine Learning Applications Challenging? A Study on Stack Overflow Posts". In: 2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 2019, pp. 1-11 (cit. on p. 3)
dc.relation	Saleema Amershi, Andrew Begel, Christian Bird, et al. "Software engineering for machine learning: A case study". In: 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE. 2019, pp. 291-300 (cit. on pp. 3, 9, 23)
dc.relation	AWS. Monitor, detect, and handle model performance degradation (cit. on pp. 26, 27)
dc.relation	Stella Biderman and Walter J Scheirer. "Pitfalls in machine learning research: Reexamining the development cycle". In: (2020) (cit. on p. 3)
dc.relation	Steven Bird, Ewan Klein, and Edward Loper. Natural language processing with Python: analyzing text with the natural language toolkit. O'Reilly Media, Inc., 2009 (cit. on p. 9)
dc.relation	David M Blei, Andrew Y Ng, and Michael I Jordan. "Latent dirichlet allocation". In: Journal of machine Learning research 3.Jan (2003), pp. 993-1022 (cit. on p. 9)
dc.relation	Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, and Gerhard Weikum. "Probabilistic information retrieval approach for ranking of database query results". In: ACM Transactions on Database Systems (TODS) 31.3 (2006), pp. 1134-1168 (cit. on p. 4)
dc.relation	Jai Raj Choudhary. What is model validation. 2020 (cit. on pp. 26, 27)
dc.relation	CloudFactory. The Ultimate Guide to data labeling for machine learning (cit. on pp. 26, 27)
dc.relation	European Commission. HIGH-LEVEL EXPERT GROUP ON ARTIFICIAL INTELLIGENCE. 2019 (cit. on p. 3)
dc.relation	Datagen. Model training. 2022 (cit. on pp. 26, 27)
dc.relation	dewangNautiyal. ML: Underfitting and overfitting. 2022 (cit. on pp. 26, 27)
dc.relation	Duke University. Model maintenance (cit. on pp. 26, 27)
dc.relation	Davide Falessi, Natalia Juristo, Claes Wohlin, et al. "Empirical software engineering experts on the use of students and professionals in experiments". In: Empirical Software Engineering 23.1 (2018), pp. 452-489 (cit. on p. 17)
dc.relation	Robert Feldt, Thomas Zimmermann, Gunnar R Bergersen, et al. "Four commentaries on the use of students and professionals in empirical software engineering experiments". In: Empirical Software Engineering 23.6 (2018), pp. 3801-3820 (cit. on p. 17)
dc.relation	Google. Creating instructions for human labelers (cit. on pp. 26, 27)
dc.relation	Google. Introduction to transforming data (cit. on pp. 26, 27)
dc.relation	Bingbing Jiang, Zhengyu Li, Huanhuan Chen, and Anthony G Cohn. "Latent topic text representation learning on statistical manifold". In: IEEE transactions on neural networks and learning systems 29.11 (2018), pp. 5643-5654 (cit. on p. 8)
dc.relation	Markku Lahtela and Philip (Provenance) Kaplan. What is data labeling. 1966 (cit. on pp. 26, 27)
dc.relation	Seok Won Lee and David C Rine. "Missing requirements and relationship discovery through proxy viewpoints model". In: Proceedings of the 2004 ACM symposium on Applied Computing. 2004, pp. 1513-1518 (cit. on pp. 4, 5)
dc.relation	Michael A. Lones. "How to avoid machine learning pitfalls: a guide for academic researchers". In: CoRR abs/2108.02497 (2021). arXiv: 2108.02497 (cit. on pp. 3, 23)
dc.relation	Lotame. What are the methods of data collection?: How to collect data. 2022 (cit. on pp. 26, 27)
dc.relation	Andrea De Lucia, Fausto Fasano, Rocco Oliveto, and Genoveffa Tortora. "Recovering traceability links in software artifact management systems using information retrieval methods". In: ACM Transactions on Software Engineering and Methodology (TOSEM) 16.4 (2007), 13 es (cit. on pp. 4, 5)
dc.relation	Anamaria Mojica-Hanke, Andrea Bayona, Mario Linares-Vásquez, Steffen Herbold, and Fabio A. González. What are the Machine Learning best practices reported by practitioners on Stack Exchange? (cit. on pp. 4, 9)
dc.relation	Nicolás Munar González and Nicolás Tobo Urrutia. "Software best practices for machine learning." In: 2022 (cit. on p. 4)
dc.relation	Google PAIR. People + AI Guidebook. 2021 (cit. on pp. 3, 4, 9)
dc.relation	Harshil Patel. What is feature engineering-importance, tools and techniques for machine learning. 2021 (cit. on pp. 26, 27)
dc.relation	Martin F Porter. "An algorithm for suffix stripping". In: Program (1980) (cit. on p. 9)
dc.relation	Stephen Robertson, Hugo Zaragoza, et al. "The probabilistic relevance framework: BM25 and beyond". In: Foundations and Trends® in Information Retrieval 3.4 (2009), pp. 333-389 (cit. on p. 9)
dc.relation	Gerard Salton, Anita Wong, and Chung-Shu Yang. "A vector space model for automatic indexing". In: Communications of the ACM 18.11 (1975), pp. 613-620 (cit. on pp. 8, 9)
dc.relation	Alex Serban, Koen van der Blom, Holger Hoos, and Joost Visser. "Adoption and effects of software engineering best practices in machine learning". In: Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). 2020, pp. 1-12 (cit. on pp. 3, 23)
dc.relation	Deval Shah. The Essential Guide to data augmentation in Deep Learning (cit. on pp. 26, 27)
dc.relation	Eric J Stierna and Neil C Rowe. "Applying information-retrieval methods to software reuse: a case study". In: Information processing & management 39.1 (2003), pp. 67-74 (cit. on pp. 4, 5)
dc.relation	SuperAnnotate. The Ultimate Guide to Data Labeling: How to label data for ML (cit. on pp. 26, 27)
dc.relation	Tableau. Guide to data cleaning: Definition, benefits, components, and how to clean your data (cit. on pp. 26, 27)
dc.relation	Talend. What is data profiling? data profiling tools and examples (cit. on pp. 26, 27)
dc.relation	CFI Team. Data Anonymization. 2022 (cit. on pp. 26, 27)
dc.relation	Michail Vlachos. "Dimensionality Reduction". In: Encyclopedia of Machine Learning. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, 2010, pp. 274-279 (cit. on pp. 26, 27)
dc.relation	Kathleen Walch. How to build a machine learning model in 7 steps: TechTarget. 2021 (cit. on pp. 26, 27)
dc.relation	David Weedmark. A 4-step guide to machine learning model deployment. 2022 (cit. on pp. 26, 27)
dc.relation	Brett Wujek, Patrick Hall, and Funda Gunes. "Best practices for machine learning applications". In: SAS Institute Inc (2016) (cit. on p. 3)
dc.relation	Haining Yao, Letha H Etzkorn, and Shamsnaz Virani. "Automated classification and retrieval of reusable software components". In: Journal of the American society for information science and technology 59.4 (2008), pp. 613-627 (cit. on pp. 4, 5)
dc.relation	Martin Zinkevich. Rules of machine learning: Best Practices for ML Engineering. 2021 (cit. on p. 3)
dc.rights	Attribution-ShareAlike 4.0 International
dc.rights	http://creativecommons.org/licenses/by-sa/4.0/
dc.rights	info:eu-repo/semantics/openAccess
dc.rights	http://purl.org/coar/access_right/c_abf2
dc.title	Spärck: Information retrieval system of machine learning good practices for software engineering
dc.type	Undergraduate thesis

This item belongs to the following institution

Universidad de los Andes (Colombia)