An automatic approach to generate corpus in Spanish
Date
2018
Registered in:
Communications in Computer and Information Science; Vol. 885, pp. 150-161
9783319989976
18650929
10.1007/978-3-319-98998-3_12
Universidad Tecnológica de Bolívar
Repositorio UTB
57202285682
8738428200
57194828933
57203852380
Authors
Puertas E.
Alvarado‑Valencia, Jorge Andres
Moreno-Sandoval L.G.
Pomares-Quimbaya A.
Abstract
A corpus is an indispensable linguistic resource for any natural language processing application. Some corpora have been created manually or semi-automatically for a specific domain. In this paper, we present an automatic approach to generating a corpus from digital information sources such as Wikipedia and web pages. Information is extracted from Wikipedia by delimiting the domain: a propagation algorithm determines the categories associated with a domain region, and a set of seeds delimits the search. Information is extracted from web pages efficiently by determining the patterns associated with the structure of each page, with the purpose of defining the quality of the extraction. © Springer Nature Switzerland AG 2018.
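The abstract does not specify the propagation algorithm, so the following is only a minimal sketch of one plausible reading: a depth-bounded breadth-first propagation from seed categories over a category graph. The function name `propagate_categories`, the `max_depth` parameter, and the toy graph are all hypothetical and are not taken from the paper; a real run would read the category links from Wikipedia rather than from an in-memory dictionary.

```python
from collections import deque


def propagate_categories(subcategories, seeds, max_depth):
    """Return {category: depth} for categories reachable from the seeds.

    Assumption: "propagation" is modelled here as a plain breadth-first
    walk over subcategory links, cut off at max_depth to delimit the domain.
    """
    reached = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        depth = reached[current]
        if depth == max_depth:
            continue  # do not propagate beyond the domain boundary
        for child in subcategories.get(current, ()):
            if child not in reached:
                reached[child] = depth + 1
                queue.append(child)
    return reached


# Hypothetical toy category graph standing in for Wikipedia's category tree.
toy_graph = {
    "Medicine": ["Cardiology", "Oncology"],
    "Cardiology": ["Heart diseases"],
    "Oncology": ["Cancer treatments"],
    "Heart diseases": ["Arrhythmia"],
}

domain = propagate_categories(toy_graph, ["Medicine"], max_depth=2)
```

With this toy graph and a depth limit of 2, "Arrhythmia" lies three links from the seed and is therefore left out of the domain, which illustrates how the seed set and the depth bound together delimit the search.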