Metodologia de pré-processamento textual para extração de informação sobre efeitos de doenças em artigos científicos do domínio biomédico

Matos, Pablo Freire

dc.contributor	Ciferri, Ricardo Rodrigues
dc.contributor	http://lattes.cnpq.br/8382221522817502
dc.contributor	http://lattes.cnpq.br/1940393978436664
dc.creator	Matos, Pablo Freire
dc.date.accessioned	2010-10-18
dc.date.accessioned	2016-06-02T19:05:46Z
dc.date.available	2010-10-18
dc.date.available	2016-06-02T19:05:46Z
dc.date.created	2010-10-18
dc.date.created	2016-06-02T19:05:46Z
dc.date.issued	2010-09-24
dc.identifier	MATOS, Pablo Freire. Metodologia de pré-processamento textual para extração de informação sobre efeitos de doenças em artigos científicos do domínio biomédico. 2010. 161 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2010.
dc.identifier	https://repositorio.ufscar.br/handle/ufscar/448
dc.description.abstract	A large volume of unstructured information (i.e., in text format) is published in electronic media, particularly in digital libraries. As a result, a person can process and assimilate only a limited amount of that text over time. This dissertation proposes a textual preprocessing methodology for extracting information about disease effects from papers in the biomedical domain. Its goal is to identify relevant information in a text, then structure and store it in a database to support the future discovery of interesting relationships among the extracted information. The methodology consists of four steps: Data Entrance (Step 1), Sentence Classification (Step 2), Identification of Relevant Terms (Step 3), and Terms Management (Step 4). It uses three information extraction approaches from the literature: machine learning, dictionary-based, and rule-based. The first is applied in Step 2, in which a supervised machine learning algorithm classifies the sentences. The second and third are applied in Step 3, in which a dictionary of terms validated by an expert and rules written as regular expressions are used to identify relevant terms in sentences. The methodology was validated by instantiating it in one area of the biomedical domain, specifically using papers on Sickle Cell Anemia. Accordingly, two case studies were conducted in both Step 2 and Step 3. The accuracy obtained in sentence classification was above 60%, and the F-measure for the negative effect class was above 70%. These values correspond to the results achieved with the Support Vector Machine algorithm combined with the Noise Removal filter. The F-measure obtained in the identification of relevant terms was above 85% for the fictitious extraction (i.e., with manual classification performed by the expert) and above 80% for the actual extraction (i.e., with automatic classification performed by the classifier). The classifier's F-measure above 70% and the actual extraction's F-measure above 80% show the relevance of sentence classification in the proposed methodology. It is important to note that, without the sentence classification step, many false positives would be identified in full-text papers.
dc.publisher	Universidade Federal de São Carlos
dc.publisher	BR
dc.publisher	UFSCar
dc.publisher	Programa de Pós-Graduação em Ciência da Computação - PPGCC
dc.rights	Open Access
dc.subject	Databases
dc.subject	Text mining
dc.subject	Scientific papers
dc.subject	Textual preprocessing
dc.subject	Information extraction
dc.subject	Full papers
dc.subject	Biomedical domain
dc.title	Metodologia de pré-processamento textual para extração de informação sobre efeitos de doenças em artigos científicos do domínio biomédico
dc.type	Thesis

This item belongs to the following institution

Universidade Federal de São Carlos (Brazil)
