dc.contributorCiferri, Ricardo Rodrigues
dc.contributorhttp://lattes.cnpq.br/8382221522817502
dc.contributorhttp://lattes.cnpq.br/1940393978436664
dc.creatorMatos, Pablo Freire
dc.date.accessioned2010-10-18
dc.date.accessioned2016-06-02T19:05:46Z
dc.date.available2010-10-18
dc.date.available2016-06-02T19:05:46Z
dc.date.created2010-10-18
dc.date.created2016-06-02T19:05:46Z
dc.date.issued2010-09-24
dc.identifierMATOS, Pablo Freire. Metodologia de pré-processamento textual para extração de informação sobre efeitos de doenças em artigos científicos do domínio biomédico. 2010. 161 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2010.
dc.identifierhttps://repositorio.ufscar.br/handle/ufscar/448
dc.description.abstractA large volume of unstructured information (i.e., text) is published in electronic media, particularly in digital libraries, and a human reader can process and assimilate only a limited amount of it over time. This dissertation proposes a textual preprocessing methodology for extracting information about disease effects from papers in the biomedical domain. Its goal is to identify relevant information in a text, then structure and store it in a database, so that interesting relationships among the extracted information can later be discovered. The methodology consists of four steps: Data Entrance (Step 1), Sentence Classification (Step 2), Identification of Relevant Terms (Step 3), and Terms Management (Step 4). It uses three information extraction approaches from the literature: machine learning, dictionary-based, and rule-based. The first is applied in Step 2, where a supervised machine learning algorithm classifies the sentences. The second and third are applied in Step 3, where a dictionary of terms validated by an expert and rules written as regular expressions are used to identify relevant terms in sentences. The methodology was validated by instantiating it for one area of the biomedical domain, specifically papers on Sickle Cell Anemia. Accordingly, two case studies were conducted on both Step 2 and Step 3. In sentence classification, accuracy was above 60% and the F-measure for the negative effect class was above 70%; these values were achieved with the Support Vector Machine algorithm combined with the Noise Removal filter.
In the identification of relevant terms, the F-measure was above 85% for the fictitious extraction (i.e., with manual classification performed by the expert) and above 80% for the actual extraction (i.e., with automatic classification performed by the classifier). A classifier F-measure above 70% together with an actual-extraction F-measure above 80% shows the relevance of sentence classification in the proposed methodology. Notably, without the sentence classification step, many false positives would be identified in full-text papers.
dc.publisherUniversidade Federal de São Carlos
dc.publisherBR
dc.publisherUFSCar
dc.publisherPrograma de Pós-Graduação em Ciência da Computação - PPGCC
dc.rightsOpen Access
dc.subjectDatabases
dc.subjectText mining
dc.subjectScientific papers
dc.subjectTextual preprocessing
dc.subjectInformation extraction
dc.subjectFull papers
dc.subjectBiomedical domain
dc.titleMetodologia de pré-processamento textual para extração de informação sobre efeitos de doenças em artigos científicos do domínio biomédico
dc.typeThesis

