Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico

Duque, Juliana Lilian

Tesis

Fecha

2012-02-24

Registro en:

DUQUE, Juliana Lilian. Um processo baseado em parágrafos para a extração de tratamentos de artigos científicos do domínio biomédico. 2012. 124 f. Dissertação (Mestrado em Ciências Exatas e da Terra) - Universidade Federal de São Carlos, São Carlos, 2012.

https://repositorio.ufscar.br/handle/ufscar/496

Autor

Duque, Juliana Lilian

Institución

Universidade Federal de São Carlos (Brasil)

Resumen

Currently in the medical field there is a large amount of unstructured information (i.e., in textual format). Regarding the large volume of data, it makes it impossible for doctors and specialists to analyze manually all the relevant literature, which requires techniques for automatically analyze the documents. In order to identify relevant information, as well as to structure and store them into a database and to enable future discovery of significant relationships, in this paper we propose a paragraph-based process to extract treatments from scientific papers in the biomedical domain. The hypothesis is that the initial search for sentences that have terms of complication improves the identification and extraction of terms of treatment. This happens because treatments mainly occur in the same sentence of a complication, or in nearby sentences in the same paragraph. Our methodology employs three approaches for information extraction: machine learning-based approach, for classifying sentences of interest that will have terms to be extracted; dictionary-based approach, which uses terms validated by an expert in the field; and rule-based approach. The methodology was validated as proof of concept, using papers from the biomedical domain, specifically, papers related to Sickle Cell Anemia disease. The proof of concept was performed in the classification of sentences and identification of relevant terms. The value obtained in the classification accuracy of sentences was 79% for the classifier of complication and 71% for the classifier of treatment. These values are consistent with the results obtained from the combination of the machine learning algorithm Support Vector Machine with the filter Noise Removal and Balancing of Classes. In the identification of relevant terms, the results of our methodology showed higher F-measure percentage (42%) compared to the manual classification (31%) and to the partial process, i.e., without using the classifier of complication (36%). Even with low percentage of recall, there was no impact observed on the extraction process, and, in addition, we were able to validate the hypothesis considered in this work. In other words, it was possible to obtain 100% of recall for different terms, thus not impacting the extraction process, and further the working hypothesis of this study was proven.