Doctoral Thesis
Text mining, sequence clustering, and data integration for the development of the Plant Defense Mechanisms Database
Date
2008-05-26
Author
Adriano Barbosa da Silva
Institution
Abstract
This work describes the technologies used to develop the Plant Defense Mechanisms Database (PDM), a database of plant defense mechanisms against biotic and abiotic stresses. To this end, we developed the program LAITOR, which identifies in the scientific literature protein terms and names of biotic and abiotic stimuli (bioentities) together with terms indicating a biological action (bioactions), and validates these occurrences only when they appear in the same sentence. The tool NLPROT was used for the initial tagging of bioentities, which were subsequently validated by LAITOR. Next, for protein terms that belong to the NCBI Gene database and have a corresponding record in the UniProtKB database, sequences from other organisms deposited in UniProtKB were clustered; for this purpose we developed the Seed Linkage software. This software exploits multiple direct and indirect links from the sequences of these organisms to the initially determined seed. We found that raw and relative scores of 400 and 0.3, respectively, maximize the inclusion of correct sequences when rebuilding a manually inspected dataset of clusters. After LAITOR identified 780 protein terms in an analysis of 7,306 scientific abstracts, 1,390 unique UniProtKB identifiers were used to cluster 15,669 sequences into the 611 clusters of the PubMed database. We developed a software library, named SRS.php, to retrieve the information on each of these proteins from the SRS server installed at the EMBL using Web Services technology. With this library, a SOAP client accesses the server and retrieves the available data programmatically.
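The sentence-level validation attributed to LAITOR above can be illustrated with a minimal sketch. It is not LAITOR's implementation: the term lists, the naive sentence splitter, and the function names are all assumptions made for illustration, and the initial NLPROT tagging step is replaced here by simple dictionary matching.

```python
import re

# Hypothetical term lists; the dictionaries actually used by LAITOR
# are not given in the abstract.
BIOENTITIES = {"PR1", "NPR1", "salicylic acid", "drought"}
BIOACTIONS = {"induces", "activates", "represses"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_terms(sentence, terms):
    """Return the terms that occur as whole words in the sentence."""
    found = []
    for term in sorted(terms):
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            found.append(term)
    return found


def cooccurrences(text):
    """Keep only sentences holding at least one bioentity AND one bioaction."""
    hits = []
    for sentence in split_sentences(text):
        entities = find_terms(sentence, BIOENTITIES)
        actions = find_terms(sentence, BIOACTIONS)
        if entities and actions:
            hits.append((sentence, entities, actions))
    return hits


if __name__ == "__main__":
    abstract = (
        "Salicylic acid induces PR1 expression. "
        "NPR1 is a protein. Drought was studied."
    )
    for sentence, entities, actions in cooccurrences(abstract):
        print(sentence, entities, actions)
```

Only the first sentence of the sample text is retained, because it is the only one in which a bioentity and a bioaction appear together. A real pipeline would need a more robust sentence splitter, since abbreviations such as "e.g." break this one.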
After the text mining analysis with LAITOR, the sequence clustering with the Seed Linkage software, and the subsequent data acquisition over the SOAP protocol, all of this information was made available on a web server at http://www.biodados.icb.ufmg.br/pdm. On this website, users can search by keyword or run a BLAST-based similarity search. When the retrieved records are displayed, links are provided to the co-occurrences of the protein terms found in the text mining analysis and to the phylogenetic tree of the proteins grouped in each PDM cluster. Furthermore, we implemented the PDM SOAP server, which distributes PDM data through Web Services. We created a method, named query_pdm, through which any record deposited in this database can be accessed using SOAP. In summary, we present a set of methods implemented as software components, or indeed as programs, which can be used in applications similar to PDM and are therefore freely available to the scientific community interested in such techniques.
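To give a rough idea of what a SOAP call to the query_pdm method mentioned above might look like on the wire, the sketch below builds a SOAP 1.1 request envelope using only the Python standard library. Only the method name query_pdm comes from the abstract; the service namespace `urn:pdm-example`, the parameter name `record_id`, and the sample identifier are hypothetical, and the real service description (WSDL) would define the actual names and endpoint.

```python
import xml.etree.ElementTree as ET

# SOAP 1.1 envelope namespace (standard).
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
# Hypothetical service namespace; the real one would come from the PDM WSDL.
PDM_NS = "urn:pdm-example"


def build_query_pdm_request(record_id):
    """Return a SOAP 1.1 envelope (as a string) calling query_pdm."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{PDM_NS}}}query_pdm")
    # "record_id" is an assumed parameter name, used for illustration only.
    arg = ET.SubElement(call, "record_id")
    arg.text = record_id
    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # "PDM0001" is a made-up identifier, not a real PDM record.
    print(build_query_pdm_request("PDM0001"))
```

A client would POST this envelope to the service endpoint with a `Content-Type: text/xml` header and parse the returned envelope in the same way; a SOAP toolkit would normally generate both steps from the WSDL.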