Thesis
Data integration for evaluating the annotation quality of protein-coding genes in eukaryotes
Author
Juliana Assis Geraldo
Institution
Abstract
Whole-genome sequencing studies are becoming common given the low cost of currently available sequencing technologies. As a consequence, the number of genome projects is increasing rapidly, and complete genomes are now available for a wide variety of species. Because of the volume of newly sequenced genomes, several software tools and strategies have been developed to evaluate genome assembly quality. Even with a high-quality assembly, however, obtaining a good genome annotation remains a challenge. One of the most pressing demands is the evaluation of whole-genome annotation quality. This evaluation is still often performed manually, which is costly, especially for large and complex genomes. The present study aimed to understand the challenges of structurally annotating protein-coding genes in complete genomes of eukaryotic organisms, and to develop a new method, based on synteny of orthologs and integration of multi-omics data, that automatically evaluates the quality of the annotations generated, thus reducing the time spent on manual curation of protein-coding genes. To this end, the work required protein-coding gene annotations from the whole genomes of the following eukaryotic organisms: Panthera onca (a mammal), Plasmodium coatneyi and Plasmodium knowlesi (parasites with small genomes), and Schistosoma mansoni (a parasite with a medium-sized genome and highly complex gene structure). These genomes have different characteristics and were chosen to represent the diversity of annotation processes. During the annotation process, it was possible to identify the kinds of annotation errors that can be detected automatically. In this context, a platform was developed for the automatic evaluation of the quality of protein-coding gene annotations.
The platform detects errors by integrating multi-omics data with synteny information from orthologous genes of closely related species and with information on the structure of the gene annotation. The program contains three modules: (1) Synteny of Orthologs, (2) Structural, and (3) Transcriptional. Genes in which possible errors are detected receive a low score, while reliable genes are assigned a higher score. The resulting output file can be loaded directly into programs such as WebApollo and Artemis so that manual curation can focus on the low-scoring genes, reducing the time spent on manual annotation curation. For the genomes studied, the need for manual curation of protein-coding genes was reduced by 58%.
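The scoring idea described above can be illustrated with a minimal sketch. This is not the platform's actual implementation: the abstract does not state how the three modules are combined, so the per-module scores in the range 0 to 1, the unweighted mean, the 0.5 threshold, and all names (`GeneEvidence`, `gene_score`, `needs_curation`) are assumptions made only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GeneEvidence:
    """Hypothetical per-gene evidence, one value per module (each assumed in [0, 1])."""
    gene_id: str
    synteny: float          # support from orthologs in conserved gene order
    structural: float       # plausibility of the gene model structure
    transcriptional: float  # support from transcript evidence


def gene_score(ev: GeneEvidence) -> float:
    """Combine the three module scores; the unweighted mean is an assumption."""
    return (ev.synteny + ev.structural + ev.transcriptional) / 3.0


def needs_curation(genes, threshold=0.5):
    """Return IDs of genes scoring below the (assumed) threshold."""
    return [g.gene_id for g in genes if gene_score(g) < threshold]


# Invented example values, not data from the thesis.
genes = [
    GeneEvidence("gene_A", 1.0, 1.0, 1.0),  # supported by all three modules
    GeneEvidence("gene_B", 0.0, 0.5, 0.0),  # weak support, likely annotation error
    GeneEvidence("gene_C", 1.0, 0.5, 0.5),  # mostly supported
]
flagged = needs_curation(genes)
print(flagged)
```

In this sketch only the low-scoring gene is passed on for manual review, which mirrors how a score-based filter could reduce the curation workload.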