Master's thesis
Breast cancer diagnosis and prognosis improvement based on a complete gene expression profile and ancestry
Date
2023-06-06
Registered in:
instname:Universidad de los Andes
reponame:Repositorio Institucional Séneca
Author
Stepanian Rozo, Johanna
Institution
Abstract
Breast cancer (BC) is a leading cause of death among women worldwide. Accurate subtyping of BC based on gene expression is crucial for optimizing treatment strategies. This study aims to improve BC subtyping by integrating comprehensive gene expression profiles with ancestry information.
We acquired and analyzed a dataset of 406 RNA-Seq samples sequenced from patients with diverse ancestries, geographical origins, and ethnicities, combining 318 publicly available samples with 88 new samples.
RNA-Seq data from breast tumors were used to genotype 10,397 SNPs, and ancestries were predicted with the ADMIXTURE software by combining the RNA-Seq genotype calls with calls from the 1000 Genomes Project. We ran the genefu R package to predict PAM50 subtypes and reproduced its results with the supervised machine learning algorithms random forest (RF) and support vector machine (SVM), achieving accuracies of 0.95 and 0.92, respectively.
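The accuracy figures quoted here measure agreement between a classifier's predicted subtype and a reference subtype call. The following is a minimal sketch of that computation in plain Python, assuming the genefu calls serve as the reference labels. The function names `accuracy` and `confusion_counts` and the toy label lists are hypothetical illustrations, not the code or data of the study.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Return the fraction of samples whose predicted label equals the reference label."""
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    if not reference:
        raise ValueError("label lists must not be empty")
    matches = sum(r == p for r, p in zip(reference, predicted))
    return matches / len(reference)


def confusion_counts(reference, predicted):
    """Count (reference, predicted) label pairs to show where disagreements fall."""
    return Counter(zip(reference, predicted))


# Hypothetical toy labels (assumption: NOT data from the study).
reference_subtypes = ["LumA", "LumA", "LumB", "Basal", "Her2",
                      "Basal", "LumB", "LumA", "Normal", "Her2"]
predicted_subtypes = ["LumA", "LumA", "LumA", "Basal", "Her2",
                      "Basal", "LumB", "LumA", "Normal", "Her2"]

toy_accuracy = accuracy(reference_subtypes, predicted_subtypes)
toy_confusion = confusion_counts(reference_subtypes, predicted_subtypes)
print(f"accuracy = {toy_accuracy:.2f}")
```

Counting the label pairs alongside the overall accuracy shows which subtypes are confused with each other, which a single accuracy value cannot reveal.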
We then integrated the ancestry predictions into the supervised machine learning experiments, which yielded less favorable metrics: accuracies of 0.86 and 0.85 for RF and SVM, respectively. Additionally, we observed discrepancies among the assigned PAM50 subtypes, the gene expression patterns, and the ancestry predictions. To overcome possible biases in the genefu predictions, we investigated the clusters obtained with an unsupervised machine learning strategy based on the K-means algorithm. The results suggest novel gene expression-based subgroups within breast cancer tumors, in which part of the variance is explained by ancestry. In conclusion, this study highlights the importance of integrating gene expression profiles and ancestry information in BC clustering, contributing to the understanding of BC heterogeneity. The findings lay the groundwork for improved treatment decision-making in BC management.
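The abstract does not specify the number of clusters, the input features, or the K-means implementation used. As a rough illustration of the clustering step only, the sketch below implements the standard Lloyd iteration for K-means in plain Python. The function name `kmeans`, its parameters, and the two-dimensional toy points (stand-ins for per-sample expression profiles) are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster points with Lloyd's algorithm; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
    return labels, centroids


# Hypothetical toy data: two well-separated groups of samples.
low_group = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (0.3, 0.2)]
high_group = [(5.0, 5.1), (5.2, 4.9), (4.8, 5.3), (5.1, 5.0)]
labels, centroids = kmeans(low_group + high_group, k=2)
print(labels)
```

In practice the resulting cluster labels would be compared against the PAM50 calls and the ancestry estimates to see how much of the grouping each one accounts for. On real expression data, the choice of k and the feature scaling would both need to be justified separately.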