bachelorThesis
Unsupervised machine learning for the classification of astrophysical X-ray sources
Author
Pérez Díaz, Víctor Samuel
Institution
Abstract
Context. The Chandra Source Catalog (CSC), which collects the X-ray sources detected by the Chandra X-ray Observatory throughout its history, is fertile ground for discovery, because many of the sources it contains have not been studied in detail. The CSC contains several types of sources, from young stellar objects (YSO) and binary systems to very distant quasars (QSO) and active galaxies with supermassive black holes in their cores. Among the potentially paradigm-changing sources that could be found in Chandra data are compact object mergers, extrasolar planet transits, and tidal disruption events. However, only a small fraction of the CSC sources have been classified. To conduct a thorough investigation of the CSC sources, and to be prepared for the upcoming very large X-ray surveys, we need to classify as many catalog sources as possible.

Aims. This work proposes an unsupervised learning approach to classify as many Chandra Source Catalog sources as possible, first exploring the advantages and limits of using only the available X-ray data. Unsupervised learning is particularly suitable given the vast number of detections that have not yet been independently classified. By clustering the source observations according to their similarities, and then associating these clusters with objects previously classified spectroscopically, we aim to propose a new methodology that provides a probabilistic classification for a large number of sources.

Methods. We employ unsupervised learning methods, first K-means and then, as our main focus, Gaussian mixture models, applied to a list of X-ray properties, to probabilistically classify high-energy sources in the CSC. We achieve this by associating specific clusters with those CSC objects that have a classification in the SIMBAD database, and then assigning probabilistic classes by association to the unclassified objects in each cluster with an algorithm based on the Mahalanobis distance.

Results. We successfully identify clusters of previously classified objects that likely belong to the same class. Even within groups dominated by a single type of source, such as "galaxies", "QSO", or "YSO", we find sub-classes related to their distinct variability and spectral properties. The result of this exercise is a robust probabilistic classification (i.e., a posterior over classes) for 10090 CSC sources. The tables for each cluster and the corresponding code are available at https://github.com/BogoCoder/astrox.

Conclusions. We developed a methodology to provide probabilistic class assignments for numerous X-ray sources in the Chandra Source Catalog. This process shows that it is possible to construct a pipeline based on unsupervised machine learning for this task. Our approach works well for certain general types of sources, such as YSOs or extragalactic sources. In other cases, the number of classes present in a particular cluster is ambiguous, with very different predominant types within it. This ambiguity might be resolved by adding data from other wavelength regimes, such as optical data from the SDSS (Sloan Digital Sky Survey). This analysis is planned for future work. This thesis presents an early approach toward the final goal of classifying all CSC sources that lack a class.
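
The class-assignment step described under Methods can be illustrated with a minimal sketch. It is not the code from the linked repository, and every detail is an assumption made for illustration: the function names `mahalanobis_sq` and `class_probabilities`, the two synthetic features, the per-class mean and covariance estimates, and the `exp(-d²/2)` weighting that turns distances into normalised probabilities. The sketch also assumes that cluster membership has already been determined (for example by a Gaussian mixture model) and only shows how one unclassified source in a cluster might receive a soft class label from its labeled neighbours.

```python
import numpy as np


def mahalanobis_sq(x, mean, cov):
    """Squared Mahalanobis distance of point x from a distribution (mean, cov)."""
    diff = x - mean
    # Solve cov @ y = diff rather than inverting cov explicitly.
    return float(diff @ np.linalg.solve(cov, diff))


def class_probabilities(x, labeled_features, labeled_classes, ridge=1e-6):
    """Soft class assignment for one unclassified source inside one cluster.

    Hypothetical weighting: exp(-d^2 / 2), normalised over the classes
    present among the labeled cluster members. Each class needs at least
    two labeled members so that a covariance can be estimated.
    """
    labeled_classes = np.asarray(labeled_classes)
    names = np.unique(labeled_classes)
    d2 = []
    for cls in names:
        members = labeled_features[labeled_classes == cls]
        mean = members.mean(axis=0)
        # A small ridge term keeps the covariance invertible for small classes.
        cov = np.cov(members, rowvar=False) + ridge * np.eye(members.shape[1])
        d2.append(mahalanobis_sq(x, mean, cov))
    d2 = np.array(d2)
    # Subtract the minimum before exponentiating to avoid underflow to zero.
    weights = np.exp(-0.5 * (d2 - d2.min()))
    weights /= weights.sum()
    return {str(cls): float(w) for cls, w in zip(names, weights)}


# Synthetic stand-in for one cluster: two labeled classes in a 2-D feature
# space (the features are placeholders, not actual CSC properties).
rng = np.random.default_rng(0)
yso = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
qso = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(50, 2))
features = np.vstack([yso, qso])
classes = ["YSO"] * 50 + ["QSO"] * 50

# Probabilistic class for an unclassified source lying near the "YSO" group.
probs = class_probabilities(np.array([0.2, -0.1]), features, classes)
print(probs)
```

In this sketch a source close to the labeled "YSO" members receives most of the probability mass for that class. A source lying between the two groups would receive a more even split, which mirrors the ambiguity the abstract reports for clusters with several predominant types.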