Thesis
Medición de la efectividad de técnicas de imputación para datos faltantes
Date
2021-08-23
Citation:
Vinueza Chalco, Jamilton Daniel; Masaquiza Aragón, Galo Alexander. (2021). Medición de la efectividad de técnicas de imputación para datos faltantes. Escuela Superior Politécnica de Chimborazo. Riobamba.
Authors
Vinueza Chalco, Jamilton Daniel
Masaquiza Aragón, Galo Alexander
Abstract
This research measured the effectiveness, in terms of precision and estimation quality, of different imputation techniques for missing data drawn from a normal distribution. Using the Monte Carlo method, a bivariate matrix composed of observed and missing data was created, with the missing values generated through an established model. Representative samples of size 5, 10, 30 and 100 were each simulated 100,000 times under different percentages of information loss for three scenarios: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Imputation by elimination, mean, median and linear regression was applied. The fit of the imputed data was assessed with a precision measure, and the mean and variance estimators were used to verify whether the imputed data preserve the estimation properties of unbiasedness and minimum variance. Using the RStudio software, it was determined that linear regression is the most accurate technique for samples of size 30 and above, while the mean and median yield values closer to the real data in small samples such as size 5. Regarding the unbiasedness of the mean, the best technique is imputation by linear regression, since the property is maintained for samples of size 30 and above. Regarding the unbiasedness of the variance, the most viable technique is elimination: for samples of size 30 and 100 under MAR and MCAR, and for samples of any size under MNAR. In terms of the minimum variance of the mean and variance estimators, the technique that yielded the lowest variance in most contexts is linear regression. It is recommended to extend the study with multiple imputation and machine learning techniques in search of better results.
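The simulation design summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the thesis used RStudio, whereas this is plain Python using only the standard library. The function names, the correlation `rho = 0.7`, the 30% loss rate, the repetition count and the standard normal parameters are all assumptions chosen for illustration. Only the MCAR scenario is shown, because the abstract does not specify the missingness model or the precision measure.

```python
import random
import statistics


def simulate_pairs(n, rho, rng):
    """Draw n (x, y) pairs from a standard bivariate normal with correlation rho."""
    pairs = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y = rho * x + (1.0 - rho ** 2) ** 0.5 * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs


def make_mcar(pairs, n_missing, rng):
    """Blank out y in n_missing randomly chosen rows (MCAR); x stays observed."""
    missing = set(rng.sample(range(len(pairs)), n_missing))
    return [(x, None if i in missing else y) for i, (x, y) in enumerate(pairs)]


def impute(data, method):
    """Return the y column after applying one of the four techniques."""
    observed = [(x, y) for x, y in data if y is not None]
    ys = [y for _, y in observed]
    if method == "deletion":
        # Elimination: keep only the rows where y was observed.
        return ys
    if method == "mean":
        fill = statistics.fmean(ys)
        return [fill if y is None else y for _, y in data]
    if method == "median":
        fill = statistics.median(ys)
        return [fill if y is None else y for _, y in data]
    if method == "regression":
        # Ordinary least squares of y on x, fitted on the complete rows.
        xs = [x for x, _ in observed]
        mx, my = statistics.fmean(xs), statistics.fmean(ys)
        sxx = sum((x - mx) ** 2 for x in xs)
        slope = sum((x - mx) * (y - my) for x, y in observed) / sxx
        intercept = my - slope * mx
        return [intercept + slope * x if y is None else y for x, y in data]
    raise ValueError(f"unknown method: {method}")


def monte_carlo_bias(n, n_missing, method, reps=2000, rho=0.7, seed=1):
    """Estimate the bias of the sample mean and sample variance of y.

    The true mean of y is 0 and the true variance is 1 by construction.
    """
    rng = random.Random(seed)
    mean_estimates, var_estimates = [], []
    for _ in range(reps):
        data = make_mcar(simulate_pairs(n, rho, rng), n_missing, rng)
        y = impute(data, method)
        mean_estimates.append(statistics.fmean(y))
        var_estimates.append(statistics.variance(y))
    return statistics.fmean(mean_estimates) - 0.0, statistics.fmean(var_estimates) - 1.0


if __name__ == "__main__":
    for method in ("deletion", "mean", "median", "regression"):
        bias_mean, bias_var = monte_carlo_bias(n=30, n_missing=9, method=method)
        print(f"{method:>10}: bias(mean)={bias_mean:+.4f}  bias(var)={bias_var:+.4f}")
```

Extending the sketch to MAR or MNAR would mean replacing `make_mcar` with a rule that makes the chance of losing `y` depend on `x` or on `y` itself. The repetition count here is kept far below the 100,000 runs reported in the abstract so that the example finishes quickly.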