Bachelor thesis
Automatic determination of the learning rate for multivariate and multinomial regression models
Author
Acosta Fajardo, Manuela
Institution
Abstract
Throughout the years, artificial intelligence has developed into a widely researched and applied field, as a result of significant advancements in technology and the expansion of computing resources. Artificial intelligence attempts not only to understand how the human mind works, but also to develop systems that can mimic human behaviour. Machine learning is one of the main branches of artificial intelligence; it aims to build and improve models that can learn from a set of data, and from experience, via computational methods, with no need to be explicitly programmed. Machine learning algorithms build models based on sample data in order to make predictions or decisions, and are used in applications such as medicine, computer vision and image classification, among others. A machine learning algorithm is a program that finds patterns or makes predictions from previously unseen data. Depending on the goals of the algorithm, as well as on the data used, there are different types of learning: supervised learning, unsupervised learning and reinforcement learning. One of the most common optimization algorithms is Gradient Descent, which is used to find a local minimum of a differentiable function. It works by taking repeated steps in the direction opposite to the gradient of the function. The size of the steps taken by the gradient descent algorithm is determined by a hyper-parameter known as the learning rate. This parameter indicates how quickly or slowly the algorithm moves towards its optimal parameters. Usually, it is set manually. However, in order to reach the function's minimum it is necessary to set an appropriate learning rate, i.e. one that is neither too large nor too small. In the first case, the steps taken are too big and the algorithm can diverge as a consequence. On the contrary, if the learning rate is too small, learning is slow and the algorithm may also fail to converge.
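The effect of the learning rate described above can be sketched with a minimal gradient descent loop on the toy function f(x) = x² (a generic illustration, not the thesis's own code; all names here are hypothetical):

```python
def gradient_descent(grad, x0, lr, steps):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# f(x) = x**2 has gradient 2*x and its minimum at x = 0.
grad = lambda x: 2 * x

x_slow = gradient_descent(grad, x0=1.0, lr=0.01, steps=50)  # too small: slow progress
x_good = gradient_descent(grad, x0=1.0, lr=0.5, steps=50)   # well chosen: converges
x_bad = gradient_descent(grad, x0=1.0, lr=1.1, steps=50)    # too large: diverges
```

For this quadratic the update is x ← (1 − 2·lr)·x, so the iterates shrink only when |1 − 2·lr| < 1, i.e. lr < 1; with lr = 1.1 the magnitude grows by a factor of 1.2 each step.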
Most of the time fast learning is desired, so high learning rates might be selected. Nevertheless, it is important to select a proper value for this parameter in order to guarantee the convergence of the algorithm. A method to determine an upper bound for the learning rate of models based on linear regression was presented in (Ruiz, 2021), through an analysis of the gradient descent algorithm as a discrete dynamical system. This thesis aims to extend these results to models based on classification and multinomial regression. We also seek to find an optimal value of the learning rate for these methods. Throughout this thesis, an algorithm that automatically determines an optimal value for the learning rate of classification and regression models is developed. In the first place, the results obtained for linear regression models are generalized to other activation functions. As a result, an upper bound and an optimal value for the learning rate are found for models using regression and classification. Then, the results obtained are extended to a multinomial regression model. We propose an analysis of gradient descent as a discrete dynamical system, where the upper bound arises as a criterion for the stability of this system. In addition, we present an optimal value for the learning rate, which minimizes the sum of the distances of the extreme poles of the dynamical system studied. This analysis is done by linearizing the gradient descent algorithm and applying it to linear, logistic and multinomial regression. The upper bound and the optimal value of the learning rate are approximations to the value that guarantees the fastest convergence of the algorithm. We present simulations and experiments to test the results obtained. We first test them on toy examples, manually creating the data to study the behaviour of the algorithm for the linear and logistic regression models. Then, we validate our approach on real datasets.
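The flavour of such a stability bound can be sketched for the linear regression case. This is the classical result for gradient descent on the mean-squared-error loss, not necessarily the thesis's exact expression: the update w ← w − (η/n)·Xᵀ(Xw − y) is a linear discrete dynamical system, stable exactly when η < 2/λ_max, where λ_max is the largest eigenvalue of XᵀX/n. A minimal numerical sketch under these assumptions:

```python
import numpy as np

# Synthetic linear regression data (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

n = X.shape[0]
# Stability bound: eta must stay below 2 / lambda_max(X^T X / n).
lam_max = np.linalg.eigvalsh(X.T @ X / n).max()
eta_max = 2.0 / lam_max

def run_gd(eta, steps=500):
    """Gradient descent on the MSE loss (1/2n)*||Xw - y||^2."""
    w = np.zeros(3)
    for _ in range(steps):
        w = w - (eta / n) * X.T @ (X @ w - y)
    return w

w_stable = run_gd(0.9 * eta_max)    # just below the bound: converges
w_unstable = run_gd(1.1 * eta_max)  # just above the bound: diverges
```

Here the bound comes directly from requiring that every eigenvalue of the iteration matrix I − (η/n)·XᵀX lie inside the unit circle, which is the discrete-dynamical-system stability criterion the abstract refers to.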
The results show that, although the maximum learning rate given by the upper bound seems to make the algorithm converge faster than the optimal learning rate in the logistic and multinomial cases, it is better to use the latter value, as it guarantees a smooth and relatively fast convergence to the minimum in all cases.