info:eu-repo/semantics/other
TRAJECTORY FORECASTING FOR AUTONOMOUS VEHICLES
Author
Juan Luis Baldelomar
Institution
Abstract
The following work focuses on trajectory forecasting for autonomous driving vehicles. This problem has been tackled from different perspectives and in different contexts; here, we focus on predicting the trajectories of driving agents. The problem has two main dimensions that must be modeled to obtain accurate results: the temporal dimension and the spatial dimension. Sequence-to-sequence models, such as those based on LSTM networks, have been widely used to model the temporal dimension. More recent approaches have explored Transformer attention mechanisms, following the remarkable results they achieved in the NLP field. To model the spatial relationships among agents, several works have proposed graph neural networks. However, attention mechanisms like the Transformer's have not been widely explored for modeling the spatial dimension of the problem, and doing so is the main contribution of our work.
We propose a model based on the Transformer architecture that tackles both the temporal and spatial dimensions of the problem, using one Transformer encoder for each. The model receives as input a scene with all the neighboring agents present at specific time steps, and it outputs the predicted trajectories of every agent in the scene; that is, it performs a joint prediction over all agents rather than predicting each agent's trajectory in isolation, which allows it to account for spatial relations in the sequence. The model works as follows. The first Transformer encoder, together with handcrafted CNN modules, extracts spatial features: the input is arranged so that this encoder processes the spatial relations between the agents present in a scene. Those spatial features then feed what we call the Temporal Transformer, which operates along the temporal dimension of the problem; this is achieved by transposing the temporal and spatial dimensions of the first encoder's output. The decoder then receives the output of the second encoder, as in a traditional Transformer model. The model is trained in an auto-regressive manner, as in the AgentFormer model~\cite{agentFormer}, because this showed significant improvements over the more classical Teacher Forcing approach.
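The spatial-then-temporal encoder stack described above can be sketched in a few lines of PyTorch. This is a minimal illustration of the axis-transposition idea only: the module names, layer counts, and dimensions are assumptions for the sketch, not the authors' actual implementation, and the handcrafted CNN feature extractor and the decoder are omitted.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Illustrative sketch: a spatial Transformer encoder followed by a
    temporal one, linked by a transpose of the agent and time axes."""

    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers)

    def forward(self, x):
        # x: (batch, time, agents, d_model) -- embedded agent states per frame
        b, t, a, d = x.shape
        # Spatial pass: attention runs across the agents present at each frame
        s = self.spatial(x.reshape(b * t, a, d)).reshape(b, t, a, d)
        # Transpose time and agent axes so attention can run along time
        s = s.transpose(1, 2).reshape(b * a, t, d)
        out = self.temporal(s).reshape(b, a, t, d).transpose(1, 2)
        return out  # (batch, time, agents, d_model), same shape as the input

# Example: 2 scenes, 8 time steps, 5 agents, 64-dim embeddings
x = torch.randn(2, 8, 5, 64)
y = SpatioTemporalEncoder()(x)
print(y.shape)  # torch.Size([2, 8, 5, 64])
```

Reshaping to `(batch * time, agents, d_model)` makes each frame's agents a separate "sequence" for the spatial encoder, while the transpose turns each agent's history into a sequence for the temporal encoder; the joint prediction over all agents falls out of keeping the whole scene in one tensor.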
We worked with two datasets to train and test our model.
Subjects
Related items
Showing items related by title, author or subject.
- Compendio de innovaciones socioambientales en la frontera sur de México (Adriana Quiroga)
- Caminar el cafetal: perspectivas socioambientales del café y su gente (Eduardo Bello Baltazar; Lorena Soto Pinto; Graciela Huerta Palacios; Jaime Gomez)
- Cambio social y agrícola en territorios campesinos. Respuestas locales al régimen neoliberal en la frontera sur de México (Luis Enrique García Barrios; Eduardo Bello Baltazar; Manuel Roberto Parra Vázquez)