info:eu-repo/semantics/conferencePaper
Speech recognition using deep neural networks trained with non-uniform frame-level cost functions
Date
2017-11
Registered in:
2573-0770
Authors
Becerra de la Rosa, Aldonso
De la Rosa Vargas, José Ismael
González Ramírez, Efrén
Pedroza Ramírez, Ángel David
Martínez, Juan Manuel
Escalante, Nivia
Institution
Abstract
The aim of this paper is to present two new variations of the frame-level cost function for training a deep neural network in order to achieve better word error rates in speech recognition. The cost functions minimized by a neural network are a salient concern in machine learning research, and their improvement is a process of constant evolution. In the first proposed method, the conventional cross-entropy function is mapped to a non-uniform loss function based on its corresponding extropy (a complementary dual function), emphasizing the frames whose assignment to specific senones (tied-triphone states in a hidden Markov model) is ambiguous. The second proposal is a fusion of the proposed mapped cross-entropy and the boosted cross-entropy function, which emphasizes frames with low target posterior probability. The proposed approaches were evaluated on a personalized mid-vocabulary, speaker-independent voice corpus. This dataset is used for recognition of digit strings and personal name lists in Spanish from the north-central part of Mexico on a connected-words phone dialing task. The two proposed approaches obtain relative word error rate improvements of 12.3% and 10.7%, respectively, over the conventional, well-established cross-entropy objective function.
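The quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact mapping from cross-entropy to the extropy-based loss, so `extropy_weighted_cross_entropy` below is a hypothetical weighting chosen only for illustration, and the `(1 - p_target)` scaling in `boosted_cross_entropy` is an assumed form of the boosted cross-entropy rather than the paper's stated definition. All function names are illustrative.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(posteriors, target):
    """Conventional frame-level cross-entropy for a one-hot target senone."""
    return -math.log(max(posteriors[target], EPS))


def extropy(posteriors):
    """Extropy J(p) = -sum_i (1 - p_i) * log(1 - p_i), the complementary dual of entropy."""
    return -sum((1.0 - p) * math.log(max(1.0 - p, EPS)) for p in posteriors)


def boosted_cross_entropy(posteriors, target):
    """Assumed boosted form: cross-entropy scaled by (1 - p_target)."""
    return (1.0 - posteriors[target]) * cross_entropy(posteriors, target)


def extropy_weighted_cross_entropy(posteriors, target):
    """Hypothetical non-uniform loss: cross-entropy scaled by (1 + normalized extropy).

    This is NOT the paper's mapping; it only shows how ambiguous frames
    (high extropy) could receive a larger weight than confident ones.
    """
    n = len(posteriors)
    max_extropy = (n - 1) * math.log(n / (n - 1))  # extropy of the uniform distribution
    weight = 1.0 + extropy(posteriors) / max_extropy
    return weight * cross_entropy(posteriors, target)


# Two made-up frames over four senones; index 0 is the target senone.
confident = [0.97, 0.01, 0.01, 0.01]
ambiguous = [0.40, 0.35, 0.15, 0.10]

for name, frame in (("confident", confident), ("ambiguous", ambiguous)):
    print(
        name,
        cross_entropy(frame, 0),
        boosted_cross_entropy(frame, 0),
        extropy_weighted_cross_entropy(frame, 0),
    )
```

Under these assumptions, the ambiguous frame receives a larger extropy-based weight than the confident one, while the boosted form shrinks the loss of frames whose target posterior is already high.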