Speech recognition using deep neural networks trained with non-uniform frame-level cost functions

Becerra de la Rosa, Aldonso; De la Rosa Vargas, José Ismael; González Ramírez, Efrén; Pedroza Ramírez, Ángel David; Martínez, Juan Manuel; Escalante, Nivia

dc.contributor	https://orcid.org/0000-0002-7337-8974
dc.contributor	https://orcid.org/0000-0002-8060-6170
dc.creator	Becerra de la Rosa, Aldonso
dc.creator	De la Rosa Vargas, José Ismael
dc.creator	González Ramírez, Efrén
dc.creator	Pedroza Ramírez, Ángel David
dc.creator	Martínez, Juan Manuel
dc.creator	Escalante, Nivia
dc.date.accessioned	2020-05-06T20:42:07Z
dc.date.available	2020-05-06T20:42:07Z
dc.date.created	2020-05-06T20:42:07Z
dc.date.issued	2017-11
dc.identifier	2573-0770
dc.identifier	http://ricaxcan.uaz.edu.mx/jspui/handle/20.500.11845/1894
dc.identifier	https://doi.org/10.48779/9ds7-t936
dc.description.abstract	The aim of this paper is to present two new variations of the frame-level cost function for training a deep neural network in order to achieve better word error rates in speech recognition. The minimization functions of a neural network are a salient concern in machine learning research, and their improvement is a process of constant evolution. In the first proposed method, the conventional cross-entropy function is mapped to a non-uniform loss function based on its corresponding extropy (a complementary dual function), emphasizing the frames whose assignment to specific senones (tied-triphone states in a hidden Markov model) is ambiguous. The second proposal is a fusion of the proposed mapped cross-entropy and the boosted cross-entropy function, which emphasizes those frames with low target posterior probability. The proposed approaches were evaluated on a personalized mid-vocabulary, speaker-independent voice corpus. This dataset is used for recognition of digit strings and personal name lists in Spanish from the north-central part of Mexico on a connected-words phone dialing task. Relative word error rate improvements of 12.3% and 10.7% are obtained with the two proposed approaches, respectively, over the conventional, well-established cross-entropy objective function.
dc.language	eng
dc.publisher	IEEE
dc.relation	generalPublic
dc.rights	http://creativecommons.org/licenses/by-nc-nd/3.0/us/
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States
dc.source	Proc. of the IEEE International Autumn Meeting on Power, Electronics and Computing (ROPEC2017), at Ixtapa, Mexico, pp. 1-6, 2017.
dc.title	Speech recognition using deep neural networks trained with non-uniform frame-level cost functions
dc.type	info:eu-repo/semantics/conferencePaper

This item belongs to the following institution

Universidad Autónoma de Zacatecas (México)