Conference proceedings
Using a hybrid approach to build a pronunciation dictionary for Brazilian Portuguese
Date
2014-09
Registered in:
Annual Conference of the International Speech Communication Association, 15th, 2014, Singapore.
1990-9770
Authors
Mendonça, Gustavo
Aluisio, Sandra Maria
Institution
Abstract
This paper describes the method employed to build a machine-readable pronunciation dictionary for Brazilian Portuguese. The dictionary uses a hybrid approach to grapheme-to-phoneme conversion, combining manual transcription rules with machine learning algorithms. Its word list was compiled from the Portuguese Wikipedia dump: Wikipedia articles were converted to plain text and tokenized, and word types were extracted. A language identification tool was developed to detect loanwords in the data. Syllable boundaries and stress were identified for each word. The transcription task was carried out in a two-step process: i) words are submitted to a set of transcription rules, in which predictable graphemes (mostly consonants) are transcribed; ii) a machine learning classifier is used to predict the transcription of the remaining graphemes (mostly vowels). The method was evaluated through 5-fold cross-validation; results show an F1-score of 0.98. The dictionary and all the resources used to build it were made publicly available.
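The two-step transcription described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the rule table `CONSONANT_RULES`, the `ContextClassifier` class (a majority-vote lookup standing in for the unspecified machine learning classifier), the one-to-one grapheme-to-phoneme alignment, and the three training words are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rule table: graphemes assumed to map predictably to one phoneme.
CONSONANT_RULES = {"p": "p", "b": "b", "t": "t", "d": "d",
                   "f": "f", "v": "v", "m": "m", "n": "n", "l": "l"}


def apply_rules(word):
    """Step i: transcribe rule-covered graphemes; leave None for the rest."""
    return [CONSONANT_RULES.get(grapheme) for grapheme in word]


class ContextClassifier:
    """Toy stand-in for the classifier: majority vote per grapheme context."""

    def __init__(self):
        self.by_context = defaultdict(Counter)   # (previous, current, next) -> phonemes
        self.by_grapheme = defaultdict(Counter)  # fallback: current grapheme only

    def fit(self, examples):
        # Assumes each word is aligned one grapheme to one phoneme.
        for word, phones in examples:
            padded = "#" + word + "#"
            for i, phone in enumerate(phones):
                context = (padded[i], padded[i + 1], padded[i + 2])
                self.by_context[context][phone] += 1
                self.by_grapheme[word[i]][phone] += 1
        return self

    def predict(self, word, i):
        padded = "#" + word + "#"
        context = (padded[i], padded[i + 1], padded[i + 2])
        votes = self.by_context.get(context) or self.by_grapheme.get(word[i])
        # Unseen graphemes are passed through unchanged.
        return votes.most_common(1)[0][0] if votes else word[i]


def transcribe(word, classifier):
    """Step ii: fill the positions the rules left open with classifier output."""
    partial = apply_rules(word)
    return [phone if phone is not None else classifier.predict(word, i)
            for i, phone in enumerate(partial)]


# Illustrative training data only, not taken from the dictionary.
TRAINING = [("pato", ["p", "a", "t", "u"]),
            ("bola", ["b", "ɔ", "l", "a"]),
            ("dedo", ["d", "e", "d", "u"])]

classifier = ContextClassifier().fit(TRAINING)
print(transcribe("dado", classifier))
```

Splitting the work this way keeps the deterministic cases out of the learned model, so the classifier only has to decide the context-dependent graphemes, which the abstract identifies as mostly vowels.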