Thesis
Compilación de un corpus paralelo español-inglés alineado a nivel de oraciones
Author
Posadas Durán, Juan Pablo Francisco
Institution
Abstract
One line of research in Natural Language Processing focuses on the alignment
of parallel texts. Aligned parallel texts are useful because they show
explicitly the relationship between the elements of a text in one language
and the elements of the same text translated into another language.
In this thesis, we propose a method for sentence alignment in parallel texts
written in Spanish and English. The method uses lexical and statistical
information in a dynamic programming framework. The lexical information comes
from a limited (incomplete), general-purpose bilingual Spanish-English
dictionary; the method also uses sentence length, measured in words and in
characters.
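To make the idea concrete, the following is a minimal sketch of dictionary-assisted, length-based sentence alignment by dynamic programming. It is not the thesis's actual cost function: the names `BEADS`, `bead_cost` and `align`, the penalty values 3.0 and 0.5, and the toy lexicon are all assumptions made for illustration.

```python
import re
from typing import Dict, List, Set, Tuple

# Allowed alignment types as (source sentences, target sentences) consumed:
# 1-1, omission, insertion, 2-1 and 1-2. This set is an assumption.
BEADS = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]


def words(sentences: List[str]) -> List[str]:
    """Lowercased word tokens of a group of sentences."""
    return re.findall(r"\w+", " ".join(sentences).lower())


def bead_cost(src: List[str], tgt: List[str], lexicon: Dict[str, Set[str]]) -> float:
    """Hypothetical cost (lower is better) mixing length and dictionary cues."""
    if not src or not tgt:
        return 3.0  # flat penalty for an omission or insertion (assumed value)
    sw, tw = words(src), words(tgt)
    sc, tc = sum(len(s) for s in src), sum(len(t) for t in tgt)
    word_diff = abs(len(sw) - len(tw)) / max(len(sw), len(tw), 1)
    char_diff = abs(sc - tc) / max(sc, tc, 1)
    tgt_set = set(tw)
    # Fraction of source words with no dictionary translation in the target.
    hits = sum(1 for w in sw if lexicon.get(w, set()) & tgt_set)
    lexical = 1.0 - hits / max(len(sw), 1)
    merge_penalty = 0.0 if len(src) == len(tgt) == 1 else 0.5  # assumed value
    return word_diff + char_diff + lexical + merge_penalty


def align(src: List[str], tgt: List[str],
          lexicon: Dict[str, Set[str]]) -> List[Tuple[List[int], List[int]]]:
    """Return the minimum-cost sequence of beads as index groups."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in BEADS:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = cost[i][j] + bead_cost(src[i:ni], tgt[j:nj], lexicon)
                if c < cost[ni][nj]:
                    cost[ni][nj], back[ni][nj] = c, (di, dj)
    # Walk the back-pointers from the end to recover the chosen beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        di, dj = back[i][j]
        beads.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    return beads[::-1]


# Toy example with a deliberately incomplete lexicon.
spanish = ["La casa es grande.", "Tiene un jardín.", "Me gusta mucho."]
english = ["The house is big and it has a garden.", "I like it a lot."]
toy_lexicon = {"casa": {"house"}, "grande": {"big"}, "jardín": {"garden"},
               "gusta": {"like"}, "mucho": {"lot"}}
result = align(spanish, english, toy_lexicon)
print(result)  # → [([0, 1], [0]), ([2], [1])]
```

In this toy case the first two Spanish sentences are merged into one English sentence, which is the kind of multiple alignment that pure length statistics tend to handle poorly; here the dictionary hits on "casa", "grande" and "jardín" pull the 2-1 bead below the competing paths.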
The proposed method was tested on a corpus of unbalanced literary texts
(texts in which multiple alignments, omissions and insertions are more
frequent), where it reached a precision above 90%. We compared our results
against those obtained by the Vanilla aligner system (which uses a
statistical approach) on the same corpus and found that the proposed method
is superior, particularly in cases of multiple alignments, omissions and
insertions.
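The abstract does not state how precision was computed. One common convention, assumed here, is the fraction of proposed alignments that also appear in a hand-made reference alignment. The function name `alignment_precision` and the example data are hypothetical.

```python
from typing import List, Tuple

Bead = Tuple[List[int], List[int]]  # (source sentence indices, target sentence indices)


def alignment_precision(proposed: List[Bead], reference: List[Bead]) -> float:
    """Share of proposed beads that exactly match a reference bead."""
    proposed_set = {(tuple(s), tuple(t)) for s, t in proposed}
    reference_set = {(tuple(s), tuple(t)) for s, t in reference}
    if not proposed_set:
        return 0.0  # nothing proposed, so nothing can be correct
    return len(proposed_set & reference_set) / len(proposed_set)


# Hypothetical example: the aligner misses an omission in the reference.
proposed = [([0, 1], [0]), ([2], [1]), ([3], [2])]
reference = [([0, 1], [0]), ([2], [1]), ([3], []), ([4], [2])]
precision = alignment_precision(proposed, reference)  # 2 of 3 proposed beads match
```

Exact matching of whole beads is a strict criterion; a partially correct multiple alignment counts as an error under this assumption.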
The results show that combining the lexical information contained in a
general-purpose bilingual dictionary with statistical information makes this
method more robust than purely statistical methods for sentence alignment in
texts whose translation is not technical.