bachelorThesis
Classificador de legibilidade de textos em língua inglesa (Readability classifier for English-language texts)
Date
2021-08-17
Citation:
SANGE, Levi Matheus Martins. Classificador de legibilidade de textos em língua inglesa. 2021. Trabalho de Conclusão de Curso (Bacharelado em Ciência da Computação) - Universidade Tecnológica Federal do Paraná, Medianeira, 2021.
Author
Sange, Levi Matheus Martins
Abstract
English has reached the status of a global language: it has been adopted as an intermediary for communication worldwide, owing to characteristics such as its extensive vocabulary, its combination of elements from other languages, and its ease of learning. Because of this status, fluency in English has become a requirement in several sectors and areas, and the number of people interested in attaining proficiency and mastery has grown accordingly. Reading and listening are among the ways to improve language skills, and texts expose the reader to many aspects of the language. However, searching unaided for reading material that matches the reader's level and knowledge of English can be very demotivating. Advances in artificial intelligence and natural language processing have made this search easier: applied to text datasets, these techniques extract content characteristics that allow texts to be categorized. In this work, Natural Language Processing was applied to two datasets freely available on the Internet, each containing thousands of words, and the machine learning algorithms Naive Bayes, Support Vector Machine, and Decision Trees were used to generate classifications. The main objective was to develop a readability classifier for English texts based on these algorithms. Metrics and textual characteristics extracted from each article in the Wikipedia and Simple Wikipedia datasets were analyzed, such as the number of syllables, the frequency of long and complex words, and readability formulas. Training the algorithms yielded an accuracy of 94.17%, with the J48 Decision Tree algorithm standing out; the most important textual attributes were the frequency of complex words, the Gunning Fog index, and auxiliary and "to be" verbs.
Pre-processed datasets and scripts were also produced and made freely available in an online repository to support future research and work in the area. The classifier makes it possible to build, for example, tools and content recommendation systems for people learning English as a second language or for those interested in developing their reading skills. These results open up room for further research and for the development of complementary free tools in the area of text readability and intelligibility using Natural Language Processing and Machine Learning.
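To illustrate one of the textual attributes named in the abstract, the following is a minimal sketch of the Gunning Fog index, using its commonly cited form: 0.4 × (average words per sentence + percentage of words with three or more syllables). It is not the thesis's own script. The function names `count_syllables` and `gunning_fog` are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed here for illustration only.

```python
import re

# Assumption: a syllable is approximated by a run of consecutive vowels.
VOWEL_GROUP = re.compile(r"[aeiouy]+")


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (heuristic, not exact)."""
    word = word.lower()
    groups = len(VOWEL_GROUP.findall(word))
    # Discount a trailing silent 'e' ("make"), but keep "-le" endings ("simple").
    if word.endswith("e") and not word.endswith("le") and groups > 1:
        groups -= 1
    return max(groups, 1)


def gunning_fog(text: str) -> float:
    """Gunning Fog index: 0.4 * (words per sentence + 100 * complex word ratio)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        return 0.0
    # Simplification: any word with three or more estimated syllables is "complex".
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))
```

A score computed this way could serve as one numeric feature per article, alongside the other attributes mentioned in the abstract, before a classifier is trained. The heuristic will miscount some words, so a dictionary-based syllable counter would be a reasonable replacement where precision matters.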