Development of a rule-based module of lexicogrammatical resources for surface realization in natural language generation tasks in Brazilian Portuguese
André Luiz Rosa Teixeira
Natural Language Generation (NLG), a sub-area of Natural Language Processing (NLP), has been on the agenda of both Computer Science and Linguistics for nearly a century. Interdisciplinary by nature, NLP draws on different disciplines, one of which is Applied Linguistics. Drawing on Systemic-Functional Linguistics (SFL), the area of Multilingual Studies as proposed by Matthiessen et al. (2008) contemplates an integration of Linguistics and related fields of investigation, such as Computational Linguistics. This field promotes the articulation of two modes of investigation: a reflexive mode, concerned with theorizing and describing language production with a view to contrasting languages, and an active mode, concerned with applying the findings of the reflexive mode (for instance, in the development of NLG programs). This allows Computational Linguistics to be investigated within the scope of SFL. The programming of lexicogrammatical resources for NLG in different languages, such as English, German, Chinese, Spanish, and Brazilian Portuguese, dates back to the last century. For Brazilian Portuguese, one SFL-based initiative is available; it models spatial language in the domain of tourist texts, together with the lexicogrammatical resources necessary for the realization of the clauses in the corpus of that study (see Oliveira (2013)). The resources developed for Brazilian Portuguese are thus not domain independent. This thesis draws on Computational Linguistics within a Systemic-Functional Linguistics framework and on Multilingual Studies (active mode of investigation) to explore Natural Language Generation as a resource for investigating meaning production processes. More specifically, its primary aim is to develop a rule-based, domain-independent textual realization module that covers the lexicogrammatical rank scale of Brazilian Portuguese, for applications in NLG. It also aims to carry out experiments that contrast the accuracy of the rule-based system with that of the artificial neural networks developed within the CoNLL-SIGMORPHON shared tasks (COTTERELL et al., 2017; COTTERELL et al., 2018) on verbal inflection, using domain-independent dev and test corpora of verbs, and to apply the rule-based module to the sub-task of textual realization of verbs in the pipeline of a local instance of the robot journalist @DaMataReporter.
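The accuracy comparison on verbal inflection described above can be illustrated with a minimal sketch. It is not the module developed in the thesis: the rule table, the function names `inflect` and `accuracy`, the feature labels, and the four-item gold list are all hypothetical, and the rules cover only regular verbs in the present indicative. Accuracy is computed as exact match against gold forms, the usual scoring for inflection tasks of this kind.

```python
# Hypothetical suffix rules: regular verbs, present indicative only.
# This is an illustration, not the rule set implemented in the thesis.
PRESENT_INDICATIVE = {
    "ar": {"1sg": "o", "2sg": "as", "3sg": "a", "1pl": "amos", "2pl": "ais", "3pl": "am"},
    "er": {"1sg": "o", "2sg": "es", "3sg": "e", "1pl": "emos", "2pl": "eis", "3pl": "em"},
    "ir": {"1sg": "o", "2sg": "es", "3sg": "e", "1pl": "imos", "2pl": "is", "3pl": "em"},
}


def inflect(lemma, person_number):
    """Attach the regular present-indicative suffix to the verb stem."""
    stem, conjugation = lemma[:-2], lemma[-2:]
    return stem + PRESENT_INDICATIVE[conjugation][person_number]


def accuracy(predict, gold):
    """Exact-match accuracy of predict(lemma, features) over (lemma, features, form) triples."""
    correct = sum(predict(lemma, feats) == form for lemma, feats, form in gold)
    return correct / len(gold)


# Toy gold data (assumed for illustration, not taken from the shared-task corpora).
GOLD = [
    ("falar", "3sg", "fala"),
    ("vender", "1pl", "vendemos"),
    ("partir", "3pl", "partem"),
    ("fazer", "1sg", "faço"),  # irregular: the regular rules yield "fazo"
]

print(accuracy(inflect, GOLD))  # 0.75
```

The irregular verb in the toy data shows why a rule-based inflector needs explicit exception handling to be competitive on open-domain verb lists.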
The lexicogrammatical resources for textual realization of Brazilian Portuguese were programmed in Python. The development made use of a trinocular perspective: "from above", examining semantic figures realized by clauses in the lexicogrammatical stratum; "from below", examining graphological patterns in the stratum of expression and modeling the constituency patterns along the rank scale, whereby units of a given rank function in the rank immediately above; and "from roundabout", modeling the systems that organize the meanings at each rank. The programming of the resources for surface realization drew on available Systemic-Functional descriptions of Brazilian Portuguese, and on relatively congruent systems of English when such descriptions were not available. The domain selected for the application of the rule-based module is the deforestation of the Legal Amazon in Brazilian territory, a sensitive and pressing matter broadly discussed internationally. @DaMataReporter, a robot journalist that posts data on the deforestation of the Amazon, is part of initiatives that aim at generating text from publicly available data to raise awareness about sensitive matters. This research has the potential to contribute to a) descriptive research, by validating the Systemic-Functional descriptions of systems in Brazilian Portuguese; b) theoretical research, insofar as it tests and validates descriptions that draw on Systemic-Functional theory; and c) applied research, in the educational field, by offering results that inform translator training, provide a basis for contrastive analysis of texts in contact through translation, and support language teaching, description, and linguistic theory. This research enabled the computational implementation of the main systems that organize units in the lexicogrammar of Brazilian Portuguese: at word rank, the verb and the noun, and functions for the realization of adverbs and prepositions; at group rank, the main systems that organize the nominal group, namely the taxonomy that organizes the Ente (Thing) and the systems of determinação (determination), classificação (classification), qualificação (qualification) and quantificação (quantification), and the verbal group, namely tipo de evento (event type), agência (agency), finitude (finiteness), tempo secundário (secondary tense), aspecto verbal (aspect), and dêixis modal (modal deixis), as well as functions for the construction of prepositional phrases and adverbial groups; at clause rank, the main systems: transitividade (transitivity); modo (mood: partially implemented, with declarative and polar interrogative options); tema (theme: partially implemented, with options for tema_default (default theme)). Results showed that rule-based development of lexicogrammatical resources for domain-independent textual realization of Brazilian Portuguese, in the scope of Systemic-Functional theorization of Computational Linguistics, can be a productive long-term alternative, as it allows greater control at this stage of the language generation pipeline, especially in applications in sensitive domains.
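As a rough illustration of how word-rank units can function in a group-rank unit, the sketch below realizes a simple nominal group with a definite determiner, a Thing, and an optional postposed adjective, with gender and number agreement. Everything in it is an assumption made for illustration: the names `Noun`, `realize_noun`, `realize_adjective`, and `realize_nominal_group`, the lookup table, and the simplified morphology (vowel-final nouns, adjectives ending in "-o") are hypothetical and do not reproduce the thesis's actual code or its system networks.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup for the definite determiner, keyed by (gender, number).
DEFINITE = {("m", "sg"): "o", ("f", "sg"): "a", ("m", "pl"): "os", ("f", "pl"): "as"}


@dataclass
class Noun:
    lemma: str   # singular form
    gender: str  # "m" or "f"


def realize_noun(noun: Noun, number: str) -> str:
    # Simplification: only vowel-final nouns, which pluralize with "-s".
    return noun.lemma + "s" if number == "pl" else noun.lemma


def realize_adjective(lemma: str, gender: str, number: str) -> str:
    # Simplification: only adjectives whose masculine singular ends in "-o".
    form = lemma[:-1] + "a" if gender == "f" else lemma
    return form + "s" if number == "pl" else form


def realize_nominal_group(thing: Noun, number: str = "sg",
                          definite: bool = True,
                          modifier: Optional[str] = None) -> str:
    """Order the word-rank units: determiner, Thing, postposed adjective."""
    words = []
    if definite:
        words.append(DEFINITE[(thing.gender, number)])
    words.append(realize_noun(thing, number))
    if modifier is not None:
        words.append(realize_adjective(modifier, thing.gender, number))
    return " ".join(words)


print(realize_nominal_group(Noun("floresta", "f"), "pl", modifier="amazônico"))
# as florestas amazônicas
```

Because each choice (definiteness, number, modification) is an explicit parameter, the output of every step can be inspected, which is the kind of control over realization that the results above point to.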