Dissertação de Mestrado
Impacto de Comunidades Sociais Online no Aprendizado de Idiomas
Fecha
2019-05-02Autor
Rafael Sales Medina Ferreira
Institución
Resumen
Reddit is a social network where users interested in a common subject may subscribe to communities, known as subreddits, were they can share content such as links, text and images as posts. Other users in the community can, in turn, comment and rate such posts. Subreddits for second language learning have been drawing attention from groups of users of the most diverse proficiency levels, who use these communities for sharing questions and tips on improving their level. It is important to analyse the content shared on such communities and understand how users interact in order to guide the design of new tools that can improve language learning as well as user experience. In this project, we analyse the network of interactions in subreddits for language learning, namely German, English, French and Spanish. We analyse the network of interactions in these subreddits and show that users are more focused on discussing the subject than on interpersonal interactions. This analysis also shows that most of the relationships between users are weak ties: when two users interact with each other, that interaction typically does not reoccur many times. This conclusion is corroborated by our analysis of centrality metrics of complex networks, which show that these networks do not share common features with traditional online social media. Moreover, we analyse threads and show that discussion topics do not have long reply threads. Instead, threads usually have many answers to the first post, none of those leading to longer discussions. Using subreddit German, where users are asked to inform their proficiency level, we show that a large fraction of the interactions take place between users of different proficiency levels. We use LIWC to extract linguistic features from posts published in this subreddit, which indicates that users with different proficiency levels write text with distinct textual features. This observation led us to investigate whether it is possible to categorize users proficiency based on their publications. Unfortunately, traditional classification models such as KNN and logistic regression result in low accuracy when predicting proficiency from textual features extracted from individual Reddit publications. To address this problem, we propose a new model, called SEMPLICe (SEquential Model for Proficiency cLassifICation), which considers both text features and users history of publications to classify proficiency level throughout time. Based on the premise that proficiency levels do not decrease, as long as the users are active on the interaction, SEMPLICe improves the F1 metrics from classic methods up to 29.6%. SEMPLICe uses dynamic programming in order to obtain linear complexity on the users interaction line.