info:eu-repo/semantics/masterThesis
A domain adaptation method for text classification based on self-adjusted training approach
Author
IVAN GARRIDO MARQUEZ
Abstract
The amount of information grows rapidly every day, and most of it is kept in
digital text documents: web pages, posts on social networks, blogs, e-mails [39],
electronic books [17], and scientific publications [28]. Organizing and
categorizing all this text automatically is helpful for many
tasks. Supervised learning is the most successful approach for automatic
text classification. Supervised learning assumes that the training and test
sets come from the same distribution. Sometimes no labeled data are
available in the target domain; instead, we have a labeled data set from
a similar or related domain that we can use as an auxiliary domain. Although
the domains are similar, their feature spaces and distributions differ,
so the performance of a supervised classifier degrades. This situation is
called the domain adaptation problem. Domain adaptation algorithms
are designed to narrow the gap between the target domain distribution and
the auxiliary domain distribution. The semi-supervised technique of
self-training iteratively enriches the training set with data from the test
set. Using self-training for domain adaptation presents some challenges in
the text classification scenario. First, the feature space changes on each
iteration because new vocabulary is transferred from the target domain to the
training set. Second, a way to select the most confidently labeled instances
is needed, because adding wrongly labeled instances to the training set
harms the model. Many of the methods addressing this problem need
user-defined parameters, such as the number of instances selected per iteration
or the stopping criterion. Tuning these parameters for a real problem is a
problem in itself. In this work we propose a self-adjusting training
method, which adapts itself to the new distributions obtained during
the self-training process. The method integrates strategies to adjust its
own settings at each iteration. The proposed method obtains good results on
the thematic cross-domain text classification task: it reduces the error rate
by 65.13% on average relative to the supervised learning approach on the testing
dataset. It was also tested on cross-domain sentiment analysis, where it reduced
the error rate by 15.62% on average relative to the supervised learning approach
on the testing dataset. The performance obtained in the evaluation of the
proposed method is competitive with other state-of-the-art methods.
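
The generic self-training loop that the abstract builds on can be sketched as follows. This is a minimal illustration, not the thesis's method: the Naive Bayes classifier, the function names (`train_nb`, `predict_proba`, `self_train`) and the toy documents are all assumptions introduced here. In particular, the fixed `threshold` and `max_iter` are exactly the kind of user-defined parameters the proposed method is said to adjust on its own; the adjustment strategies are not described in the abstract, so they are not reproduced.

```python
import math
from collections import Counter


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model; the vocabulary is rebuilt from docs."""
    vocab = {w for d in docs for w in d.split()}
    classes = sorted(set(labels))
    word_counts = {c: Counter() for c in classes}
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        word_counts[y].update(d.split())
    return vocab, classes, word_counts, class_counts


def predict_proba(model, doc):
    """Return {class: posterior probability} for one document."""
    vocab, classes, word_counts, class_counts = model
    n_docs = sum(class_counts.values())
    log_scores = {}
    for c in classes:
        total = sum(word_counts[c].values())
        score = math.log(class_counts[c] / n_docs)
        for w in doc.split():
            if w in vocab:  # words outside the current vocabulary are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def self_train(aux_docs, aux_labels, target_docs, threshold=0.6, max_iter=10):
    """Move confidently labeled target documents into the training set."""
    train_docs, train_labels = list(aux_docs), list(aux_labels)
    pending = list(target_docs)
    for _ in range(max_iter):
        # Retraining on the enlarged set also enlarges the feature space,
        # because target-domain vocabulary enters with each moved document.
        model = train_nb(train_docs, train_labels)
        keep, moved = [], 0
        for doc in pending:
            probs = predict_proba(model, doc)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:  # hypothetical fixed confidence cut-off
                train_docs.append(doc)
                train_labels.append(label)
                moved += 1
            else:
                keep.append(doc)
        pending = keep
        if moved == 0 or not pending:  # hypothetical fixed stopping rule
            break
    return train_nb(train_docs, train_labels)


# Toy auxiliary (labeled) and target (unlabeled) domains, invented for illustration.
aux_docs = ["goal match team", "team score goal",
            "election vote party", "party vote senate"]
aux_labels = ["sports", "sports", "politics", "politics"]
target_docs = ["goal striker team", "striker penalty",
               "vote ballot party", "ballot recount"]

model = self_train(aux_docs, aux_labels, target_docs)
print(predict_proba(model, "penalty"))
```

In this toy run the word "penalty" never appears in the auxiliary domain; it only becomes a usable feature after target documents have been pseudo-labeled and added to the training set, which is the vocabulary-transfer effect the abstract mentions.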