Poisson approximation for search of rare words in DNA sequences

Vergne, N; Abadi, M

dc.creator	Vergne, N
dc.creator	Abadi, M
dc.date	2008
dc.date	2014-11-16T16:28:27Z
dc.date	2015-11-26T16:24:07Z
dc.date.accessioned	2018-03-28T23:05:09Z
dc.date.available	2018-03-28T23:05:09Z
dc.identifier	Alea-latin American Journal Of Probability And Mathematical Statistics. Impa, v. 4, p. 223-244, 2008.
dc.identifier	1980-0436
dc.identifier	WOS:000208538900011
dc.identifier	http://www.repositorio.unicamp.br/jspui/handle/REPOSIP/70595
dc.identifier	http://www.repositorio.unicamp.br/handle/REPOSIP/70595
dc.identifier	http://repositorio.unicamp.br/jspui/handle/REPOSIP/70595
dc.identifier.uri	http://repositorioslatinoamericanos.uchile.cl/handle/2250/1268376
dc.description	Using recent results on the occurrence times of a string of symbols in a stochastic process with mixing properties, we present a new method for the search of rare words in biological sequences modelled by a Markov chain. We obtain a bound on the error between the distribution of the number of occurrences of a word in a sequence and its Poisson approximation. A global bound is already given by the Chen-Stein method. Our approach, the psi-mixing method, gives local bounds. Since we only need the error in the tails of the distribution, the global uniform bound of Chen-Stein is too large, and local bounds are preferable. This is the first time that local bounds have been devised for Poisson approximation. We search for two thresholds on the number of occurrences beyond which a studied word can be regarded as over-represented or under-represented. A biological role is suggested for these over- or under-represented words. Our method gives such thresholds for a much broader panel of words than the Chen-Stein method, which gives no result in a great number of cases where our method works. Comparing the methods, we observe better accuracy for the psi-mixing method in bounding the tails of the distribution. Our method can be used in domains other than biology as well. We also present the software PANOW (available at http://stat.genopole.cnrs.fr/sg/software/panow/), dedicated to the computation of the error term and the thresholds for a studied word.
dc.description	4
dc.description	223
dc.description	244
dc.language	en
dc.publisher	Impa
dc.publisher	Rio De Janeiro
dc.publisher	Brasil
dc.relation	Alea-latin American Journal Of Probability And Mathematical Statistics
dc.relation	ALEA-Latin Am. J. Probab. Math. Stat.
dc.rights	open access
dc.source	Web of Science
dc.subject	Poisson approximation
dc.subject	Chen-Stein method
dc.subject	mixing processes
dc.subject	Markov chains
dc.subject	rare words
dc.subject	DNA sequences
dc.title	Poisson approximation for search of rare words in DNA sequences
dc.type	Journal articles
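
The thresholds mentioned in the abstract can be illustrated with a minimal sketch. It is not the PANOW implementation and does not compute the psi-mixing or Chen-Stein bounds; it assumes the expected number of occurrences `lam` and a single additive error bound `error` are already known. The names `thresholds`, `lam`, `alpha` and `error` are hypothetical, and a uniform error is used only for simplicity, whereas the record describes local bounds that vary along the tails.

```python
import math


def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def poisson_cdf(k, lam):
    """P(N <= k) for N ~ Poisson(lam); 0 for negative k."""
    if k < 0:
        return 0.0
    return min(1.0, sum(poisson_pmf(i, lam) for i in range(k + 1)))


def thresholds(lam, alpha=0.01, error=0.0):
    """Return (under, over) occurrence thresholds for a word.

    under: largest count a with P(N <= a) + error <= alpha (None if none exists)
    over:  smallest count b with P(N >= b) + error <= alpha

    `error` is an ASSUMED additive bound on the Poisson approximation error.
    If it is not smaller than alpha, no threshold can be certified.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    if error >= alpha:
        return None, None

    # Walk up from 0 while the lower tail stays below the significance level.
    under = None
    a = 0
    while poisson_cdf(a, lam) + error <= alpha:
        under = a
        a += 1

    # Walk up until the upper tail P(N >= b) drops below the significance level.
    b = 0
    while 1.0 - poisson_cdf(b - 1, lam) + error > alpha:
        b += 1
    return under, b


if __name__ == "__main__":
    print(thresholds(10.0, alpha=0.01))
    print(thresholds(10.0, alpha=0.01, error=0.02))
```

In this sketch a word observed at most `under` times would be flagged as under-represented and one observed at least `over` times as over-represented. When the error bound reaches the significance level, the function returns no thresholds, which mirrors the abstract's remark that a bound that is too large yields no result.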

This item belongs to the following institution

Universidade Estadual de Campinas (Brasil)