Poster Presentation Society for Molecular Biology and Evolution Conference 2016

Estimating Identical-By-Descent tracts from low coverage NGS data (#554)

Filipe G. Vieira 1, Anders Albrechtsen 2, Rasmus Nielsen 3
  1. Natural History Museum of Denmark, Copenhagen, Denmark
  2. Department of Biology, University of Copenhagen, Copenhagen, Denmark
  3. Department of Integrative Biology, University of California, Berkeley, Berkeley, USA

Genome-wide patterns of Identical-By-Descent (IBD) tracts and their variation across individuals provide valuable insight into human genetic diversity and evolutionary history. Methods have been developed to infer these tracts, but they are based on marker/genotype data, which have low error rates. Next Generation Sequencing (NGS) technologies have revolutionized research in evolutionary biology by both increasing speed and reducing costs. However, NGS data typically have high error rates arising from multiple factors (from random sampling of homologous alleles to sequencing or alignment errors). Furthermore, many studies rely on low coverage sequence data (< 5× per site per individual), so SNP and genotype calling are associated with considerable statistical uncertainty.

Recent methods rely on probabilistic frameworks to account for these errors, integrating the base quality scores together with other error sources to calculate an overall "genotype likelihood". This likelihood function can then be combined with a prior to calculate a posterior probability for the genotype. Here, we present a new Hidden-Markov-Model based method to estimate IBD tracts. It is especially suited to low coverage NGS data because it works with genotype likelihoods and so takes the uncertainty in the data into account. In addition to the IBD tracts, the method also estimates genome-wide inbreeding coefficients that can be used as priors in other analyses. We assess its performance both on simulated data and on a subset of the 1000 Genomes Project data, looking into several combinations of sample size and coverage, and show accurate inferences for sequencing coverages as low as 2×.
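
The step from base qualities to a genotype likelihood, and from there to a genotype posterior, can be illustrated with a minimal Python sketch. It is not the method presented in the poster: it assumes a simple biallelic site, independent reads, and an error model in which a miscalled base is equally likely to be any of the three other bases. The function names, the example alleles, and the use of an inbreeding coefficient to adjust a Hardy-Weinberg prior are illustrative assumptions.

```python
def genotype_likelihoods(bases, quals, ref="A", alt="G"):
    """Likelihoods of the observed bases for genotypes ref/ref, ref/alt, alt/alt.

    Assumes independent reads and a uniform miscall model (illustrative only).
    """
    liks = [1.0, 1.0, 1.0]
    for base, qual in zip(bases, quals):
        err = 10 ** (-qual / 10)  # Phred score -> error probability
        # Probability of the observed base given the true sampled allele
        p_given_ref = 1 - err if base == ref else err / 3
        p_given_alt = 1 - err if base == alt else err / 3
        liks[0] *= p_given_ref
        # A heterozygote yields either allele with probability 1/2
        liks[1] *= 0.5 * p_given_ref + 0.5 * p_given_alt
        liks[2] *= p_given_alt
    return liks


def genotype_prior(alt_freq, inbreeding=0.0):
    """Genotype prior from the allele frequency and an inbreeding coefficient F.

    With F = 0 this reduces to Hardy-Weinberg proportions.
    """
    p, q = 1 - alt_freq, alt_freq
    return [
        p * p + inbreeding * p * q,
        2 * p * q * (1 - inbreeding),
        q * q + inbreeding * p * q,
    ]


def genotype_posteriors(liks, prior):
    """Combine likelihoods and prior with Bayes' rule and normalise."""
    joint = [lik * pr for lik, pr in zip(liks, prior)]
    total = sum(joint)
    return [j / total for j in joint]


# Two reads at one site, both showing the reference base at Phred 20
liks = genotype_likelihoods(["A", "A"], [20, 20])
post = genotype_posteriors(liks, genotype_prior(alt_freq=0.2, inbreeding=0.1))
```

With only two reads the heterozygote keeps a non-negligible posterior probability, which is the kind of uncertainty that calling a single genotype discards at low coverage.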