Oral Presentation Society for Molecular Biology and Evolution Conference 2016

Rapid identification of phylogenetically informative data from high-throughput sequencing reads (#182)

Rachel S Schwartz 1, Reed A Cartwright 1
  1. Arizona State University, Tempe, Arizona, United States

Datasets for phylogenetics have grown dramatically in recent years. However, while large datasets contain a great deal of phylogenetic information, they also contain noise. The challenge for phylogenomics is to extract the informative data from large datasets rapidly and efficiently. We have developed easy-to-use, open-source software called SISRS (Site Identification from Short Read Sequences) to identify phylogenetically informative data directly from raw reads, and have demonstrated the success of this approach. SISRS assembles a composite reference genome consisting primarily of loci that are conserved across species. Aligning reads to this genome and calling genotypes results in a large dataset of phylogenetically informative sites. We have evaluated approaches to generate the composite genome and thereby identify phylogenetically informative regions of the genome. To date, SISRS has been overly conservative in calling sites in order to avoid the downstream effects of erroneous base calls caused by sequencing and alignment error. This approach results in a significant loss of information. We have evaluated approaches to jointly genotype samples given read information to produce a larger number of accurately called genotypes. Additionally, we discuss the potential for using read information to jointly infer genotypes and phylogeny. By identifying conserved yet variable loci directly from raw sequence data, we can provide accurate alignments for phylogenetic analysis at any taxonomic level. New approaches to rapidly identify these loci and the sequence of each locus for each species will allow us to generate accurate, well-resolved phylogenies, particularly for non-model organisms lacking reference genomes.
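
The conservative site-calling step described above can be illustrated with a minimal Python sketch. This is not the SISRS implementation: the function names (`call_base`, `informative_sites`), the pileup layout, and the thresholds (`min_coverage`, `max_missing`) are all assumptions made for illustration. The sketch calls a base for a species only when enough reads cover the site and all of them agree, then keeps sites where the called bases differ among species.

```python
from collections import Counter


def call_base(reads, min_coverage=3):
    """Conservatively call a base from the reads covering one site.

    `reads` is a string of observed bases, e.g. "AAAA". Returns the base
    only if coverage is sufficient and every read agrees; otherwise None.
    (Hypothetical rule for illustration, not the SISRS criterion.)
    """
    if len(reads) < min_coverage:
        return None
    counts = Counter(reads)
    if len(counts) != 1:
        return None  # any disagreement among reads: leave the site uncalled
    return reads[0]


def informative_sites(pileups, min_coverage=3, max_missing=0):
    """Keep sites that are variable among species after conservative calling.

    `pileups` maps site -> {species: string of read bases} (assumed layout).
    A site is kept if at most `max_missing` species are uncalled and the
    called bases are not all identical.
    """
    kept = {}
    for site, by_species in pileups.items():
        calls = {sp: call_base(r, min_coverage) for sp, r in by_species.items()}
        called = [b for b in calls.values() if b is not None]
        missing = len(calls) - len(called)
        if missing <= max_missing and len(set(called)) > 1:
            kept[site] = calls
    return kept


# Toy data: positions on a single hypothetical composite-genome locus.
example_pileups = {
    10: {"sp1": "AAAA", "sp2": "AAAA", "sp3": "GGGG"},  # variable, all called
    11: {"sp1": "CCCC", "sp2": "CCCC", "sp3": "CCCC"},  # invariant
    12: {"sp1": "TTTT", "sp2": "TTCT", "sp3": "CCCC"},  # sp2 reads disagree
    13: {"sp1": "GG", "sp2": "GGGG", "sp3": "GGGG"},    # sp1 low coverage
}

strict = informative_sites(example_pileups)
relaxed = informative_sites(example_pileups, max_missing=1)
```

In this toy example the strict setting discards site 12 because a single discordant read leaves one species uncalled, while allowing one missing species recovers it. That is the kind of information loss the abstract attributes to overly conservative calling.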