Oral Presentation Society for Molecular Biology and Evolution Conference 2016

Beyond software tuning: scaling up comparative coding sequence analysis using approximations and models that adapt their complexity to the data. (#161)

Sergei LK Pond 1, Spencer V Muse 2, Ben Murrell 3
  1. Temple University, Philadelphia, PA, United States
  2. Statistics, North Carolina State University, Raleigh, NC, United States
  3. Medicine, University of California, San Diego, CA, United States

A practical upper bound on the number of sequences that can be analyzed with many popular comparative methods is ~10³, especially if codon-substitution models are used. This number can be raised by several orders of magnitude, enabling the study of gene-sized alignments with 10⁴ to 10⁵ sequences, much more extensive model testing, or the implementation of more realistic models with added complexity.

We describe a relatively general approximation technique that limits the number of expensive likelihood function evaluations a priori by discretizing a part of the parameter space to a fixed grid, estimating the other parameters using much faster, simpler models, and integrating over the grid using MCMC or a variational Bayes approach. With FUBAR [1], we demonstrate how this technique can achieve 100× or greater speedups for detecting sites subject to positive selection, while improving statistical performance. Other analyses where there are only two or three parameters of interest (e.g. detection of directional selection in protein sequences) can be accommodated.
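The grid idea can be illustrated with a minimal Python sketch. It is not the FUBAR implementation: the Poisson "likelihood", the three-value rate grid, the Dirichlet concentration of 0.5, and all function names are assumptions chosen for illustration only. What it shows is the division of labor described above: every site is evaluated at every grid point once, up front, and the sampler (here a simple Gibbs sampler, one possible MCMC scheme) then reuses that table without any further likelihood evaluations.

```python
import math
import random


def poisson_pmf(count, rate):
    """Toy stand-in (assumption) for an expensive phylogenetic site likelihood."""
    return math.exp(-rate) * rate ** count / math.factorial(count)


def precompute_site_likelihoods(sites, grid):
    """Evaluate every site at every grid point once, before any sampling."""
    return [[poisson_pmf(syn, a) * poisson_pmf(nonsyn, b) for a, b in grid]
            for syn, nonsyn in sites]


def gibbs_grid_posteriors(site_lik, n_iter=600, burn_in=100,
                          concentration=0.5, seed=0):
    """Gibbs-sample grid weights; return per-site posteriors over grid points."""
    rng = random.Random(seed)
    n_sites, n_points = len(site_lik), len(site_lik[0])
    weights = [1.0 / n_points] * n_points
    post = [[0.0] * n_points for _ in range(n_sites)]
    for it in range(n_iter):
        counts = [0] * n_points
        for i, row in enumerate(site_lik):
            # Conditional distribution of this site's grid point given weights.
            probs = [w * lik for w, lik in zip(weights, row)]
            total = sum(probs)
            probs = [p / total for p in probs]
            counts[rng.choices(range(n_points), weights=probs)[0]] += 1
            if it >= burn_in:
                for k, p in enumerate(probs):
                    post[i][k] += p
        # Dirichlet draw for the weights via normalized gamma variates.
        draws = [rng.gammavariate(concentration + c, 1.0) for c in counts]
        scale = sum(draws)
        weights = [d / scale for d in draws]
    kept = n_iter - burn_in
    return [[p / kept for p in row] for row in post]


def prob_beta_exceeds_alpha(post_row, grid):
    """Posterior mass on grid points where the second rate exceeds the first."""
    return sum(p for p, (a, b) in zip(post_row, grid) if b > a)


# Hypothetical toy data: (synonymous, nonsynonymous) change counts per site.
rates = [0.25, 1.0, 4.0]
grid = [(a, b) for a in rates for b in rates]
sites = [(1, 0)] * 9 + [(0, 6)]
posteriors = gibbs_grid_posteriors(precompute_site_likelihoods(sites, grid))
for (syn, nonsyn), row in zip(sites, posteriors):
    print(syn, nonsyn, round(prob_beta_exceeds_alpha(row, grid), 3))
```

In this sketch the cost of the expensive step is fixed at (number of sites) × (number of grid points) evaluations regardless of how long the sampler runs, which is the property the approximation relies on.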

When discretization is not appropriate, it is often possible to develop methods that employ variable parametric complexity chosen with an information-theoretic criterion. For example, in the Adaptive Branch-Site Random Effects model [aBSREL, 2], we quickly select and apply models of different complexity to different branches in the phylogeny, and deliver statistical performance matching or exceeding best-in-class existing approaches, while running an order of magnitude faster.
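As a rough illustration of complexity selection with an information-theoretic criterion, the Python sketch below adds rate classes to one branch at a time and stops as soon as the score no longer improves. It is a sketch under stated assumptions, not the published procedure: small-sample AIC (AICc) is used as one example of such a criterion, the parameter count of 2k - 1 for k rate classes is an assumption, and the `fit` callback and the log-likelihood values are hypothetical.

```python
def aicc(log_lik, n_params, n_obs):
    """Small-sample corrected Akaike information criterion (lower is better)."""
    aic = -2.0 * log_lik + 2.0 * n_params
    return aic + 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)


def select_rate_classes(fit, n_obs, max_classes=3):
    """Greedy step-up search for the number of rate classes on one branch.

    fit(k) is a hypothetical callback returning the maximized log-likelihood
    with k rate classes; k classes are assumed to cost 2k - 1 parameters
    (k rates plus k - 1 free mixture weights).
    """
    best_k, best_score = 1, aicc(fit(1), 1, n_obs)
    for k in range(2, max_classes + 1):
        score = aicc(fit(k), 2 * k - 1, n_obs)
        if score >= best_score:
            break  # extra complexity is not supported by the data
        best_k, best_score = k, score
    return best_k


# Hypothetical maximized log-likelihoods per branch and number of classes.
branch_fits = {
    "branch_A": {1: -1200.0, 2: -1199.5, 3: -1199.4},
    "branch_B": {1: -1200.0, 2: -1185.0, 3: -1184.8},
}
for name, table in branch_fits.items():
    print(name, select_rate_classes(lambda k: table[k], n_obs=300))
```

The design choice illustrated here is that branches with little signal keep the cheapest model, so the extra parameters (and the extra optimization cost) are spent only where the data support them.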

  1. Ben Murrell, Sasha Moola, Amandla Mabona, Thomas Weighill, Daniel Sheward, Sergei L. Kosakovsky Pond, and Konrad Scheffler. FUBAR: A Fast, Unconstrained Bayesian AppRoximation for Inferring Selection. Mol Biol Evol (2013) 30 (5): 1196-1205.
  2. Martin D. Smith, Joel O. Wertheim, Steven Weaver, Ben Murrell, Konrad Scheffler, and Sergei L. Kosakovsky Pond. Less Is More: An Adaptive Branch-Site Random Effects Model for Efficient Detection of Episodic Diversifying Selection. Mol Biol Evol (2015) 32 (5)