A practical upper bound on the number of sequences that can be analyzed with many popular comparative methods is ~10^3, especially if codon-substitution models are used. This number can be raised by several orders of magnitude, enabling the study of gene-sized alignments with 10^4 to 10^5 sequences, much more extensive model testing, or the implementation of more realistic models with added complexity.
We describe a relatively general approximation technique that limits the number of expensive likelihood function evaluations a priori by discretizing part of the parameter space to a fixed grid, estimating the other parameters using much faster, simpler models, and integrating over the grid using MCMC or a variational Bayes approach. With FUBAR, we demonstrate how this technique can achieve 100× or greater speedups for detecting sites subject to positive selection, while improving statistical performance. Other analyses in which there are only 2-3 parameters of interest (e.g., detection of directional selection in protein sequences) can be accommodated.
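The grid idea can be illustrated with a minimal sketch. It is not the FUBAR implementation: it assumes the per-site likelihoods have already been computed once at every grid point, places a Dirichlet prior on the grid weights, and uses a plain Gibbs sampler as the MCMC step. The names `make_grid`, `toy_site_lik`, `gibbs_grid_weights`, and `prob_positive_selection` are hypothetical, and the Gaussian-shaped `toy_site_lik` is only a stand-in for the expensive codon-model likelihood.

```python
import math
import random

def make_grid(values):
    """All (alpha, beta) pairs on a square grid of rate values."""
    return [(a, b) for a in values for b in values]

def toy_site_lik(true_alpha, true_beta, grid, width=0.1):
    """Stand-in for a codon-model site likelihood (assumption, not a real model)."""
    return [math.exp(-((a - true_alpha) ** 2 + (b - true_beta) ** 2) / width)
            for a, b in grid]

def gibbs_grid_weights(site_lik, prior_conc=0.5, n_iter=1000, burn_in=200, seed=0):
    """Gibbs sampler over grid weights with the likelihood grid held fixed.

    site_lik[s][g] is the likelihood of site s at grid point g, computed once.
    Returns, per site, posterior grid-point probabilities averaged over the
    retained samples.
    """
    rng = random.Random(seed)
    n_sites, n_grid = len(site_lik), len(site_lik[0])
    weights = [1.0 / n_grid] * n_grid
    post = [[0.0] * n_grid for _ in range(n_sites)]
    kept = 0
    for it in range(n_iter):
        counts = [0] * n_grid
        keep = it >= burn_in
        for s in range(n_sites):
            # Conditional distribution of this site's grid point given weights.
            unnorm = [w * l for w, l in zip(weights, site_lik[s])]
            total = sum(unnorm)
            z = rng.choices(range(n_grid), weights=unnorm)[0]
            counts[z] += 1
            if keep:
                for g in range(n_grid):
                    post[s][g] += unnorm[g] / total
        # Dirichlet(prior_conc + counts) draw via normalized gamma variates.
        draws = [rng.gammavariate(prior_conc + c, 1.0) for c in counts]
        norm = sum(draws)
        weights = [d / norm for d in draws]
        if keep:
            kept += 1
    return [[p / kept for p in row] for row in post]

def prob_positive_selection(post, grid):
    """Posterior probability that beta > alpha at each site."""
    return [sum(p for p, (a, b) in zip(row, grid) if b > a) for row in post]

# Toy usage: 10 sites simulated near (alpha=0.5, beta=2.0), 10 near (2.0, 0.5).
grid = make_grid([0.5, 1.0, 2.0])
site_lik = ([toy_site_lik(0.5, 2.0, grid) for _ in range(10)]
            + [toy_site_lik(2.0, 0.5, grid) for _ in range(10)])
posterior = gibbs_grid_weights(site_lik)
p_pos = prob_positive_selection(posterior, grid)
print([round(p, 3) for p in p_pos])
```

The point of the design is that the sampler never calls the likelihood function: all expensive evaluations happen once, up front, and their number is fixed by the grid size rather than by the length of the MCMC run.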
When discretization is not appropriate, it is often possible to develop methods that employ variable parametric complexity chosen with an information-theoretic criterion. For example, in the Adaptive Branch Site Random Effects model [aBSREL, 2], we quickly select and apply models of different complexity to different branches in the phylogeny, and deliver statistical performance matching or exceeding best-in-class existing approaches, while running an order of magnitude faster.
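As a rough illustration of choosing complexity with an information-theoretic criterion, the sketch below greedily adds rate classes to a single branch while the small-sample corrected AIC (AICc) keeps improving. The choice of AICc, the greedy step-up rule, and the parameter count of 2k - 1 per branch with k classes (k rates plus k - 1 weights) are assumptions made for this example, not details stated above. `fit_log_lik` is a hypothetical callback that would return the maximized log-likelihood for a given number of classes.

```python
def aicc(log_lik, n_params, n_obs):
    """Small-sample corrected AIC; lower is better.

    Requires n_obs > n_params + 1.
    """
    correction = 2.0 * n_params * (n_params + 1) / (n_obs - n_params - 1)
    return -2.0 * log_lik + 2.0 * n_params + correction

def select_rate_classes(fit_log_lik, n_obs, base_params, max_classes=5):
    """Greedy step-up: add rate classes to a branch while AICc improves.

    fit_log_lik(k) is a hypothetical callback returning the maximized
    log-likelihood with k rate classes on the branch. A branch with k classes
    is assumed to contribute 2k - 1 parameters (k rates, k - 1 weights).
    """
    best_k = 1
    best_score = aicc(fit_log_lik(1), base_params + 1, n_obs)
    for k in range(2, max_classes + 1):
        score = aicc(fit_log_lik(k), base_params + 2 * k - 1, n_obs)
        if score >= best_score:
            break  # extra complexity is not supported; stop early
        best_k, best_score = k, score
    return best_k

# Toy usage with made-up log-likelihoods for k = 1..4 classes.
toy_fits = {1: -1000.0, 2: -990.0, 3: -989.5, 4: -989.4}
chosen = select_rate_classes(toy_fits.get, n_obs=500, base_params=10, max_classes=4)
print(chosen)
```

Stopping at the first non-improving step is what keeps the procedure fast: branches that are well described by a single rate class never incur the cost of fitting richer models.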