A key application of phylogenetic methods is to characterize natural selection on proteins. However, their success depends on how well they capture evolutionary constraints. The dominant method of selecting models for an analysis is to compare statistical fit to observed data. Unfortunately, this provides limited information on which properties of coding sequences selection acts on.
Codon models typically assume that amino acid changes are, on average, selectively disfavoured, without considering the physicochemical properties of the amino acids involved. Empirical protein models, which derive exchangeabilities from alignments, seem comparatively appealing and may incorporate between-site heterogeneity to reflect intramolecular variation in constraints. Nevertheless, they often fit data no better than codon models. This need not imply that biophysical amino acid properties are unimportant, as empirical protein models also disregard site-specific preferences. Given these limitations, we approach the development of more structurally aware models from two angles. First, we consider mutation-selection models, which allow estimation of selective coefficients associated with different classes of amino acid change and of site-specific preferences. Second, and discussed here, we examine how well available models capture structural constraint at phylogenetically relevant timescales.
We therefore perform forward simulations on the SH2 domain, assessing how far the structure of the evolved sequence deviates from the native structure. For each time-point, we predict the structure of the evolved sequence using Rosetta and determine its root-mean-square deviation (RMSD) from the native structure. Selection criteria include: a) Exchangeabilities from LG08+4dG; b) Physicochemical distances (Grantham 1974); c) Site-specific amino acid preferences (Rodrigue et al., 2010); d) Fold stability based on contact affinities (Miyazawa and Jernigan, 1985), incorporating heterogeneity and epistasis; e) A coarse-grained biophysical model (Grahnen et al., 2012). Additionally, we examine how LG08 performs with and without heterogeneity, providing insight into how rate variation contributes to more realistic models. While some of these methods have seen limited use in a phylogenetic context, comparing them with established approaches allows us to take a broad view of useful selection criteria for future models.
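As an illustration of the forward-simulation loop described above, the following is a minimal sketch and not the implementation used in this work. It assumes a haploid population with Kimura's fixation probability, uniform single-site amino acid proposals, and a pluggable log-fitness function standing in for any one of the selection criteria. The names `fixation_probability`, `simulate`, and `toy_log_fitness`, the toy sequence, and all parameter values are hypothetical. Structure prediction with Rosetta and the RMSD calculation are not reproduced here; each recorded snapshot is simply a sequence that could be passed on to that step.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def fixation_probability(s, pop_size):
    """Kimura-style fixation probability of a new mutant with selection
    coefficient s in a haploid population of size pop_size (assumed model)."""
    if abs(s) < 1e-12:
        return 1.0 / pop_size          # neutral limit
    exponent = -2.0 * pop_size * s
    if exponent > 700.0:               # strongly deleterious: avoid overflow
        return 0.0
    return (1.0 - math.exp(-2.0 * s)) / (1.0 - math.exp(exponent))


def simulate(start_seq, log_fitness, n_steps, sample_every, pop_size, rng):
    """Propose single-site changes, fix them with the probability above,
    and record the current sequence at regular time-points.

    Returns a list of (step, sequence) pairs, starting with step 0."""
    seq = list(start_seq)
    current = log_fitness("".join(seq))
    snapshots = [(0, "".join(seq))]
    for step in range(1, n_steps + 1):
        site = rng.randrange(len(seq))
        old = seq[site]
        new = rng.choice([a for a in AMINO_ACIDS if a != old])
        seq[site] = new
        proposed = log_fitness("".join(seq))
        s = proposed - current         # selection coefficient (log-fitness scale)
        if rng.random() < fixation_probability(s, pop_size):
            current = proposed         # mutation fixes
        else:
            seq[site] = old            # mutation is lost
        if step % sample_every == 0:
            snapshots.append((step, "".join(seq)))
    return snapshots


# Toy placeholder sequence, NOT the SH2 domain.
NATIVE = "ACDEFGHIKLMNPQRS"


def toy_log_fitness(seq, penalty=0.02):
    """Hypothetical site-specific criterion: each departure from the
    placeholder native residue costs a fixed amount of log-fitness."""
    return -penalty * sum(a != b for a, b in zip(seq, NATIVE))


trajectory = simulate(NATIVE, toy_log_fitness, n_steps=2000,
                      sample_every=500, pop_size=100, rng=random.Random(1))
for step, sequence in trajectory:
    print(step, sequence)
```

Swapping `toy_log_fitness` for a function based on exchangeabilities, physicochemical distances, or a stability score would let the same loop be reused across criteria, with only the fitness function changing between runs.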