Many population genetics models assume that if a variant is observed twice, the two observations result from identity by descent. However, as the number of sequenced individuals grows, so does the probability of observing two or more independent mutational events at the same site in the genome. Here, we describe an analysis of widespread mutational recurrence observed in exome sequence data from 60,706 individuals from the Exome Aggregation Consortium (ExAC). Recurrence is most pronounced among highly mutable CpG transitions: in this dataset, we observe over 60% of all possible synonymous CpG mutations and begin to saturate detection of these variants.
We find that approximately one-third of high-confidence validated de novo variants identified in external datasets of parent-offspring trios are also observed in the ExAC dataset, indicating that the same variant has arisen independently multiple times.
This process has a marked effect on the frequency spectrum in the ExAC data, resulting in a depletion of very low-frequency variants at sites with high mutation rates, even for synonymous sites. Specifically, we observe a strong correlation between site mutability inferred from sequence and singleton rates, as well as between site mutability and the probability of observing the variant in two separate populations.
We demonstrate that these patterns are only observed at sample sizes greater than approximately 20,000 individuals, indicating that ExAC is the first dataset in which this phenomenon can be observed. Finally, we propose a correction factor that accounts for the impact of mutational recurrence on the frequency spectra of various functional classes, which enables us to provide robust estimates of their deleteriousness. We note that with a moderately larger sample size, we will be able to infer selection against individual CpG variants.
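
As a rough illustration of why recurrence becomes more likely at highly mutable sites and in larger samples, the sketch below treats the number of independent mutational events at a site as Poisson-distributed. This is an illustrative assumption, not the model or the correction factor used in the analysis. The parameter `lam` (the expected number of independent events at a site across the sample's genealogy) and all function names are hypothetical.

```python
import math


def p_at_least_one(lam):
    """P(at least one mutational event) under an assumed Poisson(lam) model."""
    return -math.expm1(-lam)  # 1 - exp(-lam), stable for small lam


def p_at_least_two(lam):
    """P(two or more independent events) under the same assumed model."""
    return -math.expm1(-lam) - lam * math.exp(-lam)


def p_recurrent_given_observed(lam):
    """Among sites where a variant arose at all, the fraction with recurrence."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    return p_at_least_two(lam) / p_at_least_one(lam)


if __name__ == "__main__":
    # lam grows with both the per-site mutation rate and the sample size,
    # so larger values stand in for CpG sites and for larger cohorts.
    for lam in (0.01, 0.1, 1.0, 3.0):
        frac = p_recurrent_given_observed(lam)
        print(f"lam={lam:<5} P(recurrent | observed)={frac:.4f}")
```

Under this assumption the recurrent fraction is close to `lam / 2` when `lam` is small and approaches 1 as `lam` grows, consistent with recurrence being negligible in small samples and pronounced at highly mutable sites in large ones.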