Understanding the factors influencing mutation can improve mutation detection techniques, help identify diagnostic signatures of disease-causing mutagens, and facilitate the development of more accurate models of genetic divergence. Hypermutability of CpG demonstrates the existence of mutation motifs, sequences of flanking bases that influence point mutation processes. These motifs can thus be indicative of specific mutation mechanisms. Here, we report novel log-linear models for identifying mutation motifs that further allow comparisons of these motifs, and of the complete mutation spectra, between samples. Mutation motifs are visualised using a sequence logo-style method.
We applied the methods to examine each of the 12 possible point mutations in ~13.6 million human germline mutations (inferred from SNPs recorded in ENSEMBL) and ~181 thousand melanoma mutations from the COSMIC database.
Our method recovered the well-known CpG effect, which a conventional motif detection method failed to detect. We establish that all point mutations have significant and distinct mutation motifs. While the major effects of flanking bases lie within 2 bp of the mutated position, we refute previous reports that the effect magnitude decays monotonically with distance. Comparison between the autosomes and the X-chromosome supported a reduced contribution from methylation-induced C→T mutation on the X-chromosome, consistent with a previous prediction.
Analyses of malignant melanoma confirmed reported characteristic features of this cancer. These included strand asymmetry of mutation processes and neighbouring influences that differ significantly from those affecting germline mutations. Interestingly, the CpG effect was largely subsumed by different neighbouring mechanisms.
The statistical methods we report can be used to examine the influence of flanking sequence on mutation processes using polymorphism data. They further enable the identification of differences in the operation of mutation mechanisms between genomic regions, cell types or species. Our results have important implications for modelling context-dependent effects on sequence evolution.
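To illustrate the general idea of testing a flanking position with a log-linear model, the sketch below computes the deviance (G statistic) of the independence model for a 2 × 4 table of base counts at one flanking position, split by mutated versus control sites. It is a minimal illustration under stated assumptions and not necessarily the exact model reported here: the function name `independence_deviance`, the use of matched control sites as the reference, and all counts are hypothetical.

```python
import math

BASES = "ACGT"


def independence_deviance(mutated, control):
    """Deviance (G statistic) of the log-linear independence model.

    `mutated` and `control` map each base to the number of times it was
    observed at a single flanking position among mutated and control
    (unmutated reference) sites respectively. Under independence of
    base and mutation status, the statistic is approximately chi-square
    distributed with (2 - 1) * (4 - 1) = 3 degrees of freedom.
    """
    observed = [
        [mutated.get(b, 0) for b in BASES],
        [control.get(b, 0) for b in BASES],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)

    deviance = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            if obs == 0:
                continue  # a zero cell contributes nothing to the sum
            expected = row_totals[i] * col_totals[j] / total
            deviance += 2.0 * obs * math.log(obs / expected)
    return deviance


# Hypothetical counts at the 3' neighbour of mutated C sites (made-up data).
mutated_counts = {"A": 10, "C": 10, "G": 70, "T": 10}
control_counts = {"A": 25, "C": 25, "G": 25, "T": 25}
g_stat = independence_deviance(mutated_counts, control_counts)
print(f"deviance = {g_stat:.2f} on 3 degrees of freedom")
```

A large deviance relative to the chi-square distribution with 3 degrees of freedom would suggest that the base at that position is associated with mutation status, which is the kind of signal a mutation motif represents. Repeating the calculation across positions gives a rough picture of how the effect varies with distance from the mutated base.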