A common question when exploring genetic data is whether the samples come from a structured population. For example, is the population a collection of geographically separated sub-populations, or does time better explain a shift in genetic signal? Principal Component Analysis (PCA) is a standard tool for exploring such relationships between genetic and demographic signal. However, when the data include ancient samples, sometimes only mitochondrial DNA (mtDNA) can be successfully recovered. Classical unsupervised methods based on PCA cannot be applied to mtDNA, leaving researchers without an efficient or well-understood method for exploring such data.
We suggest applying Multiple Correspondence Analysis (MCA) directly to mtDNA. MCA is an intuitive generalisation of PCA to categorical variables, and so can be used and presented in a similar way to standard nuclear DNA analyses. Applying MCA to mtDNA data produces a quantitative representation of the categorical data in fewer dimensions than the original alignment. Importantly, MCA is an unsupervised method, so no prior knowledge of the demographic structure of the population is required.
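To make the idea concrete, the following is a minimal sketch of MCA computed as correspondence analysis of an indicator (one-hot) matrix, assuming the alignment is held as a NumPy array of single-character states with one row per individual and one column per site. The function names `one_hot` and `mca_row_coordinates`, the toy alignment, and the choice to skip invariant sites are illustrative assumptions, not the implementation used in this work.

```python
import numpy as np

def one_hot(alignment):
    """Indicator matrix with one column per (site, observed state) pair."""
    columns = []
    for site in alignment.T:
        states = np.unique(site)
        if len(states) < 2:
            continue  # assumption: invariant sites carry no signal, so skip them
        columns.append((site[:, None] == states[None, :]).astype(float))
    return np.hstack(columns)

def mca_row_coordinates(alignment, n_components=2):
    """Principal coordinates of individuals and the principal inertias."""
    Z = one_hot(alignment)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses (individuals)
    c = P.sum(axis=0)                    # column masses (site/state pairs)
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)   # standardised residuals
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    coords = (U * s) / np.sqrt(r)[:, None]   # row principal coordinates
    return coords[:, :n_components], s[:n_components] ** 2

# Hypothetical toy alignment: 4 individuals, 4 sites (the third site is invariant).
toy_alignment = np.array([list("AAGT"), list("AAGC"), list("GTGC"), list("GTGT")])
coords, inertia = mca_row_coordinates(toy_alignment, n_components=2)
```

In this toy case the first axis separates the first two individuals from the last two, because the first two sites agree on that split. Real alignments would also need a decision on how to treat missing or ambiguous bases, which this sketch does not address.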
Using this representation, we apply a medoid-based clustering algorithm to explore genetic similarity and dissimilarity between individuals. We then show that the data and the clustering structure can be compared to any supplementary variables (such as latitude and longitude, time, or culture) to test for significant correlation.
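As a rough illustration of this step, the sketch below clusters low-dimensional coordinates with a simple alternating k-medoids scheme and then uses a label-permutation test to ask whether a numeric supplementary variable differs between clusters. Both choices are assumptions made for illustration: the text does not specify which medoid-based algorithm or which significance test is used, and the names `k_medoids` and `permutation_p_value` are hypothetical.

```python
import numpy as np

def k_medoids(points, k, n_iter=100, seed=0):
    """Alternating k-medoids on Euclidean distances (illustrative only)."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    medoids = rng.choice(len(points), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # The medoid is the member with the smallest total distance to its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids

def permutation_p_value(labels, covariate, n_perm=999, seed=0):
    """Permutation p-value for between-cluster differences in a numeric covariate."""
    rng = np.random.default_rng(seed)

    def between_ss(lab):
        # Between-cluster sum of squares of the covariate.
        grand_mean = covariate.mean()
        return sum(
            (lab == g).sum() * (covariate[lab == g].mean() - grand_mean) ** 2
            for g in np.unique(lab)
        )

    observed = between_ss(labels)
    exceed = sum(between_ss(rng.permutation(labels)) >= observed for _ in range(n_perm))
    return (exceed + 1) / (n_perm + 1)
```

Here `points` would be the individuals' coordinates (one row each) and `covariate` a supplementary variable such as sample age or latitude. A categorical variable such as culture would need a different statistic, for example one based on a contingency table.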
Finally, we apply our method to European human mtDNA spanning the Upper Paleolithic. It reveals evidence for a major shift in the human genetic landscape after the most recent ice age, consistent with other analyses.