Genetic variation within the malarial parasite Plasmodium falciparum affects key phenotypes including drug resistance and risk of severe disease. Advances in technology and experimental protocols mean that it is now possible to obtain high-coverage genome sequencing data from routine blood samples taken in the field. However, interpreting such data is difficult because of high rates of mixed infections and highly variable data quality.
To provide a framework for analysing genetic variation in P. falciparum, the Pf3k project is working to build a global reference map of genome sequence and tools that can enable rapid analysis of data from field samples, with 2,512 samples available to date.
Here, we describe and validate methods for inferring the structure and identity of strains present in a sample by combining a reference panel of known haplotypes with data from an additional sequencing experiment. In particular, we describe Monte Carlo methods for inferring haplotypes present in a sample that generalise techniques developed for diploid samples, but which can cope with multiple strains and the over-dispersion of allele counts that results from experimental protocol. The approach is validated through analysis of experimentally generated mixed samples.
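As an illustration of the kind of likelihood such an approach needs, the following is a minimal sketch of an over-dispersed (beta-binomial) model for allele counts at a single biallelic site in a mixed sample. It is not the model used in this work: the function names, the `dispersion` and `error` parameters, and the choice of a beta-binomial distribution are all assumptions made for illustration.

```python
from math import lgamma, exp

def log_beta(a, b):
    """Log of the Beta function, via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def expected_alt_frequency(proportions, alleles, error=0.01):
    """Expected fraction of reads carrying the alternative allele.

    proportions: strain proportions within the sample (assumed to sum to 1).
    alleles:     0/1 allele carried by each strain at this site.
    error:       hypothetical per-read miscall rate (an assumption).
    """
    raw = sum(w * h for w, h in zip(proportions, alleles))
    return raw * (1 - error) + (1 - raw) * error

def beta_binomial_logpmf(k, n, p, dispersion):
    """Log-probability of k alternative reads out of n.

    The mean is p; smaller `dispersion` gives more over-dispersion,
    and the distribution approaches the binomial as it grows.
    Requires 0 < p < 1 and dispersion > 0.
    """
    a = p * dispersion
    b = (1 - p) * dispersion
    return log_choose(n, k) + log_beta(k + a, n - k + b) - log_beta(a, b)

# Two strains at 70% and 30%; only the first carries the alternative allele.
p = expected_alt_frequency([0.7, 0.3], [1, 0])
loglik = beta_binomial_logpmf(k=14, n=20, p=p, dispersion=20.0)
```

In a Monte Carlo scheme, a per-site term of this kind would be combined across sites with a prior on haplotypes drawn from the reference panel; that step is omitted here.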
When applied to the Pf3k data, our approach demonstrates substantial variation in local parasite population dynamics. For example, we find that while 642 out of 934 cases from Asia (about 69%) present evidence of infection by a single parasite strain, this is the case for only 657 out of 1,490 cases from Africa (about 44%). Moreover, we find evidence for substantial within-continent variation in population structure, indicative of epidemiological heterogeneity and the effects of drug-induced selection pressure.
Our results demonstrate the feasibility of inferring genome-wide patterns of haplotype structure in malarial parasites taken from clinical field samples and establish a resource for driving the development of new approaches for integrating population genetic and epidemiological modelling.