Background: Co-circulation of different microbial strains in the same host population or region is common (e.g. influenza virus). High-throughput sequencing (HTS) is very efficient, but its relatively short reads must be assembled into longer or complete biological sequences for downstream analysis. Assembly is challenging for samples in which multiple genetically similar microbial strains co-infect or co-exist. Typical de novo assembly methods, which rely largely on overlapping regions between reads, carry a high risk of mis-assembling short reads from similar strains into artificial recombinant sequences. Conventional reference-based assembly methods rely on pre-selecting the correct genome sequences as reference templates, which is often difficult.
Methods: To address this problem, we propose and implement an algorithm that efficiently and accurately assembles short reads into the genomes of different strains with the aid of phylogenies built from database sequences. The method requires no template selection and is expected to be less error-prone than de novo assembly, which relies on overlapping regions. Here, we demonstrate the utility of this novel method on influenza A virus samples. Mock co-infections were generated by mixing two or more sets of HTS reads simulated from different influenza virus strains. The mixed reads were then assembled by our phylogeny-based method, a conventional de novo method (Velvet), and reference-based methods (Bowtie2, BWA) for performance comparison.
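The construction of a mock co-infection can be illustrated with a minimal sketch. It is not the simulation pipeline used in this study: it assumes error-free, fixed-length reads sampled uniformly from each genome, and the function names (`simulate_reads`, `mock_coinfection`), the toy sequences, and all parameter values are hypothetical.

```python
import random


def simulate_reads(genome, n_reads, read_len, rng):
    """Sample error-free, fixed-length reads uniformly from one genome.

    Assumption: no sequencing errors and no coverage bias; a real read
    simulator would model both.
    """
    if read_len > len(genome):
        raise ValueError("read_len exceeds genome length")
    last_start = len(genome) - read_len
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


def mock_coinfection(genomes, reads_per_strain, read_len, seed=0):
    """Pool reads simulated from several strains and shuffle them.

    Shuffling removes any ordering that would reveal the strain of
    origin to the assembler under test.
    """
    rng = random.Random(seed)
    pool = []
    for name, seq in genomes.items():
        pool.extend(simulate_reads(seq, reads_per_strain[name], read_len, rng))
    rng.shuffle(pool)
    return pool


# Toy example: two similar "strains" differing at a single position.
strains = {
    "strain_A": "ACGTACGTTAGCCGATAGGCTA",
    "strain_B": "ACGTACGTTAGCCGTTAGGCTA",
}
reads = mock_coinfection(
    strains, {"strain_A": 4, "strain_B": 3}, read_len=8, seed=1
)
```

Varying the entries of `reads_per_strain` changes the mixing proportion of the strains, which is one way such a mock sample could be made more or less balanced.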
Results: The coverage and accuracy of the genomes assembled by our algorithm were as high as those obtained by reference-based assembly when the correct number and strains of reference genomes were known. Our method outperformed de novo assembly, and it outperformed reference-based assembly when incorrect strains were used as templates. Our algorithm also simultaneously determined the positions of the assembled genomes in the global phylogeny. These results show that our phylogeny-based method is a useful alternative to existing assembly methods.