Popular clustering-based algorithms such as neighbor-joining have a worst case time-complexity that is Ω(n3). These approaches do not scale well with the large amount of data that is being made available with current sequencing technologies. Additionally, clustering-based methods are not as accurate as methods based on tree search that optimize criteria such as likelihood. Recent advances in this field have been in the direction of dividing the data into smaller overlapping sets, constructing subtrees on each small set, and then combining the subtrees in order to recover the full tree. Aside from having lower time complexity, these approaches reduce reconstruction error, when compared with reconstruction based on the entire dataset.
One notable such supertree method is CLGrouping, in which the relevant sequence sets are defined as the neighborhood of each internal vertex of a minimum spanning tree (MST) that is constructed using all pairwise distances.
Our contributions are as follows. We extend CLGrouping to the case where there are multiple MSTs. We developed a similar MST-based method that builds subtrees using family-joining, a method for constructing generally labeled trees. Generally labeled trees are unrooted phylogenetic trees which may have taxa placed at internal vertices, and which may contain polytomies. We demonstrate on simulated trees with 1000 taxa that CLGrouping's method of partitioning sequences yields very small sequence sets, resulting in a significantly higher reconstruction error, when compared with trees built using relatively larger sequence sets.
There seems to be a trade-off between the error of reconstructing subtrees and the error of partitioning sequences using edges in the MST. We are currently developing a method that selects the edges of the MST, for partitioning the sequences in such a way that reconstruction error is minimized. This approach allows accurate reconstruction while maintaining the high degree of parallelism that is inherent to the divide-and-conquer approach.