Poster Presentation Society for Molecular Biology and Evolution Conference 2016

Averaging over alternative multiple sequence alignments increases the accuracy of phylogenetic tree reconstruction (#537)

Haim Ashkenazy 1, Itamar Sela 1, Giddy Landan 2, Tal Pupko 1
  1. The Department of Cell Research and Immunology; George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
  2. Genomic Microbiology Group, Institute of Microbiology, Christian-Albrechts-University of Kiel, Kiel, Germany

The classic methodology for inferring a phylogenetic tree from sequence data consists of two steps. First, a multiple sequence alignment (MSA) is computed. Then, a tree is reconstructed assuming the MSA is correct. However, inferred MSAs have been shown to be inaccurate, and errors in them reduce the accuracy of tree inference. It was previously proposed that filtering unreliable alignment regions can increase the accuracy of tree inference. However, it was also demonstrated that the benefit of this filtering is often obscured by the resulting loss of phylogenetic signal.

In this work we explored an ad hoc approach in which, instead of relying on a single MSA, we generate a large set of alternative MSAs using the GUIDANCE2 methodology and concatenate them into a single super-MSA. By doing so, we account for phylogenetic signals contained in columns that are not present in the single MSA produced by alignment algorithms. Using simulations, we demonstrate that this approach results in more accurate trees compared to (1) using an unfiltered alignment and (2) using a single alignment with weights assigned to columns according to their reliability. Next, we explore in which regions of the MSA space our approach is expected to be beneficial. Finally, we provide a simple criterion for deciding whether the extra effort of computing a super-MSA and inferring a tree from it is worthwhile. We expect our methodology to be useful for many cases in which relatively diverged sequences are analyzed and applying the more computationally intensive statistical alignment approach is not feasible.
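
The concatenation step can be illustrated with a minimal Python sketch. It assumes the alternative MSAs have already been generated (for example by GUIDANCE2) and loaded as dictionaries mapping sequence names to aligned rows. The function name `concatenate_msas` and the toy alignments are hypothetical and are not taken from the authors' implementation.

```python
from typing import Dict, List


def concatenate_msas(msas: List[Dict[str, str]]) -> Dict[str, str]:
    """Join alternative alignments of the same sequences column-wise.

    Hypothetical helper for illustration only: each alignment maps a
    sequence name to its gapped row, and all alignments must cover the
    same set of sequences.
    """
    if not msas:
        raise ValueError("at least one alignment is required")
    taxa = list(msas[0])
    parts = {name: [] for name in taxa}
    for index, msa in enumerate(msas):
        if set(msa) != set(taxa):
            raise ValueError(f"alignment {index} has a different set of sequences")
        # Every row within one alignment must have the same number of columns.
        if len({len(row) for row in msa.values()}) != 1:
            raise ValueError(f"alignment {index} has rows of unequal length")
        for name in taxa:
            parts[name].append(msa[name])
    # The super-MSA row of a sequence is its rows from all alignments, in order.
    return {name: "".join(chunks) for name, chunks in parts.items()}


# Two made-up alternative alignments of the same three toy sequences.
msa_a = {"s1": "AC-GT", "s2": "ACGGT", "s3": "A--GT"}
msa_b = {"s1": "ACG-T", "s2": "ACGGT", "s3": "A-G-T"}
super_msa = concatenate_msas([msa_a, msa_b])
```

The resulting super-MSA could then be written to a file and passed to a standard tree reconstruction program in place of a single alignment. Columns shared by many alternative alignments appear repeatedly and so carry more weight, while columns unique to one alternative still contribute.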