Tracing the phylogenetic distribution and, thus, the evolution of protein interaction networks across hundreds or even thousands of species calls for reliable and scalable methods for functional annotation transfer. Standard homolog or ortholog inference, which yields so-called ‘phylogenetic profiles’, does not suffice in many cases, because the functional similarity between evolutionarily related sequences decays over time. Here, we present HaMStR_OneSeq, a novel method that aids in the search for functionally equivalent proteins even over large evolutionary distances. The program integrates a targeted ortholog search with a subsequent assessment of the feature architecture similarity (FAS) between the proteins. Features comprise, among others, functional protein domains, secondary structure elements, transmembrane domains and low complexity regions. Specifically, orthologs are identified in an iterative procedure starting from a single gene of interest, the ‘seed protein’. Ortholog candidates are then weighted according to their pairwise FAS with the seed protein. Where annotations in an architecture overlap or are redundant, we obtain the highest-scoring linear paths through the feature graph using a greedy approach where applicable and, alternatively, an exhaustive or a heuristic approach. The resulting score of an identified ortholog then serves as a proxy for its functional equivalence to the respective seed protein. A dynamic visualization tool enables the user to explore the resulting ‘feature-aware’ phylogenetic profile.
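To illustrate how overlapping annotations can be resolved into a single linear path, the following minimal sketch selects the highest-weight set of mutually non-overlapping features by dynamic programming (weighted interval scheduling) and then compares two architectures. It is not the HaMStR_OneSeq implementation and does not reproduce the FAS score, which is not defined in this abstract. The `Feature` type, the per-feature weights, the feature names and the function `toy_architecture_similarity` are all assumptions made for illustration.

```python
from bisect import bisect_right
from typing import NamedTuple


class Feature(NamedTuple):
    name: str      # hypothetical feature label, e.g. a domain or "TM"
    start: int     # 1-based inclusive start position on the protein
    end: int       # 1-based inclusive end position
    weight: float  # hypothetical per-feature weight (an assumption)


def best_linear_path(features):
    """Return the highest-weight set of mutually non-overlapping features."""
    feats = sorted(features, key=lambda f: f.end)
    ends = [f.end for f in feats]
    # best[i] = (total weight, chosen features) using only the first i features
    best = [(0.0, [])]
    for i, f in enumerate(feats):
        # k = number of earlier features that end strictly before f starts
        k = bisect_right(ends, f.start - 1, 0, i)
        take = (best[k][0] + f.weight, best[k][1] + [f])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


def toy_architecture_similarity(seed, candidate):
    """Weighted fraction of the seed's path features found in the candidate's path.

    A deliberately simple stand-in for illustration only, not the FAS score.
    """
    seed_path = best_linear_path(seed)
    candidate_names = {f.name for f in best_linear_path(candidate)}
    total = sum(f.weight for f in seed_path)
    if total == 0:
        return 0.0
    shared = sum(f.weight for f in seed_path if f.name in candidate_names)
    return shared / total


# Hypothetical seed architecture: PF_A and PF_B overlap, so only one can be kept.
seed = [
    Feature("PF_A", 1, 100, 1.0),
    Feature("PF_B", 80, 150, 0.6),
    Feature("TM", 160, 180, 0.5),
]
# Hypothetical ortholog candidate that retains only PF_A.
candidate = [Feature("PF_A", 5, 95, 1.0)]

seed_path_names = [f.name for f in best_linear_path(seed)]
score = toy_architecture_similarity(seed, candidate)
```

In this toy case the seed path keeps `PF_A` and `TM` and drops the lower-weight, overlapping `PF_B`; the candidate recovers two thirds of the seed's path weight. The dynamic program is exact for this simplified formulation; a greedy or heuristic strategy would trade exactness for speed on very dense annotations.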
To demonstrate the application of HaMStR_OneSeq, we traced the DNA uptake machineries of five naturally competent bacteria in more than 1,000 species. The aim was to shed light on the distribution and evolution of natural competence in the bacterial domain. The high-confidence prediction of hitherto unknown naturally competent bacteria indicated that the capability of direct DNA uptake is far more common among bacteria than acknowledged to date.