The Ensembl and Ensembl Genomes projects create and distribute genome annotations for a wide range of genomes, including model organisms. The number of publicly available genomes is increasing rapidly, providing an opportunity for new insights via comparative genomics. TreeFam produces phylogenetic trees and orthology predictions, though previously only for metazoans. Here we describe advances in TreeFam, targeted to achieve scalability to all Ensembl eukaryotes.
The key component is a new library of hidden Markov model (HMM) profiles built from Panther and TreeFam, with custom profiles added to fill gaps in gene coverage. The library represents gene families across all eukaryotes.
We have designed a new workflow that uses this library to classify protein sequences from thousands of genomes into families in a quick and robust manner. The workflow’s full-build mode generates phylogenetic trees and orthologies anew across all species. The faster ‘update’ mode inserts data from new species or new gene annotations into the existing phylogenetic trees and orthologies.
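The classification and update logic described above can be illustrated with a minimal Python sketch. It is not the TreeFam implementation: it assumes that profile search results are already available as (sequence, family, score) tuples, and the function names, the `min_score` threshold, and the example identifiers are all hypothetical. Each sequence is assigned to its best-scoring family, and only families that gain a new member are flagged for recomputation.

```python
from collections import defaultdict


def assign_families(hits, min_score=0.0):
    """Assign each sequence to its single best-scoring family.

    hits: iterable of (sequence_id, family_id, score) tuples, assumed to
    come from a profile search run beforehand (hypothetical input format).
    """
    best = {}
    for seq_id, family_id, score in hits:
        if score < min_score:
            continue  # ignore hits below the (hypothetical) threshold
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (family_id, score)
    return {seq_id: family_id for seq_id, (family_id, _) in best.items()}


def families_to_update(existing, assignments):
    """Return {family_id: new_members} for families that gained sequences.

    existing: {family_id: set of sequence_ids already in that family}.
    """
    touched = defaultdict(set)
    for seq_id, family_id in assignments.items():
        if seq_id not in existing.get(family_id, set()):
            touched[family_id].add(seq_id)
    return dict(touched)


# Illustrative data only; identifiers and scores are made up.
existing = {"FAM1": {"geneA", "geneB"}, "FAM2": {"geneC"}}
hits = [
    ("geneD", "FAM1", 120.5),
    ("geneD", "FAM2", 40.0),
    ("geneA", "FAM1", 300.0),
]
assignments = assign_families(hits)
to_update = families_to_update(existing, assignments)
```

In this toy case only FAM1 would be recomputed, because geneD is new to it, while FAM2 is left untouched.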
The first step in this new workflow is to match incoming protein sequences to our library of gene families. For each family, we then create a multiple sequence alignment, which is used to infer the best amino-acid replacement model and to reconstruct a phylogenetic tree. Each phylogenetic tree is reconciled with a species tree to infer homology relationships that are consistent with the speciation and duplication events reported by the reconciliation. In update mode, only those alignments with new protein data are recomputed. The whole workflow is fully automated using eHive, our standard pipeline management system.
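To give a flavour of how speciation and duplication events can be labelled on a gene tree, the following Python sketch uses a simple species-overlap heuristic. This is a swapped-in simplification, not full reconciliation against a species tree and not the method used in the workflow. The tree encoding (nested tuples with leaves written as "species:gene") and the function name are assumptions made for illustration.

```python
def label_events(node, events=None):
    """Label internal nodes of a binary gene tree in post-order.

    A node is called a duplication if its two child subtrees share at
    least one species, and a speciation otherwise (species-overlap
    heuristic). Leaves are strings of the form "species:gene".
    Returns (species_set, events), where events is a list of
    (sorted species under the node, label) tuples.
    """
    if events is None:
        events = []
    if isinstance(node, str):
        species = node.split(":", 1)[0]
        return {species}, events
    left, right = node
    left_species, _ = label_events(left, events)
    right_species, _ = label_events(right, events)
    label = "duplication" if left_species & right_species else "speciation"
    all_species = left_species | right_species
    events.append((sorted(all_species), label))
    return all_species, events


# Illustrative tree: a duplication predating the human/mouse split.
tree = (("human:G1", "mouse:G1"), ("human:G2", "mouse:G2"))
species, events = label_events(tree)
```

Under this heuristic, gene pairs whose last common ancestor is a speciation node would be called orthologues, and pairs separated by a duplication node would be called paralogues.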
Gene families produced by TreeFam’s new workflow were released for vertebrates in Ensembl 84 (March 2016). The results can be viewed on our website at www.ensembl.org.