Poster Presentation Society for Molecular Biology and Evolution Conference 2016

Comprehensive Annotation of Multigenic Protein-Family Structures (CAMPS) (#564)

David Clarke, Lars S Jermiin 1, Charles Robin 2
  1. CSIRO, Canberra, ACT, Australia
  2. The University of Melbourne, Parkville, VIC, Australia

The automated methods currently used for genomic annotation of gene structure often fall short of the accuracy needed for analysis of multigenic protein families within and between species. These methods are constrained by a poor ability to distinguish how stretches of coding sequence relate to each other and by an over-reliance on a single, inevitably flawed genome assembly. To increase the accuracy of gene structural annotation, the CAMPS annotation methods focus on a single protein family at a time. In this context, ‘Comprehensive’ refers to incorporating multiple lines of evidence from the target species, including multiple genome assemblies and different sources of transcriptome data.

Genomic loci for the protein family in question are initially identified through sequence similarity with previously identified proteins from the target species or other species.  Additional evidence is taken from the transcriptomes of the target species and other species.  CAMPS clusters the identified loci and transcripts into ‘campsites’ through sequence similarity and genomic position.  A campsite is divided into ‘tents’, which include variants of suspected genes.  Annotation of the suspected gene is performed using all data within the tent.  Sequence variants within the tent are identified and then classified.
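
The positional half of the campsite clustering step can be illustrated with a minimal sketch. This is not the CAMPS implementation: the `Locus` record, the `group_into_campsites` function and the `max_gap` threshold are hypothetical names chosen for illustration, and the sketch assumes a single assembly and groups by genomic position only, leaving out sequence similarity and the subdivision into tents.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Locus:
    """A candidate hit on one assembly (hypothetical record layout)."""
    scaffold: str
    start: int
    end: int
    source: str  # e.g. "protein_hit" or "transcript"


def group_into_campsites(loci, max_gap=1000):
    """Group loci on the same scaffold that lie within max_gap bases of each other.

    Assumption: position alone defines a group here; the abstract also
    uses sequence similarity, which this sketch does not model.
    """
    campsites = []
    ordered = sorted(loci, key=lambda l: (l.scaffold, l.start, l.end))
    for _, on_scaffold in groupby(ordered, key=lambda l: l.scaffold):
        current, current_end = [], None
        for locus in on_scaffold:
            # Start a new group when the gap to the running group is too large.
            if current and locus.start - current_end > max_gap:
                campsites.append(current)
                current, current_end = [], None
            current.append(locus)
            current_end = locus.end if current_end is None else max(current_end, locus.end)
        if current:
            campsites.append(current)
    return campsites


loci = [
    Locus("chr1", 100, 500, "protein_hit"),
    Locus("chr1", 450, 900, "transcript"),
    Locus("chr1", 5000, 5600, "protein_hit"),
    Locus("chr2", 100, 400, "transcript"),
]
campsites = group_into_campsites(loci, max_gap=1000)
```

In this toy input the two overlapping chr1 records fall into one group, while the distant chr1 record and the chr2 record each form their own. A fuller version would need to reconcile coordinates across multiple assemblies, which have no shared coordinate system.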

These methods are designed to reduce the concatenation and splitting of genes; to identify partial and duplicate gene sequences present within the dataset; and to compare data across the available genome assemblies and transcriptomes to identify where assembly errors have caused problems. Accuracy estimates are generated for individual genes from the evidence used to derive the gene models. CAMPS is also designed to present multiple sequence variants of genes when they are found in the dataset.