Poster Presentation Society for Molecular Biology and Evolution Conference 2016

Sequence uniqueness determines the accuracy of isoform resolvability from short-read RNA-seq data (#454)

Jeremy RB Newman 1, Ana Conesa 2,3, Lauren M McIntyre 1,2
  1. Department of Molecular Genetics and Microbiology, University of Florida, Gainesville, Florida, United States
  2. Genetics Institute, University of Florida, Gainesville, Florida, United States
  3. Department of Microbiology and Cell Science, Institute for Food and Agricultural Sciences, University of Florida, Gainesville, Florida, United States

There is a growing recognition of isoform-specific effects in biological processes and disease. Algorithms have been developed to estimate isoform abundance from short-read RNA sequencing data. However, their performance varies widely, implying that current algorithms cannot resolve the uncertainty inherent in isoform reconstruction. The use of full-length transcripts obtained through PacBio sequencing can reduce uncertainty about the reference transcriptome, although this is prohibitively expensive as a high-throughput option. We sought to leverage evidence provided by long-read references to determine the exact resolution of isoform-level information retrievable from short-read sequencing. We developed an algorithm to identify the fragments within a gene that are unique to an isoform or common to multiple isoforms, and used this to quantify the uncertainty in reads assigned to specific isoforms. We tested this algorithm by using short-read sequencing from mouse to determine which isoforms are resolvable and which are indistinguishable, and compared the results to PacBio sequencing obtained from the same samples. We tested whether isoform resolvability was influenced by read length and transcriptome complexity by simulating reads from mouse, human and Drosophila transcripts. Transcripts with highly similar sequences are not readily distinguishable, and sequence uniqueness is dependent on read length and transcriptome complexity, with longer reads and fewer isoforms per gene both increasing sequence uniqueness and isoform resolvability. We conclude that the resolvability of isoforms from short-read RNA-seq data is highly dependent on the identification of sequence uniqueness, and that the transcriptome-wide resolution of isoforms is not possible from short-read data alone.
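The idea of labelling fragments as unique to one isoform or common to several can be illustrated with a minimal Python sketch. This is not the authors' algorithm: it assumes isoforms are given as lists of half-open exon intervals on a shared gene coordinate, and it ignores splice junctions and read length, both of which would matter in practice. The function names `classify_fragments` and `unique_length` and the toy gene are hypothetical.

```python
from collections import defaultdict


def classify_fragments(isoforms):
    """Split a gene into disjoint fragments and record which isoforms contain each.

    isoforms: dict mapping isoform name -> list of (start, end) half-open exon intervals.
    Returns a dict mapping (start, end) fragment -> frozenset of isoform names.
    """
    # Every exon boundary in the gene is a potential fragment edge.
    boundaries = sorted(
        {pos for exons in isoforms.values() for start, end in exons for pos in (start, end)}
    )
    fragments = {}
    for left, right in zip(boundaries, boundaries[1:]):
        # An isoform contains the fragment if one of its exons spans it entirely.
        members = frozenset(
            name
            for name, exons in isoforms.items()
            if any(start <= left and right <= end for start, end in exons)
        )
        if members:  # skip purely intronic stretches
            fragments[(left, right)] = members
    return fragments


def unique_length(fragments):
    """Total length of fragments found in exactly one isoform, per isoform."""
    totals = defaultdict(int)
    for (left, right), members in fragments.items():
        if len(members) == 1:
            (name,) = members
            totals[name] += right - left
    return dict(totals)


# Hypothetical toy gene with three isoforms (assumed coordinates, not real data).
toy_gene = {
    "A": [(0, 100), (200, 300)],
    "B": [(0, 100), (250, 300)],
    "C": [(0, 100), (400, 450)],
}
toy_fragments = classify_fragments(toy_gene)
print(unique_length(toy_fragments))  # {'A': 50, 'C': 50}
```

In this toy gene, isoform B has no fragment of its own: every base it covers is shared with A, so reads from B cannot be assigned to it on the basis of unique exonic sequence alone. That is the kind of case the abstract describes as indistinguishable.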