RNA has been argued to provide a window back to the earliest stages in the evolution of life on Earth. However, fewer than 1% of known RNA families are conserved across all three domains of life. For RNAs found only in a single domain, the picture is not much better: only 3.5% of RNA families show a distribution consistent with tracing back to the ancestor of the domain in which they are found. I will discuss two possible explanations for this result:
1. That most RNAs are evolutionarily young, the result of ongoing de novo emergence of small RNAs from genomic noise.
2. That many more RNAs are old, but better data are needed to find them.
We have found that current data are insufficient to distinguish these two possibilities. Sequence conservation of RNA genes drops off far more quickly than that of protein-coding genes, and existing covariance-based search strategies therefore perform poorly on the skewed distribution of public genomic data, which is dominated by the genomes of humans and their pathogens. We find there is a ‘Goldilocks Zone’ for comparative analysis of RNAs: comparisons between genomes that are neither too similar nor too distant yield rich information on noncoding RNAs and are therefore optimal for identifying RNA genes. Unfortunately, almost no transcriptomics data collected to date sit within the Goldilocks Zone, meaning we cannot gauge the age of most RNA genes. Moreover, we cannot detect them in the most studied lineages, as sampling is too narrow.
Although we now know the data are poor, I will nevertheless put a stake in the ground and present our current thoughts on how RNA genes originate and evolve.