One of the characteristic features of ancient DNA is its degradation into short fragments. Advances in ancient DNA extraction and library preparation methods now make it possible to retrieve extremely short fragments, in principle improving access to highly degraded, ancient material. However, ultra-short sequences remain a challenge for analysis since it is not trivial to distinguish endogenous sequences from microbial contaminants, which typically constitute the vast majority of sequences recovered from ancient fossils.
To explore the utility of ultra-short sequences, we developed a method to estimate the proportion of spurious alignments to the human reference genome (hg19) and applied it to Neandertal samples of various ages with different proportions of endogenous DNA. The method is based on modifying the hg19 genome at random sites in non-repetitive, mappable regions. Sequence alignments overlapping mutated sites can then be classified as spurious or authentic based on whether they carry the mutant or the non-mutant state.
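The classification step described above can be illustrated with a minimal sketch. It assumes that an alignment carrying the mutant base is counted as spurious and one carrying the original (non-mutant) base as authentic, that alignments showing any other base are set aside, and that the proportion is a plain ratio of the two counts. The function names and the input layout are hypothetical and chosen only for illustration; the estimator actually used in the study may differ.

```python
from collections import Counter


def classify_alignment(read_base, original_base, mutant_base):
    """Label one alignment at a mutated reference site.

    Assumption (not confirmed by the text): mutant state -> spurious,
    original state -> authentic, any other base -> 'other'.
    """
    read_base = read_base.upper()
    if read_base == mutant_base.upper():
        return "spurious"
    if read_base == original_base.upper():
        return "authentic"
    return "other"


def spurious_proportion(observations):
    """Naive estimate: spurious / (spurious + authentic).

    `observations` is an iterable of (read_base, original_base, mutant_base)
    tuples, one per alignment overlapping a mutated site.
    Returns None when no informative alignment is available.
    """
    counts = Counter(classify_alignment(*obs) for obs in observations)
    informative = counts["spurious"] + counts["authentic"]
    if informative == 0:
        return None
    return counts["spurious"] / informative


# Toy input: the original base is A, the introduced mutant base is G.
example = [("A", "A", "G"), ("G", "A", "G"), ("A", "A", "G"), ("T", "A", "G")]
proportion = spurious_proportion(example)
```

In this toy input, one alignment carries the mutant base, two carry the original base, and one carries a third base that is excluded from the ratio.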
The proportion of spurious alignments decreases with increasing read length, depends on the relative abundance of microbial contaminant sequences, and is reduced by using only sequences with terminal C-to-T substitutions (i.e., showing evidence of deamination-induced base damage). Using this and other filters, we define lower size cut-offs between 25 and 31 base pairs (bp), depending on the specimen, while limiting the fraction of spurious alignments to less than 10%. When using only these short sequences in phylogenetic analyses, we observe no significant difference compared to using sequences of at least 35 bp, which is the size cut-off used in previous studies of archaic human DNA. By including shorter sequences, we considerably increase the amount of sequence information that can be recovered from highly degraded DNA. Our method may help to make samples available for genetic analyses that previously yielded too few or no informative DNA sequences.
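As a rough illustration of the deamination and length filters mentioned above, the sketch below flags a sequence when the reference shows C and the sequence shows T at a terminal position. It assumes an ungapped alignment with both strings given in the same orientation and of equal length; the parameter `n_terminal` and the function names are hypothetical, and the default length cut-off of 25 bp is only the lower end of the range reported here, not a recommended setting.

```python
def has_terminal_c_to_t(read_seq, ref_seq, n_terminal=1):
    """Return True if a C (reference) to T (read) difference occurs within
    the first or last `n_terminal` positions of an ungapped alignment."""
    if len(read_seq) != len(ref_seq):
        raise ValueError("read and reference segments must have equal length")
    read_seq, ref_seq = read_seq.upper(), ref_seq.upper()
    n = min(n_terminal, len(read_seq))
    # Indices of the terminal positions at both ends of the sequence.
    ends = list(range(n)) + list(range(len(read_seq) - n, len(read_seq)))
    return any(ref_seq[i] == "C" and read_seq[i] == "T" for i in ends)


def passes_filters(read_seq, ref_seq, min_length=25, n_terminal=1):
    """Keep a sequence only if it meets the length cut-off and shows a
    terminal C-to-T substitution."""
    return len(read_seq) >= min_length and has_terminal_c_to_t(
        read_seq, ref_seq, n_terminal
    )


# Toy 25 bp sequence with a C-to-T difference at the first position.
toy_ref = "C" + "A" * 24
toy_read = "T" + "A" * 24
kept = passes_filters(toy_read, toy_ref, min_length=25)
```

A real pipeline would read the reference state from the alignment records and handle strand orientation and indels, which this sketch deliberately leaves out.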