Poster Presentation Society for Molecular Biology and Evolution Conference 2016

Haplotype reconstruction from Short Read Sequences using Vector Quantization (#549)

Louis Ranjard 1, Allen Rodrigo 1
  1. The Australian National University, Canberra, ACT, Australia

Uncovering the genetic diversity in an unlabelled mixed sample of individuals is a challenging computational problem, yet it can be of primary importance. For example, the true set of viral haplotypes in an infected host is correlated with clinical outcome and pathogenesis. We present a vector quantization approach to reconstruct the set of haplotypes from a mixed sample of short read sequences when the number of haplotypes is unknown and the sequencing reads are unlabelled. Our method maps the high-dimensional short read sequence space to a lower-dimensional space representing the reconstructed haplotypes. We propose to encode each position in a nucleotide sequence as a vector in which each element represents the contribution of one nucleotide base. An artificial neural network then reconstructs the haplotype sequences using competitive learning. The network is tree-shaped and can therefore be regarded as a short read classification tree. During the learning process, bifurcating branches are added to the tree according to a dispersion criterion. Preliminary results show that (i) the true haplotype sequences can be reconstructed and (ii) the true number of haplotypes can be inferred from the size and structure of the classification tree. In particular, the distance between the short read sequences and the tree node weight matrices is informative of the true set of haplotypes in the sample under study.
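To make the encoding and the competitive learning step concrete, the following is a minimal sketch in Python and not the method presented on the poster. It assumes a one-hot encoding over A, C, G and T, a squared Euclidean distance, a flat set of two prototypes instead of a growing tree, and already aligned reads of equal length. The function names (`encode`, `distance`, `decode`, `train`), the learning rate, the epoch count and the toy reads are all hypothetical choices made for illustration.

```python
BASES = "ACGT"

def encode(read):
    """Encode a read as one 4-element vector per position (assumed one-hot).
    Unknown bases such as N get equal weight on all four nucleotides."""
    vectors = []
    for base in read.upper():
        if base in BASES:
            vectors.append([1.0 if b == base else 0.0 for b in BASES])
        else:
            vectors.append([0.25] * 4)
    return vectors

def distance(x, w):
    """Squared Euclidean distance between an encoded read and a weight matrix."""
    return sum((xi - wi) ** 2 for xr, wr in zip(x, w) for xi, wi in zip(xr, wr))

def decode(w):
    """Read off a haplotype as the highest-weight base at each position."""
    return "".join(BASES[max(range(4), key=lambda k: row[k])] for row in w)

def train(reads, prototypes, rate=0.2, epochs=20):
    """Winner-take-all competitive learning: for each read, only the closest
    prototype is moved a fraction `rate` of the way towards that read."""
    encoded = [encode(r) for r in reads]
    for _ in range(epochs):
        for x in encoded:
            winner = min(prototypes, key=lambda w: distance(x, w))
            for w_row, x_row in zip(winner, x):
                for k in range(4):
                    w_row[k] += rate * (x_row[k] - w_row[k])
    return prototypes

# Toy data (invented): reads drawn from two haplotypes, one read with an error.
reads = ["ACGTAC", "ACTTAG", "ACGAAC", "ACTTAG", "ACGTAC", "ACTTAG", "ACGTAC"]

# Assumption: prototypes are seeded from the first two reads. The number of
# prototypes is fixed here, whereas the abstract describes a tree that grows.
prototypes = train(reads, [encode(reads[0]), encode(reads[1])])
haplotypes = [decode(w) for w in prototypes]
print(haplotypes)
```

In this sketch the erroneous read only shifts a fraction of the weight at one position of its winning prototype, so the decoded consensus is still driven by the majority of reads assigned to that prototype. The residual distance between each read and its winning weight matrix is the kind of quantity a dispersion criterion could monitor when deciding whether to split a node.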