Modern techniques in population genomics generate unprecedented quantities of data within
which complex genetic histories reside. The scale and complexity of these data require
the development of new approaches to the analysis of genetic data. We present the k-mer
Weighted Inner Product, a de novo, alignment-free measure of genetic similarity between
samples in a population. kWIP is an efficient tool implementing this metric that can determine
the genetic relatedness between samples without alignment or assembly. We show kWIP can
reconstruct the true relatedness between samples directly from sequencing reads generated
with various modern sequencing platforms.
kWIP works by decomposing sequencing reads into short k-mers, hashing these k-mers using
a constant-memory data structure, and performing pairwise distance calculation between
these sample k-mer hashes. The power of kWIP comes from the weighting applied across
different hash values, which decreases the effect of erroneous, rare, or over-abundant k-mers
while focusing on k-mers that give the most insight into the similarity of samples. We
use simulation studies to quantify the increased accuracy of this weighting over existing
unweighted distance metrics. kWIP is free, open source software implemented in C++ and
released under the GNU LGPL v3.
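
As a rough illustration of the pipeline described above (k-mer decomposition, hashing into a
fixed-size structure, and a weighted pairwise comparison), the following Python sketch may be
helpful. It is not kWIP's implementation, which is in C++. The CRC32 hash, the single
fixed-size count array, the entropy-based weight on the fraction of samples in which a bin is
occupied, the normalisation to a distance, and all function names are assumptions chosen for
illustration; the text above does not specify the weighting function.

```python
import math
import zlib


def kmer_counts(reads, k, n_bins):
    """Count the k-mers of all reads in a fixed-size array of hash bins.

    Memory use depends only on n_bins, not on the number of distinct k-mers.
    Distinct k-mers may collide in one bin (hypothetical, simplified sketch).
    """
    counts = [0] * n_bins
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            counts[zlib.crc32(kmer.encode()) % n_bins] += 1
    return counts


def entropy_weights(sketches):
    """Assumed weighting: Shannon entropy of each bin's presence frequency.

    Bins occupied in every sample (or in none) carry no information about
    differences between samples and receive weight 0.
    """
    n = len(sketches)
    weights = []
    for bin_counts in zip(*sketches):
        f = sum(1 for c in bin_counts if c > 0) / n
        if f == 0.0 or f == 1.0:
            weights.append(0.0)
        else:
            weights.append(-(f * math.log2(f) + (1 - f) * math.log2(1 - f)))
    return weights


def weighted_inner_product(a, b, weights):
    """Inner product of two count arrays, weighted bin by bin."""
    return sum(w * x * y for w, x, y in zip(weights, a, b))


def wip_distance(a, b, weights):
    """Turn the weighted inner product into a distance between two samples."""
    kab = weighted_inner_product(a, b, weights)
    kaa = weighted_inner_product(a, a, weights)
    kbb = weighted_inner_product(b, b, weights)
    if kaa == 0 or kbb == 0:
        similarity = 0.0  # no informative bins: treat as maximally distant
    else:
        similarity = kab / math.sqrt(kaa * kbb)
    return math.sqrt(max(0.0, 2.0 - 2.0 * similarity))


# Toy usage: samples A and B are identical, C shares no k-mers with them.
samples = {
    "A": ["ACGTACGTAC"],
    "B": ["ACGTACGTAC"],
    "C": ["TTTTGGGGCC"],
}
sketches = {name: kmer_counts(reads, k=4, n_bins=1 << 16)
            for name, reads in samples.items()}
weights = entropy_weights(list(sketches.values()))
dist_ab = wip_distance(sketches["A"], sketches["B"], weights)
dist_ac = wip_distance(sketches["A"], sketches["C"], weights)
```

In this toy example the weights are computed once over all samples and then reused for every
pairwise comparison, so a bin's influence reflects how it is distributed across the whole
population rather than within a single pair. With only three samples the weighting is crude;
the intent is only to show the shape of the computation.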