Skip to main content
Fig. 1 | Genome Biology

Fig. 1

From: Mash: fast genome and metagenome distance estimation using MinHash

Fig. 1

Overview of the MinHash bottom sketch strategy for estimating the Jaccard index. First, the sequences of two datasets are decomposed into their constituent k-mers (top, blue and red) and each k-mer is passed through a hash function h to obtain a 32- or 64-bit hash, depending on the input k-mer size. The resulting hash sets, A and B, contain |A| and |B| distinct hashes each (small circles). The Jaccard index is simply the fraction of shared hashes (purple) out of all distinct hashes in A and B. This can be approximated by considering a much smaller random sample from the union of A and B. MinHash sketches S(A) and S(B) of size s = 5 are shown for A and B, comprising the five smallest hash values for each (filled circles). Merging S(A) and S(B) to recover the five smallest hash values overall for AB (crossed circles) yields S(AB). Because S(AB) is a random sample of AB, the fraction of elements in S(AB) that are shared by both S(A) and S(B) is an unbiased estimate of J(A,B)

Back to article page