Show that 1 minus the Jaccard similarity is a distance measure between two data objects, x and y, that satisfies the metric axioms given on page 70. Specifically, d(x, y)=1 ? J(x, y).
What will be an ideal response?
1(a). Because J(x, y) ? 1, d(x, y) ? 0.
1(b). Because J(x, x)=1, d(x, x)=0
2. Because J(x, y) = J(y, x), d(x, y) = d(y, x)
3. (Proof due to Jeffrey Ullman)
minhash(x) is the index of first nonzero entry of x
prob(minhash(x) = k) is the probability tha minhash(x) = k when x is ran-
domly permuted.
Note that prob(minhash(x) = minhash(y)) = J(x, y) (minhash lemma)
Therefore, d(x, y)=1?prob(minhash(x) = minhash(y)) = prob(minhash(x) =
minhash(y))
We have to show that,
prob(minhash(x) = minhash(z)) ? prob(minhash(x) = minhash(y)) +
prob(minhash(y) = minhash(z)
However, note that whenever minhash(x) = minhash(z), then at least one of
minhash(x) = minhash(y) and minhash(y) = minhash(z) must be true.