MinHash and Document Similarity

Document Authorship

As we mentioned earlier, all previously known constructions of JL-embeddings required the multiplication of the input matrix with a dense, random matrix. Unfortunately, such general matrix multiplication can be very inefficient (and cumbersome) to carry out in a relational database.

Our constructions, on the other hand, replace the inner product operations of matrix multiplication with view selection and aggregation (addition). Using, say, distribution (2) of Theorem 1.1, the k new coordinates are generated by independently performing the following random experiment k times: throw away 2/3 of the original attributes at random; partition the remaining attributes randomly into two parts; for each part, produce a new attribute equal to the sum of all its attributes; take the difference of the two sum attributes.

Embedding n_samples points with pairwise-distance distortion at most eps requires n_components >= 4 log(n_samples) / (eps^2 / 2 - eps^3 / 3) dimensions.
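That random experiment might look like the following sketch for a single new coordinate (scaling constants are omitted, and the method name is made up for illustration):

    # A sketch of one new coordinate under the recipe above: keep each original
    # attribute with probability 1/3, then assign survivors at random to a
    # "plus" pile or a "minus" pile and take the difference of the two sums.
    def sparse_projection_coordinate(attributes, rng = Random.new)
      plus = 0.0
      minus = 0.0
      attributes.each do |value|
        next unless rng.rand(3).zero?    # throw away 2/3 of the attributes
        if rng.rand(2).zero?
          plus += value
        else
          minus += value
        end
      end
      plus - minus
    end
    # Repeat k times, with fresh randomness each time, to get the k new coordinates.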

Hashing

A strong-mixing hash function does two things: it produces uniform output, and it disperses nearby inputs (the "avalanche" effect). This mixing means that in the rare case two objects have a colliding hash, it is rarer still that they are also similar.

Suppose we say that two words are "close" when their edit distance is small (…)

Take a copy of the Merriam-Webster dictionary

  • Number every term in the dictionary from 1 to N. The digest of a term is its order-index (if it’s not present, use the prior word alphabetically). This has bad mixing and bad LSHness, but does completely fill its space.

  • Instead, look up the word and count, from the top of the left-hand page down each column, until you get to the word. This

    • does a better job of mixing,

    • has bad LSHness,

    • does not fill out its space: some pages have very few terms, some have many, so there will be a lot more "2"s and "7"s than "20"s and "70"s.

  • The Pearson Hash mixes each byte of the string through a fixed permutation table:

    # permuted is a fixed permutation of 0..255, e.g. (0..255).to_a.shuffle
    def pearson_hash(str, permuted)
      str.chars.inject(0) { |h, ch| permuted[h ^ (ch.ord & 0xFF)] }
    end
Note

For these purposes, one good hash function is as good as any other. If a library for any of the Murmur3, Jenkins or Spooky hashes is close at hand, use it; but you’re certain to have a built-in SHA1 or MD5 generator. The latter are not great, but they’re good enough. And when you see something ask for a "family of N digest functions", you don’t need to hunt up N different algorithms, just N different mixings. Thanks to the avalanche effect (a small perturbation of the input gives a large perturbation of the output), any trivial change will work: just tack that number onto the end of the string [1] and digest it:

`def hasher(nn) ; suffix = nn.to_s; ->(str){ your_favorite_hash_function(str+suffix) } ; end`
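For instance, with MD5 standing in for your_favorite_hash_function, a family of four digest functions might look like this sketch:

    require 'digest/md5'

    # MD5 stands in for your_favorite_hash_function in this sketch.
    def your_favorite_hash_function(str)
      Digest::MD5.hexdigest(str)
    end

    hashers = (1..4).map { |nn| hasher(nn) }
    hashers.map { |h| h.call("zephyr") }   # four unrelated-looking digests of the same word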

Features

Some features can be sensibly compared: air temperature and number of home runs are numeric and have a meaningful distance between values.

In contrast, categorical features such as a flight ID have no meaningful notion of distance (the R language refers to them as factors).

Similarity

  • There are a few features.

  • There are a dozen or a few dozen comparable features, and most of them have a value.

  • There are hundreds, thousands or millions of features, and most of them are zero or nearly zero:

    • in the bag-of-words representation of a document, every word in the entire corpus is a feature; the value might be a 1/0 indicating the word’s presence or absence; or a measure of how often the word is used in the document (count, document frequency, or tf/idf, of which more later)

    • a problem with dozens of categorical variables might unpack to a problem with hundreds of comparable variables: census data (occupation; immigrant status; homeowner status; religion).

Jaccard Similarity

In the case where there are tons of features, mere presence (just being significantly larger than zero) is often enough to judge things as similar (or potentially similar) in that aspect. If you’re backpacking in the Yukon and you meet another traveller, you’ll stop and share a drink: simply being a human being is similarity enough. Close to home, most people choose drinking buddies with similar politics, taste and so forth.

So the dumbest non-trivial way to evaluate "do they share the same topic" is "do they use the same words". That is, consider all words mentioned at least once in either document: what fraction of them appear in both? Since we’re now just talking about presence or absence of common terms, we can use the 'Jaccard Similarity' of sets to quantify their similarity:

    JacSim[ A, B ] =  (items in A and B) / (items in A or B or both)
                   =  ||A intersect B|| / ||A union B||
latexmath:[J(A,B) = \frac{|A \cap B|}{|A \cup B|}]
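A quick sketch in Ruby (the crude word-pattern tokenizer here is only for illustration):

    require 'set'

    # Jaccard similarity of the bag-of-words presence sets of two documents.
    def jaccard_similarity(doc_a, doc_b)
      a = Set.new(doc_a.downcase.scan(/\w+/))
      b = Set.new(doc_b.downcase.scan(/\w+/))
      (a & b).size.to_f / (a | b).size
    end

    jaccard_similarity("the quick brown fox", "the lazy brown dog")   # => 0.333...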

MinHash

MinHash is the tool to reach for when two things hold:

  1. It’s OK to compare features by presence/absence and not degree (that is, Jaccard similarity is appropriate).

  2. The features are very sparse: most of them are zero.

By "similar" you might mean "potentially similar" — you’ll often use minhash to identify potentially similar objects, followed a more robust similarity measure. If so, just pretend I’m saying "potentially similar".

Minhash uses one seemingly-blunt move to execute three data transformations at once. It’s like watching your Grandmom truss a turkey: what you thought she did in one yank was really a combined pull-stuff-twist-extend-tuck that will take you a half hour to recreate with guests coming over.

  • Digests the feature name to a compact value: features are represented by a single machine word and not, say, a many-bytes-long character string. This typically means faster comparisons, less memory usage, and less data sent to the reducer.

  • Truncating the digest 'folds' the feature space on itself, so that you can store aggregate information about features (eg counts) in a bounded number of buckets rather than an unknowable amount of RAM.

  • "Randomly" permutes the feature names: the secret to how it approximates the Jaccard similarity

There are two different hashings going on:

  • a strong-mixing hash (eg murmur or MD5) to "consistently randomize" the features

  • a locality-sensitive hash (eg modulo N) to fold the feature space

Process

  • m: number of hash digests used to prepare the metri-fied measure

  • l: number of LSH signatures;

  • k: number of digests used in each min-hash signature

  • Define beforehand m strongly-mixing digest functions as Murmur.digest(str + idx.to_s). This performs a consistent "randomization" of its inputs.

  • For each of the m functions,

    • apply it to every word in the document

    • choose the digest with the lowest value

This part estimates the Jaccard similarity; we’ve defined a metric on the features!
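Here’s a sketch of that process, with Digest::MD5 standing in for Murmur (the method names are just for illustration):

    require 'digest/md5'

    # MinHash signature: for each of m digest functions, keep the smallest
    # digest seen across all the words in the document.
    def minhash_signature(words, m)
      (0...m).map do |idx|
        words.map { |w| Digest::MD5.hexdigest(w + idx.to_s).to_i(16) }.min
      end
    end

    # The fraction of matching slots estimates the Jaccard similarity of the two word sets.
    def estimated_jaccard(sig_a, sig_b)
      sig_a.zip(sig_b).count { |a, b| a == b }.to_f / sig_a.size
    end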

Next, LSH:

Pr[ LSH(u) == LSH(v) ] == Pr[ MinHash(u) == MinHash(v) ] ** k

K sharpens the accuracy.

Even if the bucket for a term also collides with the term zephyr, you’re trusting that similarity is robust enough to shine through even without proper detection of zephyr.
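A sketch of the banding step, reusing the MinHash signatures from the sketch above (the MD5 bucketing is again just a stand-in):

    require 'digest/md5'

    # Group the signature into bands of k values; each band hashes to one bucket.
    def lsh_bands(signature, k)
      signature.each_slice(k).map { |band| Digest::MD5.hexdigest(band.join(',')) }
    end

    # Two documents become a candidate pair if any of their bands collide.
    def candidate_pair?(sig_a, sig_b, k)
      lsh_bands(sig_a, k).zip(lsh_bands(sig_b, k)).any? { |a, b| a == b }
    end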

Locality-sensitive hash

First, if two items are nearby, there is a high probability they hash to the same value

Pr[ h(x) = h(y) ]    >=    P_l         for x, y such that |x-y| <= R
"The probability        is above       for values within a
the hash functions      the chosen     radius R of each other
are the same            limit 'P_l'"

Second, if two items are far apart, there is a low probability they hash to the same value:

Pr[ h(x) = h(y) ]    <=    P_u         for x, y such that |x-y| >= c*R
"The probability        is *below*     for values *outside* a radius
the hash functions      the chosen     `c*R` of each other (a multiple
are the same            limit 'P_u'    of the 'nearbyness' radius)"

An example of an LSH family is bit sampling for Hamming distance: each hash function reads off one arbitrary bit index j (used, for example, in fingerprint identification).

let  J be the number of bits -- say, 32
let  R be the "near" radius  -- say,  2  (items can differ in 0, 1 or 2 bits)
let cR be the "far" radius -- say,    4  (c=2; items must hash separately if 5+ bits differ)
Then for P_l = 1 - ( R / J) = 1 - (2/32) = 15/16
         P_u = 1 - (cR / J) = 1 - (4/32) =  7/8
the hash family `h_j(x) = ((select the j'th bit of X))` is an LSH
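A sketch of that bit-sampling family (the helper names are made up for illustration):

    # Bit-sampling LSH for Hamming distance on num_bits-bit integers:
    # each member of the family just reads one randomly chosen bit.
    def bit_sampling_family(num_bits, count, rng = Random.new(1))
      Array.new(count) do
        j = rng.rand(num_bits)
        ->(x) { (x >> j) & 1 }
      end
    end

    hashes = bit_sampling_family(32, 4)
    hashes.map { |h| h.call(0b1011) }   # four single-bit digests of the same value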
Different similarity measures call for different LSH families:
  • Jaccard Similarity (highly-sparse feature space, or sets) — MinHash

  • Hamming Similarity — Hamming bits as LSH

  • Euclidean Similarity — random cutting planes (Johnson-Lindenstrauss)

  • Notes from "Algorithms for Modern Data Models" course by Ashish Goel

Count-Min-Sketch

f(x) = (0...n).sum { |i| x[i] * hsh[ digest[i] ] * dig_sgn[i] }

where dig_sgn is a hash function onto {+1, -1}.
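Here is a minimal Count-Min sketch along those lines, used for plain counting (the width, depth, and MD5 bucketing are illustrative choices, and the sign hash is left out):

    require 'digest/md5'

    # Count-Min sketch: depth rows of width counters; every item updates one counter per row.
    class CountMinSketch
      def initialize(width: 1024, depth: 4)
        @width = width
        @depth = depth
        @table = Array.new(depth) { Array.new(width, 0) }
      end

      def bucket(item, row)
        Digest::MD5.hexdigest("#{item}-#{row}").to_i(16) % @width
      end

      def add(item, count = 1)
        @depth.times { |row| @table[row][bucket(item, row)] += count }
      end

      # The estimate is the minimum across rows: collisions only inflate counts,
      # so the smallest counter is the tightest overestimate.
      def estimate(item)
        (0...@depth).map { |row| @table[row][bucket(item, row)] }.min
      end
    end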

Brute-force k-Nearest Neighbors

Even if the baseball data isn’t very big at 17,000 lines, brute-force comparison of each player to each other player requires nearly 150 million pairs. A walk in the park for Hadoop, but since it will give us something to compare against, let’s take that walk.
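The arithmetic behind that pair count:

    n = 17_000
    n * (n - 1) / 2   # => 144_491_500 distinct player pairs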

MinHash, the Game

So one monkey just blurts out words over and over, and as soon as a word turns up in either player’s hand, they show it.

You have a hand, I have a hand, and we want to know how similar they are. I take another deck, shuffle it, and deal out cards one at a time. Whenever either player holds the dealt card, they show it: if both have it, count it in the numerator and the denominator; if only one has it, count it in the denominator only.

This ends up approximating the Jaccard similarity.

Dimensionality Game

Another game, similar to but different from the previous one:

Dating pool

"everyone who is {

  • age, your

    age	    GPA	    Bowling-60
             *25     max 100
     25	    75

Dimensionality Reduction

If you’re taking the dot product, it’s OK to smush some of the dimensions together.

u1*u2 + v1*v2 + w1*w2 + x1*x2 + y1*y2 + z1*z2

(u1-v1)*(u2-v2) + (w1-x1)*(w2-x2) + (y1-z1)*(y2-z2)
  = u1*u2 + v1*v2 + w1*w2 + x1*x2 + y1*y2 + z1*z2
      - (u1*v2 + v1*u2 + w1*x2 + x1*w2 + y1*z2 + z1*y2)

But remember: if the dot product of two unit vectors is near one, most of their coordinates agree.
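A sketch of the smushing as a random-sign projection in the Johnson-Lindenstrauss spirit (the names are illustrative); dot products of the projected vectors approximate the originals:

    # Each output coordinate is a random +/-1 combination of the input coordinates;
    # dividing by sqrt(k) keeps dot products on roughly the same scale as the originals.
    def random_sign_matrix(d, k, rng = Random.new(42))
      Array.new(k) { Array.new(d) { rng.rand(2).zero? ? 1 : -1 } }
    end

    def project(vec, matrix)
      k = matrix.size
      matrix.map { |row| row.zip(vec).sum { |r, v| r * v } / Math.sqrt(k) }
    end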

Prob Production

Family of hash functions

A locality-sensitive hash function is useful for probabilistically determining whether two points are "close" to each other. However, we are limited by the probabilities: a probability of p1 = 0.5, for example, does not instill much confidence that the two points are close. If we have a family of such hash functions, though, then using probability theory we can construct functions that give us arbitrarily high probabilities.

Here’s how the construction goes: if f1, f2, …, fn are (d1, d2, p1, p2)-sensitive, then we can AND them together to construct a (d1, d2, p1^n, p2^n)-sensitive hash function; this has the effect of reducing the probability. If f1, f2, …, fn are (d1, d2, p1, p2)-sensitive, then we can OR them together to construct a (d1, d2, 1 - (1 - p1)^n, 1 - (1 - p2)^n)-sensitive hash function; this has the effect of increasing the probability. A combination of ANDs and ORs can get probabilities arbitrarily close to 1. The closer we want an LSH’s p1 to 1 (and p2 to 0), the more ANDs and ORs we need to perform.
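Putting the two together is exactly the banding trick from earlier: AND together the k MinHash values within a band, then OR across the l bands. A sketch of the resulting candidate-pair probability, where s is the Jaccard similarity:

    # With l bands of k MinHash values each, a pair with Jaccard similarity s
    # becomes a candidate with probability 1 - (1 - s**k)**l (the familiar S-curve).
    def candidate_probability(s, k, l)
      1 - (1 - s**k)**l
    end

    [0.2, 0.5, 0.8].map { |s| candidate_probability(s, 5, 20) }
    # => approximately [0.006, 0.47, 0.9996]: dissimilar pairs rarely become
    #    candidates, similar pairs almost always do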

Refs

Johnson-Lindenstrauss Transform:

Counting Streams (Count-Min-Sketch and friends):

Principal Component Analysis:

Exercises

The approach we use here can be a baseline for the practical art of authorship detection in legal discovery, where a


1. end, not front, due to a weakness in MD5