As we mentioned earlier, all previously known constructions of JL-embeddings required the multiplication of the input matrix with a dense, random matrix. Unfortunately, such general matrix multiplication can be very inefficient (and cumbersome) to carry out in a relational database.
Our constructions, on the other hand, replace the inner product operations of matrix multiplication with view selection and aggregation (addition). Using, say, distribution (2) of Theorem 1.1, the k new coordinates are generated by independently performing the following random experiment k times: throw away 2/3 of the original attributes at random; partition the remaining attributes randomly into two parts; for each part, produce a new attribute equal to the sum of all its attributes; take the difference of the two sum attributes.
Preserving pairwise distances to within a factor of (1 ± eps) requires n_components >= 4 log(n_samples) / (eps^2 / 2 - eps^3 / 3) dimensions.
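A minimal sketch in Ruby of the random experiment just described (the theorem's scaling constants, such as the 1/sqrt(k) factor, are left out, and the method name is ours):

```ruby
# Generate k new coordinates: for each one, every original attribute is added
# with probability 1/6, subtracted with probability 1/6, and thrown away with
# probability 2/3 -- the "discard 2/3, split the rest, take the difference" step.
def random_projection(row, k, rng = Random.new)
  (0...k).map do
    row.inject(0.0) do |acc, value|
      case rng.rand(6)
      when 0 then acc + value   # 1/6: goes into the first part
      when 1 then acc - value   # 1/6: goes into the second part
      else        acc           # 2/3: attribute thrown away
      end
    end
  end
end

random_projection([1.0, 0.0, 3.5, 2.0, 0.25], 4)   # => 4 new coordinates
```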
A strong-mixing hash function does two things: produces uniform output, and disperses nearby inputs (the "avalanche" effect). This mixing means that in the rare case two objects have a colliding hash, it is rarer still that they are also similar.
Suppose we say that two words are "close" when their edit distance is small (…)
Take a copy of the Merriam-Webster dictionary
- Number every term in the dictionary from 1 to N. The digest of a term is its order-index (if it's not present, use the prior word alphabetically). This has bad mixing and bad LSHness, but does completely fill its space.
- Instead, look up the word and count, from the top of the left-hand page, down each column, until you get to the word. This
  - does a better job of mixing,
  - has bad LSHness, and
  - does not fill out its space: some pages have very few terms, some have many, so there will be a lot more "2"s and "7"s than "20"s and "70"s.
- The Pearson Hash (a runnable sketch follows below): `h = 0; str.chars.each { |ch| index = h ^ ch.ord; h = permuted[index] }; h`
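A runnable version of that one-liner, assuming `permuted` is any fixed shuffling of the byte values 0..255 (seeded here so the results repeat):

```ruby
# Pearson hashing: XOR each byte into the running hash and look the result up
# in a fixed 256-entry permutation table. One byte out; good mixing, no locality.
PERMUTED = (0..255).to_a.shuffle(random: Random.new(1))

def pearson_hash(str)
  h = 0
  str.each_byte { |byte| h = PERMUTED[h ^ byte] }
  h
end

pearson_hash("similarity")   # => some value in 0..255
pearson_hash("similaritv")   # => usually lands far away, despite the tiny edit
```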
Note: For these purposes, one good hash function is as good as any other. If a library for any of the Murmur3, Jenkins or Spooky hashes is close at hand, use it; but you're certain to have a built-in SHA1 or MD5 generator. Those latter are not great, but they're good enough. And when you see something ask for a "family of N digest functions", you don't need to hunt up N different algorithms, just N different mixings. Due to the avalanche effect (a small perturbation of the input gives a large perturbation of the output), any trivial change will work: just tack that number onto the end of the string [1] and digest it: `def hasher(nn) ; suffix = nn.to_s ; ->(str){ your_favorite_hash_function(str + suffix) } ; end`
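For instance, here is that `hasher` idea made concrete with Ruby's built-in MD5 (the bucket size and feature name are just for illustration):

```ruby
require 'digest'

# A family of digest functions: suffix each function's index onto the input
# string before hashing with a built-in digest (MD5 here; Murmur3 would do too).
def hasher(nn)
  suffix = nn.to_s
  ->(str) { Digest::MD5.hexdigest(str + suffix).to_i(16) }
end

family = (0...4).map { |i| hasher(i) }
family.map { |h| h.call("home_runs") % 1000 }  # => four different "random" values
```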
that can be sensibly compared (air temperature and number of home runs are numeric and have a meaningful distance; flight ID or
In contrast, categorical features (the R language refers to them as factors) have no such natural distance.
- There are a few features.
- There are a dozen or a few dozen comparable features, and most of them have a value.
- There are hundreds, thousands or millions of features, and most of them are zero or nearly zero:
  - in the bag-of-words representation of a document, every word in the entire corpus is a feature; the value might be a 1/0 indicating the word's presence or absence, or a measure of how often the word is used in the document (count, document frequency, or tf-idf, of which more later)
  - a problem with dozens of categorical variables might unpack to a problem with hundreds of comparable variables: census data (occupation; immigrant status; homeowner status; religion).
In the case where there are tons of features, mere presence — just being significantly larger than zero — is often enough to judge things as similar (or potentially similar) in that aspect. If you're backpacking in the Yukon and you meet another traveller, you'll stop and share a drink: simply being a human being is similarity enough. Close to home, most people choose drinking buddies with similar politics, taste and so forth.
So the dumbest non-trivial way to evaluate "do they share the same topic" is "do they use the same words". That is, consider all words mentioned at least once in either document: how many of them appear in both? Since we're now just talking about presence or absence of common terms, we can use the 'Jaccard Similarity' of sets to quantify their similarity:
JacSim(A, B) = (items in both A and B) / (items in A or B or both) = |A ∩ B| / |A ∪ B|
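A tiny Ruby sketch of that formula, on made-up word lists:

```ruby
require 'set'

# Jaccard similarity on two bags of words: |A ∩ B| / |A ∪ B|.
def jaccard_similarity(doc_a, doc_b)
  a = Set.new(doc_a.downcase.split)
  b = Set.new(doc_b.downcase.split)
  union = a | b
  return 0.0 if union.empty?
  (a & b).size.to_f / union.size
end

jaccard_similarity("the cat sat on the mat", "the cat ate the rat")
# => 2/7 ≈ 0.286 ("the" and "cat" shared, out of 7 distinct words)
```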
- It's OK to compare features by presence/absence and not degree (that is, Jaccard similarity is appropriate).
- The features are very sparse: most of them are zero.
By "similar" you might mean "potentially similar" — you’ll often use minhash to identify potentially similar objects, followed a more robust similarity measure. If so, just pretend I’m saying "potentially similar".
Minhash uses one seemingly-blunt move to execute three data transformations at once. It's like watching your Grandmom truss a turkey: what you thought she did in one yank was really a combined pull-stuff-twist-extend-tuck that will take you a half hour to recreate with guests coming over.
- Digests the feature name to a compact value: features are represented by a single machine word, not, say, a many-bytes-long character string. This typically means faster comparisons, less memory usage, and less data sent to the reducer.
- Truncating the digest 'folds' the feature space on itself, so that you can store aggregate information about features (e.g. counts) in a bounded number of buckets rather than an unknowable amount of RAM.
- "Randomly" permutes the feature names: the secret to how it approximates the Jaccard similarity.
There are two different hashings going on:
- a strong-mixing hash (e.g. Murmur or MD5) to "consistently randomize" the features
- a locality-sensitive hash (e.g. modulo N) to fold the feature space (quick sketch below)
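A quick sketch of those two hashings working together, with MD5 standing in for Murmur and an arbitrary bucket count:

```ruby
require 'digest'

N_BUCKETS = 1024

# Strong-mixing hash to "consistently randomize" a feature name, then a
# modulo to fold the randomized value into a bounded number of buckets.
def bucket_for(feature_name)
  mixed = Digest::MD5.hexdigest(feature_name).to_i(16)   # strong mixing
  mixed % N_BUCKETS                                       # fold into N buckets
end

bucket_for("home_runs")   # => the same bucket in 0...1024 on every run
```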
-
- m: the number of hash digests used to prepare the metrified measure
- l: the number of LSH signatures
- k: the number of digests used in each min-hash signature
- Define beforehand m strongly-mixing digest functions as `Murmur.digest(str + idx.to_s)`. This performs a consistent "randomization" of its inputs.
- For each of the m functions:
  - apply it to every word in the document
  - choose the digest with the lowest value
- This part estimates the Jaccard similarity; we've defined a metric on the features! (A runnable sketch follows below.)
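A minimal sketch of that min-hash step, with MD5 plus an index suffix standing in for Murmur (the word lists are made up):

```ruby
require 'digest'

# m strongly-mixing digest functions: digest(word + idx). For each function,
# apply it to every word and keep the lowest value. The fraction of matching
# slots across two signatures estimates the Jaccard similarity of the word sets.
def minhash_signature(words, m)
  (0...m).map do |idx|
    words.map { |w| Digest::MD5.hexdigest(w + idx.to_s).to_i(16) }.min
  end
end

sig_a = minhash_signature(%w[the cat sat on the mat], 128)
sig_b = minhash_signature(%w[the cat ate the rat],    128)

matches  = sig_a.zip(sig_b).count { |a, b| a == b }
estimate = matches.to_f / sig_a.size   # ≈ Jaccard similarity of the two word sets
```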
Next, LSH:
Pr[ LSH(u) == LSH(v) ] == Pr[ MinHash(u) == MinHash(v) ] ** k
k sharpens the contrast: raising the match probability to the k-th power drives the collision rate for dissimilar pairs toward zero much faster than for similar pairs.
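For instance, with k = 4: a pair whose minhashes agree with probability 0.9 still collides with probability 0.9^4 ≈ 0.66, while a pair agreeing with probability 0.3 collides with probability only 0.3^4 ≈ 0.008.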
First, if two items are nearby, there is a high probability they hash to the same value:

Pr[ h(x) = h(y) ] >= P_l for x, y such that |x-y| <= R

"The probability that the hash values are the same is above the chosen limit 'P_l' for values within the radius R of each other."
Second, if two items are far apart, there is a low probability they hash to the same value:
Pr[ h(x) = h(y) ] <= P_u for x, y such that |x-y| >= c*R

"The probability that the hash values are the same is *below* the chosen limit 'P_u' for values *outside* a radius c*R of each other (a multiple of the 'nearbyness' radius)."
An example of an LSH is the Hamming distance using j arbitrary indices (example: fingerprint identification):
let J be the number of bits   -- say, 32
let R be the "near" radius    -- say, 2 (items can differ in 0, 1 or 2 bits)
let cR be the "far" radius    -- say, 4 (c=2; items must hash separately if 5+ bits differ)

Then for

P_l = 1 - (R / J)  = 1 - (2/32) = 15/16
P_u = 1 - (cR / J) = 1 - (4/32) = 7/8

the hash family `h_j(x) = ((select the j'th bit of x))` is an LSH.
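A sketch of that bit-sampling family in Ruby (J, x, and y are made-up values):

```ruby
J = 32

# h_j(x) selects the j'th bit of x; the family is one such function per bit position.
def h(j)
  ->(x) { (x >> j) & 1 }
end

family = (0...J).map { |j| h(j) }

x = 0b1011_0010
y = 0b1011_0110                              # differs from x in exactly one bit
family.count { |f| f.call(x) == f.call(y) }  # => 31: nearby items collide on most bits
```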
Which LSH family you can use depends on your similarity measure:
- Jaccard Similarity (highly-sparse feature space, or sets) — MinHash
- Hamming Similarity — Hamming bits as LSH
- Euclidean Similarity — random cutting planes (Johnson-Lindenstrauss)
Notes from the "Algorithms for Modern Data Models" course by Ashish Goel:
f(x) = (0...n).inject(0) { |acc, i| acc + x[i] * hsh[digest[i]] * dig_sgn[i] }

where dig_sgn is a hash function onto {+1, -1}.
Even if the baseball data isn’t very big at 17,000 lines, brute-force comparison of each player to each other player requires nearly 150 million pairs. A walk in the park for Hadoop, but since it will give us something to compare against, let’s take that walk.
So one monkey just blurts out words over and over, and as soon as a word is in either player's hand, they show it. In card terms: you have a hand, I have a hand, and we want to know how similar they are. I take another deck, shuffle it, and deal out cards; when either player has that card, they show it. If both have it, count it in the denominator and the numerator; if only one has it, count it in the denominator only. This ends up approximating the Jaccard Similarity.
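A throwaway simulation of that game (hand sizes and the number of cards dealt are arbitrary):

```ruby
# Deal from a fresh shuffled deck and tally "both players hold it" vs
# "either player holds it"; their ratio estimates the Jaccard similarity.
deck   = (1..52).to_a
hand_a = deck.sample(13)
hand_b = deck.sample(13)

numerator = denominator = 0
deck.shuffle.first(20).each do |card|   # deal only part of the deck: an estimate
  in_a = hand_a.include?(card)
  in_b = hand_b.include?(card)
  next unless in_a || in_b
  denominator += 1
  numerator   += 1 if in_a && in_b
end

estimate = denominator.zero? ? 0.0 : numerator.to_f / denominator
exact    = (hand_a & hand_b).size.to_f / (hand_a | hand_b).size
```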
Another game, similar to but different from the previous one: the dating pool. "Everyone who is { close to your age, close to your GPA, close to your bowling score }", with each attribute rescaled onto a comparable range.
If you’re taking the dot product, it’s OK to smush some of the dimensions together.
The full dot product over six dimensions:

u1*u2 + v1*v2 + w1*w2 + x1*x2 + y1*y2 + z1*z2

Smushing pairs of dimensions together gives

(u1-v1)*(u2-v2) + (w1-x1)*(w2-x2) + (y1-z1)*(y2-z2)

which expands to the full dot product minus the cross terms:

u1*u2 + v1*v2 + w1*w2 + x1*x2 + y1*y2 + z1*z2 - (u1*v2 + v1*u2 + w1*x2 + x1*w2 + y1*z2 + z1*y2)
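A quick numeric check of that identity, with made-up coordinate values:

```ruby
u1, v1, w1, x1, y1, z1 = 0.9, 0.1, 0.8, 0.2, 0.7, 0.3
u2, v2, w2, x2, y2, z2 = 0.8, 0.2, 0.9, 0.1, 0.6, 0.4

full    = u1*u2 + v1*v2 + w1*w2 + x1*x2 + y1*y2 + z1*z2
smushed = (u1-v1)*(u2-v2) + (w1-x1)*(w2-x2) + (y1-z1)*(y2-z2)
cross   = u1*v2 + v1*u2 + w1*x2 + x1*w2 + y1*z2 + z1*y2

(smushed - (full - cross)).abs < 1e-9   # => true
```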
But remember: if the dot product of two unit-length vectors is near one, most of the coordinates of the two vectors agree.
Family of hash functions
A locality sensitive hash function is useful in probabilistically determining if two points are "close" to each other or not. However, we are limited by the probabilities. E.g. a probability of p1 = 0.5 does not necessarily instill a reasonable confidence that the two points are close. However, if we have a family of such hash functions, then using probability theory, we can construct functions that give us arbitrarily high probabilities.
Here's how the construction goes: if f1, f2, …, fn are (d1, d2, p1, p2)-sensitive, then we can AND them together to construct a (d1, d2, p1^n, p2^n)-sensitive hash function; this has the effect of reducing both probabilities. If f1, f2, …, fn are (d1, d2, p1, p2)-sensitive, then we can OR them together to construct a (d1, d2, 1 - (1 - p1)^n, 1 - (1 - p2)^n)-sensitive hash function; this has the effect of increasing both probabilities. A combination of ANDs and ORs can get probabilities arbitrarily close to 1. The closer we want an LSH's p1 to 1 (and p2 to 0), the more ANDs and ORs we need to perform.
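For instance (numbers chosen purely for illustration): start with p1 = 0.8 and p2 = 0.2. AND-ing 4 functions gives p1 = 0.8^4 ≈ 0.41 and p2 = 0.2^4 = 0.0016; OR-ing 8 such AND-ed functions gives p1 = 1 - (1 - 0.41)^8 ≈ 0.985 and p2 = 1 - (1 - 0.0016)^8 ≈ 0.013. The gap between p1 and p2 widens dramatically.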
- Locality-Sensitive Hashing for Finding Nearest Neighbors by Malcolm Slaney and Michael Casey
- http://www.scribd.com/collections/2287653/Locality-Sensitive-Hashing
- http://infolab.stanford.edu/~ullman/mining/2009/similarity2.pdf
- http://infolab.stanford.edu/~ullman/mining/2009/similarity1.pdf
- http://metaoptimize.com/qa/questions/8930/basic-questions-about-locally-sensitive-hashinglsh
- Efficient Online Locality Sensitive Hashing via Reservoir Counting
- http://blog.smola.org/post/1130198570/hashing-for-collaborative-filtering
Johnson-Lindenstrauss Transform:
Counting Streams (Count-Min-Sketch and friends):
- Stream sampling for variance-optimal estimation of subset sums
- The split-apply-combine pattern in R
Principal Component Analysis:
- Principal component analysis Wikipedia entry