- Random sample:
  - fixed size of final sample
  - fixed probability (binomial) for each element
  - spatial sampling
  - with/without replacement
  - weighted
  - by interestingness
  - stratified: partition important features into bins, and sample tastefully to achieve a smooth selection across bins. Think of the density of phase space (see the sketch after this list).
  - consistent sample: the same sampling parameters on the same population will always return the same sample.
- Algorithms:
  - iterative
  - batch
  - scan
  - reservoir
- Graph:
  - sample to preserve connectivity
  - sample to preserve local structure
  - sample to preserve global representation
- Random variates:
  - Ziggurat algorithm: use a simple lookup table to accelerate generation of complex distributions
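As a small illustration of the stratified case above, here is a minimal sketch. It assumes tab-separated records whose first field is the feature to stratify on; the per-bin cap and the binning rule are made up for illustration, and everything is held in memory rather than streamed. Dense bins get thinned while sparse bins survive intact, and ordering by a digest (the trick described later in this section) keeps the selection consistent from run to run:

```ruby
require 'digest/md5'

PER_BIN_CAP = 100   # illustrative cap on how many records any one bin may contribute

# last 64 bits of the record's md5 hash, as an integer
def digest_i(record)
  Digest::MD5.digest(record.to_s).unpack('Q*').last
end

bins = Hash.new { |hsh, key| hsh[key] = [] }

# bin each record on its first (tab-separated) field
ARGF.each_line do |line|
  line    = line.chomp
  feature = line.split("\t").first
  bins[feature] << line
end

# within each bin, order by digest (an arbitrary but repeatable shuffle) and keep at most the cap
bins.each_value do |records|
  records.sort_by { |rec| digest_i(rec) }.first(PER_BIN_CAP).each { |rec| puts rec }
end
```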
We’re not going to worry about extracting samples larger than what fits on one reducer.
The simplest kind of sample is a uniform sample selecting a fraction p of the full dataset.
The naive approach is to generate a random number for each line and select the line if that number is less than the probability p. Don’t do this.
You want your job to be deterministic: in the large, so that it is predictable and debuggable; in the small, because a mapper may be re-tried if an attempt fails, or while doing speculative execution, and every attempt must emit the same records.
What we’ll do instead is use a standard digest function (for example, the MD5 hash or murmur hash). A digest function turns any key into a fixed-size number, with the important property that any small change in the input string results in an arbitrarily large change in the output number. It’s deterministic (the same input always gives the same output) but effectively washes out all information from the input string.
Then, rather than compare a random number against a fraction, we’ll turn the digest into an integer (by treating the lowest 64 bits as an integer) and compare it to that fraction of the largest 64-bit number.
<remark>confirm that it’s LSB not MSB we want</remark>
A [http://github.com/mrflip/wukong/blob/master/examples/sample_records.rb Ruby example] is available in the wukong examples:
```ruby
#
# Probabilistically emit some fraction of record/lines
#
# Set the sampling fraction at the command line using the
#   --sampling_fraction=
# option: for example, to take a random 1/1000th of the lines in huge_files,
#   ./examples/sample_records.rb --sampling_fraction=0.001 --go huge_files sampled_files
#
require 'wukong'
require 'digest/md5'

class Mapper < Wukong::Streamer::LineStreamer
  include Wukong::Streamer::Filter

  attr_reader :sampling_threshold

  def initialize(*)
    super
    get_sampling_threshold
  end

  # randomly decide to emit +sampling_fraction+ fraction of lines
  def emit? line
    digest_i(line) < sampling_threshold
  end

protected

  # Uses the sampling_fraction, a real value between 0 and 1 giving the fraction of lines to
  # emit. At sampling_fraction=1 all records are emitted, at 0 none are.
  #
  # @return [Integer] between 0 and MAX_DIGEST; values below the sampling_threshold should be emitted.
  def get_sampling_threshold
    if not options[:sampling_fraction]
      raise ArgumentError, "Please supply a --sampling_fraction -- a real value between 0 and 1"
    end
    @sampling_threshold = (Float(options[:sampling_fraction]) * MAX_DIGEST).to_i
  end

  # @return [Integer] the last 64 bits of the record's md5 hash
  def digest_i(record)
    Digest::MD5.digest(record.to_s).unpack('Q*').last
  end

  # One more than the largest possible digest int that digest_i will return.
  MAX_DIGEST = 2 ** 64
end

# Execute the script with nil reducer
Wukong::Script.new( Mapper, nil ).run
```
- See this Rapleaf blog post for why randomness is considered harmful.
Another, often faster, way of doing random sampling is to generate a geometrically-distributed (rather than uniformly-distributed) series of skip lengths: for each value R, your mapper skips R lines and emits the next one. See the section on statistics for how to generate a geometrically-distributed number.
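Here is a minimal sketch of the idea as a plain line filter (the helper name next_skip is made up for illustration, and it uses Ruby's rand, so unlike the digest trick above it is not deterministic across retries). With per-line keep probability p, the gap between kept lines is geometrically distributed, and can be drawn directly as floor( ln(u) / ln(1 - p) ) for u uniform in (0, 1]:

```ruby
# draw the number of lines to skip before the next emitted line
def next_skip(p)
  u = 1.0 - rand                            # rand is in [0,1); shift to (0,1] so the log stays finite
  (Math.log(u) / Math.log(1.0 - p)).floor
end

sampling_fraction = 0.001
skip = next_skip(sampling_fraction)

ARGF.each_line do |line|
  if skip.zero?
    puts line                               # emit this line...
    skip = next_skip(sampling_fraction)     # ...and draw the gap to the next one
  else
    skip -= 1                               # otherwise just count down
  end
end
```

The win is that you do one random draw per emitted line instead of one per input line, which matters when the sampling fraction is tiny.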
We want to generate a sample of fixed size N_s (say, 1000 arbitrary records) no matter how large or small the dataset. (Clearly, if the dataset is smaller than N_s, you will emit the full dataset.)
Suppose you assigned every record an arbitrary sample key, and sorted on that key. Choosing the first N_s records would be a fair way to get our sample. In fact, this is how most card games work: shuffle the records (cards) into an arbitrary order, and draw a fixed-size batch of cards from the collection.
But! of course, a total sort is very expensive. As you may guess, it’s unnecessary.
Each mapper creates a "reservoir" of size N_s for the rows it will select. Add each record to the reservoir; if there are then more than N_s occupants, reject the record with the highest sample index. (In practice, you won’t even add the record if it would be that highest record.) A Fibonacci heap (implementing a priority queue) makes this very efficient.
Ruby’s stdlib has a SortedSet class: a Set that guarantees that its elements are yielded in sorted order (according to the return values of their #<=> methods) when iterating over them.
Each mapper outputs the sampling index of each preserved row as the key, and the rest of the row as the value; the reducer then repeats the selection, keeping the N_s rows with the smallest sampling indexes. It’s essential that you keep the sampling index given by the first pass.
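Here is a minimal sketch of the mapper-side pass as a plain line filter, assuming the digest-based sampling index from the example above; the reservoir size N_S and tab-separated output are made up for illustration. It uses SortedSet (from the set stdlib in older Rubies; a separate sorted_set gem in Ruby 3+) rather than a Fibonacci heap, trading a little efficiency for brevity:

```ruby
require 'set'           # provides SortedSet (backed by RBTree when available)
require 'digest/md5'

N_S = 1000              # reservoir size: how many records each mapper keeps

# sampling index: last 64 bits of the record's md5 hash, as an integer
def digest_i(record)
  Digest::MD5.digest(record.to_s).unpack('Q*').last
end

# the reservoir holds [sampling_index, record] pairs, kept in index order
reservoir = SortedSet.new

ARGF.each_line do |line|
  line  = line.chomp
  index = digest_i(line)
  if reservoir.size < N_S
    reservoir << [index, line]
  elsif index < reservoir.max[0]
    # evict the current highest-index occupant in favor of this record;
    # records that would themselves be the highest are never added at all
    reservoir.delete(reservoir.max)
    reservoir << [index, line]
  end
end

# emit the sampling index as the key and the record as the value,
# so the reduce side can finish the selection
reservoir.each { |index, line| puts [index, line].join("\t") }
```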
To make a set of uniformly distributed, uncorrelated numbers for an n1-by-n2 matrix of numbers, hash each cell’s linear index and normalize:

```ruby
# digest_i and MAX_DIGEST as in the sampling example above
matrix = []
(0...n1).each do |ii|
  matrix[ii] = row = []
  (0...n2).each do |jj|
    # hash the cell's linear index, then normalize the 64-bit digest to a float in [0, 1)
    row[jj] = digest_i(ii * n2 + jj) / MAX_DIGEST.to_f
  end
end
```
- Random Sampling from Databases, Frank Olken, 1993
- RBTree for Ruby
- Stack Overflow: "How to pick random (small) data samples using Map/Reduce?" (answer by Bkkbrad)