Skip to content

Latest commit

 

History

History
142 lines (99 loc) · 6.04 KB

09b-sampling.asciidoc

File metadata and controls

142 lines (99 loc) · 6.04 KB

Sampling

  • Random sample:

    • fixed size of final sample

    • fixed probability (binomial) for each element

    • spatial sampling

    • with/without replacement

    • weighted

    • by interestingness

    • stratified: partition important features into bins, and sample tastefully to achieve a smooth selection across bins. Think of density of phase space

    • consistent sample: the same sampling parameters on the same population will always return the same sample.

  • Algorithms:

    • iterative

    • batch

    • scan

    • reservoir

  • graph:

    • sample to preserve connectivity

    • sample to preserve local structure

    • sample to preserve global representation

  • random variates

    • Ziggurat Algorithm to use a simple lookup table to accelerate generation of complex distributions

We’re not going to worry about extracting samples larger than fit on one reducer.

Consistent Random Sampling

The simplest kind of sample is a uniform sample selecting a fraction p of the full dataset.

The naive way take is to generate a random number and select each line if it is less than the probability p. Don’t do this.

You want your job to be deterministic. In the large, so that it is predictable and debuggable. in the small, a mapper may be re-tried if the attempt fails, or while doing speculative execution.

What we’ll do instead is use a standard digest function (for example, the MD5 hash or murmur hash). A digest function turns any key into a fixed-size number, with the important property that any small change in the input string results in an arbitrarily large change in the output number. It’s deterministic (the same input always gives the same output) but effectively washes out all information from the input string.

Then, rather than compare a random number against a fraction, we’ll turn the digest into an integer (by treating the lowest 64 bits as an integer) and compare it to that fraction of the largest 64-bit number.

<remark>confirm that it’s LSB not MSB we want</remark>

A [http://github.com/mrflip/wukong/blob/master/examples/sample_records.rb Ruby example] is available in the wukong examples:

#
# Probabilistically emit some fraction of record/lines
#
# Set the sampling fraction at the command line using the
#   --sampling_fraction=
# option: for example, to take a random 1/1000th of the lines in huge_files,
#  ./examples/sample_records.rb --sampling_fraction=0.001 --go huge_files sampled_files
#
class Mapper < Wukong::Streamer::LineStreamer
  include Wukong::Streamer::Filter
def intialize(*)
  super
  get_sampling_threshold
end
# randomly decide to emit +sampling_fraction+ fraction of lines
def emit? line
   digest_i < sampling_threshold
end
protected
# Uses the sampling_fraction, a real value between 0 and 1 giving the fraction of lines to
# emit.  at sampling_fraction=1 all records are emitted, at 0 none are.
#
# @return [Integer] between 0 and MAX_DIGEST; values below the sampling_threshold should be emitted.
def get_sampling_threshold
         if not options[:sampling_fraction] then raise ArgumentError, "Please supply a --sampling_fraction -- a real value between 0 and 1" ; end
  @sampling_threshold = (Float(options[:sampling_fraction]) * MAX_DIGEST).to_i
end
  # @return [Integer] the last 64 bits of the record's md5 hash
  def digest_i(record)
    Digest::MD5.digest(record.to_s).unpack('Q*').last
  end
  # One more than the largest possible digest int that digest_i will return.
  MAX_DIGEST = 2 ** 64
end
# Execute the script with nil reducer
Script.new( Mapper, nil ).run

Random Sampling using strides

Another, often faster, way of doing random sampling is to generate a geometrically-distributed (rather than uniformly-distributed) sampling series For each value R, Your mapper skips R lines and

see section on statistics for how to get a geometrically-distributed number.

Constant-Memory "Reservoir" Sampling

Want to generate a sample of fixed size N_s — say, 1000 arbitrary records — no matter how large or small the dataset. (Clearly if it is smaller than N_s, you will emit the full dataset).

Suppose you assigned every record an arbitrary sample key, and sorted on that key. Choosing the first N_s records would be a fair way to get our sample. In fact, this is how most card games work: shuffle the records (cards) into an arbitrary order, and draw a fixed-size batch of cards from the collection.

But! of course, a total sort is very expensive. As you may guess, it’s unnecessary.

Each mapper creates a "reservoir", of size N_s, for the rows it will select. Add each record to the reservoir, and if there are more than N_s occupants, reject the record with highest sample index. (in practice, you won’t even add the record if it would be that highest record). A Fibonacci heap (implementing a priority queue) makes this very efficient

Ruby’s stdlib has a SortedSet class — a Set that guarantees that it’s element are yielded in sorted order (according to the return values of their #<⇒ methods) when iterating over them.

Each mapper outputs the sampling index of each preserved row as the key, and the rest of the row as the value;

It’s essential that you keep the sampling index given by the first pass.

Sampling Distributions

To make a set of uniformly distributed uncorrelated numbers,

For an n1 by n2 matrix of numbers,

matrix = []
n_elems = n1 * n2
(0..n1).each do |ii|
  matrix[ii] = row = []
  (0..n2).each do |jj|
    row[jj] = digest( ii*n2 + jj ) / n_elems
  end
end