Skip to content

Latest commit

 

History

History
133 lines (99 loc) · 6.23 KB

algorithms.md

File metadata and controls

133 lines (99 loc) · 6.23 KB

Algorithms for sampling without replacement

In the standard form of the problem, the answers can be in any order and so the order must also be chosen fairly; there is a variation in which the answers must be in sorted order. Any algorithm can be turned into one with a "sorted" order guarantee with an O(k log k) sort, or a "random" order guarantee with an O(k) Fisher-Yates shuffle.

Algorithm Python C++ Order guarantee Data structures Time
quadratic rejection sampling Python C++ Random none k^2
rejection sampling Python C++ Random Set k
iterative random choosing Python C++ Sorted none n
reservoir sampling Python C++ Random none n
SELECT Python C++ Random n-sized list n
HSEL Python C++ Random Dictionary k
Floyd's F2 Python C++ none Set k
"cardchoose" Python C++ Sorted none k log k

Quadratic rejection sampling

Algorithm Python C++ Order guarantee Data structures Time
quadratic rejection sampling Python C++ Random none k^2

For each output, try choosing an integer in [0, n), but check every integer already in the output to see if it's the same, and reject and re-draw if so. This is quadratic in k, but for small k the small constants can mean this algorithm does very well.

Rejection sampling

Algorithm Python C++ Order guarantee Data structures Time
rejection sampling Python C++ Random Set k

Maintain a set of already-produced values. While the output isn't full, try choosing a random integer in [0, n). If this is in the set of already-produced values, discard it. Otherwise, append it to the output, and add it to the list of already-produced values.

One of the simplest possible algorithms, but the cost of set lookups and insertions is high.

Iterative random choosing

Algorithm Python C++ Order guarantee Data structures Time
iterative random choosing Python C++ Sorted none n

Iterate through every possible value in [0, n) in order and decide whether to include it in the output, weighting the probability appropriately depending on the number of spaces left in the output and the number of values remaining to test. One of the faster algorithms for producing sorted output where k is a large fraction of n.

  • Donald Knuth
    • The Art of Computer Programming
    • Volume 2: Seminumerical Algorithms.
    • Section 3.4.2
    • Page 137
    • "Algorithm S"

Reservoir sampling

Algorithm Python C++ Order guarantee Data structures Time
reservoir sampling Python C++ Random none n

Initialize the output with the integers [0, k) in random order. Then iterate through the remaining values [k, n) in order and decide whether to replace an existing, randomly chosen output with that integer. Fast for producing random output where k is a large fraction of n.

  • Donald Knuth
    • The Art of Computer Programming
    • Volume 2: Seminumerical Algorithms.
    • Section 3.4.2
    • Page 138
    • "Algorithm R"

SELECT

Algorithm Python C++ Order guarantee Data structures Time
SELECT Python C++ Random n-sized list n

Initialize an array of the values [0, n). Pick out values at random to include in the output, and delete them by replacing them with a value from one end and shortening the array.

This is known as SELECT in Ernvall and Nevalainen, though in a slight variation we pick off the array at the start, rather than the end.

Ernvall-Nevalainen HSEL

Algorithm Python C++ Order guarantee Data structures Time
HSEL Python C++ Random Dictionary k

This is a variation of the above in which we use a hash table to simulate the array of values [0, n), and store only where we change it. First proposed by Ernvall and Nevalainen in 1982:

It is also presented in Martin Harriman, Sampling Without Replacement: Durstenfeld-Fisher-Yates Permutation for Very Large Permutation Sizes.

Floyd's F2

Algorithm Python C++ Order guarantee Data structures Time
Floyd's F2 Python C++ none Set k

I refer you to two excellent explanations:

The original presentation is as one of Jon Bentley's Programming Pearls, though the algorithm is due to Robert W Floyd: A Sample of Brilliance.

Cardchoose

Algorithm Python C++ Order guarantee Data structures Time
"cardchoose" Python C++ Sorted none k log k

This is my new algorithm. Detailed description.