Reservoir Sampling

Reservoir sampling is a family of randomized algorithms for sampling k items from a stream of unknown size n. The reservoir maintains a uniform random sample as new items arrive.

The algorithm:

  • Reservoir (sample) initialized with first k elements
  • For item i from k to n:
    • Randomly replace an element in reservoir with probability k/i

Java example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
int[] reservoirSample(Iterator stream, int k) {
  int[] reservoir = new int[k];

  // Initialize reservoir  
  for (int i = 0; i < k; i++) {
    reservoir[i] = stream.next();
  }

  int i = k; 
  while (stream.hasNext()) {
    int next = stream.next();
    int j = rand.nextInt(i+1); 
    if (j < k) {
      reservoir[j] = next; 
    }
    i++;
  }

  return reservoir;
}

C++ example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
vector<int> reservoirSample(vector<int>::iterator iter, int k) {
  vector<int> reservoir(k);

  // Initialize reservoir

  for (int i = k; i < n; i++) {
    int j = rand() % (i+1);
    if (j < k) {
      reservoir[j] = *(iter++); 
    }
  }

  return reservoir; 
}

Python example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
import random

def reservoir_sample(stream, k):

  reservoir = []
  for i in range(k):
    reservoir.append(next(stream))

  i = k 
  while True:
    next_elem = next(stream, None)
    if next_elem is None:
      break

    j = random.randint(0, i)
    if j < k:
      reservoir[j] = next_elem

    i += 1

  return reservoir

Reservoir sampling provides an unbiased random sample from a stream. Useful for statistics and machine learning.

Reservoir Sampling is an algorithmic technique used for randomly selecting a sample of k items from a list containing n items, where n is either a large or unknown number. The algorithm processes each item only once and uses constant memory, making it highly efficient. The key takeaway is that Reservoir Sampling allows you to maintain a representative sample without knowing the total size of the data stream upfront.

Java Code for Reservoir Sampling

In Java, you can use the java.util.Random class to generate random numbers for the sampling.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
import java.util.Random;

public class ReservoirSampling {
    public static int[] sample(int[] stream, int k) {
        int[] reservoir = new int[k];
        Random rand = new Random();

        for (int i = 0; i < k; i++) {
            reservoir[i] = stream[i];
        }

        for (int i = k; i < stream.length; i++) {
            int j = rand.nextInt(i + 1);
            if (j < k) {
                reservoir[j] = stream[i];
            }
        }

        return reservoir;
    }
}
  1. sample(int[] stream, int k) returns an array of k randomly selected items from stream.

C++ Code for Reservoir Sampling

In C++, you can use the <random> library to generate random numbers.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
#include <vector>
#include <random>

std::vector<int> sample(const std::vector<int>& stream, int k) {
    std::vector<int> reservoir(k);
    std::random_device rd;
    std::mt19937 gen(rd());

    for (int i = 0; i < k; ++i) {
        reservoir[i] = stream[i];
    }

    for (int i = k; i < stream.size(); ++i) {
        std::uniform_int_distribution<> dis(0, i);
        int j = dis(gen);
        if (j < k) {
            reservoir[j] = stream[i];
        }
    }

    return reservoir;
}
  1. sample(const std::vector<int>& stream, int k) returns a vector containing k random samples from stream.

Python Code for Reservoir Sampling

Python’s random library can be used for generating random numbers.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
import random

def sample(stream, k):
    reservoir = []
    for i in range(k):
        reservoir.append(stream[i])

    for i in range(k, len(stream)):
        j = random.randint(0, i)
        if j < k:
            reservoir[j] = stream[i]

    return reservoir
  1. sample(stream, k) returns a list of k randomly selected items from stream.

All these implementations use the same underlying algorithm. They maintain a reservoir array that initially stores the first k elements. For each subsequent element, they generate a random index and replace the corresponding reservoir value if the index is within k. The result is a representative random sample of k items from the original list or stream.