Abigail > Algorithm-Numerical-Sample-2010011201 > Algorithm::Numerical::Sample
Module Version: 2010011201

# NAME

Algorithm::Numerical::Sample - Draw samples from a set

# SYNOPSIS

```    use Algorithm::Numerical::Sample  qw /sample/;

@sample = sample (-set         => [1 .. 10000],
-sample_size => 100);

\$sampler = Algorithm::Numerical::Sample::Stream -> new;
while (<>) {\$sampler -> data (\$_)}
\$random_line = \$sampler -> extract;```

# DESCRIPTION

This package gives two methods to draw fair, random samples from a set. There is a procedural interface for the case the entire set is known, and an object oriented interface when the a set with unknown size has to be processed.

## A: `sample (set => ARRAYREF [,sample_size => EXPR])`

The `sample` function takes a set and a sample size as arguments. If the sample size is omitted, a sample of `1` is taken. The keywords `set` and `sample_size` may be preceeded with an optional `-`. The function returns the sample list, or a reference to the sample list, depending on the context.

## B: `Algorithm::Numerical::Sample::Stream`

The class `Algorithm::Numerical::Sample::Stream` has the following methods:

`new`

This function returns an object of the `Algorithm::Numerical::Sample::Stream` class. It will take an optional argument of the form `sample_size => EXPR`, where `EXPR` evaluates to the sample size to be taken. If this argument is missing, a sample of size `1` will be taken. The keyword `sample_size` may be preceeded by an optional dash.

`data (LIST)`

The method `data` takes a list of parameters which are elements of the set we are sampling. Any number of arguments can be given.

`extract`

This method will extract the sample from the object, and reset it to a fresh state, such that a sample of the same size but from a different set, can be taken. `extract` will return a list in list context, or the first element of the sample in scalar context.

# CORRECTNESS PROOFS

## Algorithm A.

Crucial to see that the `sample` algorithm is correct is the fact that when we sample `n` elements from a set of size `N` that the `t + 1`st element is choosen with probability `(n - m)/(N - t)`, when already `m` elements have been choosen. We can immediately see that we will never pick too many elements (as the probability is 0 as soon as `n == m`), nor too few, as the probability will be 1 if we have `k` elements to choose from the remaining `k` elements, for some `k`. For the proof that the sampling is unbiased, we refer to [3]. (Section 3.4.2, Exercise 3).

## Algorithm B.

It is easy to see that the second algorithm returns the correct number of elements. For a sample of size `n`, the first `n` elements go into the reservoir, and after that, the reservoir never grows or shrinks in size; elements only get replaced. A detailed proof of the fairness of the algorithm appears in [3]. (Section 3.4.2, Exercise 7).

# LITERATURE

Both algorithms are discussed by Knuth [3] (Section 3.4.2). The first algoritm, Selection sampling technique, was discovered by Fan, Muller and Rezucha [1], and independently by Jones [2]. The second algorithm, Reservoir sampling, is due to Waterman.

# REFERENCES

[1]

C. T. Fan, M. E. Muller and I. Rezucha, J. Amer. Stat. Assoc. 57 (1962), pp 387 - 402.

[2]

T. G. Jones, CACM 5 (1962), pp 343.

[3]

D. E. Knuth: The Art of Computer Programming, Volume 2, Third edition. Reading: Addison-Wesley, 1997. ISBN: 0-201-89684-2.

# DEVELOPMENT

The current sources of this module are found on github, git://github.com/Abigail/algorithm--numerical--sample.git.

# AUTHOR

This package was written by Abigail, cpan@abigail.be.