The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Statistics::CaseResampling - Efficient resampling and calculation of medians with confidence intervals

SYNOPSIS

  use Statistics::CaseResampling ':all';
  
  my $sample = [1,3,5,7,1,2,9]; # ... usually MUCH more data ...
  my $confidence = 0.95; # ~2*sigma or "90% within confidence limits"
  #my $confidence = 0.37; # ~1*sigma or "~66% within confidence limits"
  
  # calculate the median of the sample with lower and upper confidence
  # limits using resampling/bootstrapping:
  my ($lower_cl, $median, $upper_cl)
    = median_simple_confidence_limits($sample, $confidence);
  
  # There are many auxiliary functions:
  
  my $resampled = resample($sample);
  # $resampled is now a random set of measurements from $sample,
  # including potential duplicates
  
  my $medians = resample_medians($sample, $n_resamples);
  # $medians is now an array reference containing the medians
  # of $n_resamples resample runs
  # This is vastly more efficient that doing the same thing with
  # repeated resample() calls!
  # Analogously:
  my $means = resample_means($sample, $n_resamples);
  
  # you can get the cl's from a set of separately resampled medians, too:
  my ($lower_cl, $median, $upper_cl)
    = simple_confidence_limits_from_samples($median, $medians, $confidence);
  
  # utility functions:
  print median([1..5]), "\n"; # prints 3
  print mean([1..5]), "\n"; # prints 3, too, surprise!
  print select_kth([1..5], 1), "\n"; # inefficient way to calculate the minimum

DESCRIPTION

The purpose of this (XS) module is to calculate the median (or in principle also other statistics) with confidence intervals on a sample. To do that, it uses a technique called bootstrapping. In a nutshell, it resamples the sample a lot of times and for each resample, it calculates the median. From the distribution of medians, it then calculates the confidence limits.

In order to implement the confidence limit calculation, various other functions had to be implemented efficiently (both algorithmically efficient and done in C). These functions may be useful in their own right and are thus exposed to Perl. Most notably, this exposes a median (and general selection) algorithm that works in linear time as opposed to the trivial implementation that requires O(n*log(n)).

Random numbers

The resampling involves drawing many random numbers. Therefore, the module comes with an embedded Mersenne twister (taken from Math::Random::MT).

If you want to change the seed of the RNG, do this:

  $Statistics::CaseResampling::Rnd
    = Statistics::CaseResampling::RdGen::setup($seed);
 

or

  $Statistics::CaseResampling::Rnd
    = Statistics::CaseResampling::RdGen::setup(@seed);

Do not use the embedded random number generator for other purposes. Use Math::Random::MT instead! At this point, you cannot change the type of RNG.

EXPORT

None by default.

Can export any of the functions that are documented below using standard Exporter semantics, including the customary :all group.

FUNCTIONS

This list of functions is loosely sorted from basic to comprehensive because the more complicated functions are usually (under the hood, in C) implemented using the basic building blocks. Unfortunately, that means you may want to read the documentation backwards. :)

Additionally, there is a whole set of general purpose, fast (XS) functions for calculating statistical metrics. They're useful without the bootstrapping related stuff, so they're listed in the "OTHER FUNCTIONS" section below.

All of these functions are written in C for speed.

resample(ARRAYREF)

Returns a reference to an array containing N random elements from the input array, where N is the length of the original array.

This is different from shuffling in that it's random drawing without removing the drawn elements from the sample.

resample_medians(ARRAYREF, NMEDIANS)

Returns a reference to an array containing the medians of NMEDIANS resamples of the original input sample.

resample_means(ARRAYREF, NMEANS)

Returns a reference to an array containing the means of NMEANS resamples of the original input sample.

simple_confidence_limits_from_median_samples(STATISTIC, STATISTIC_SAMPLES, CONFIDENCE)

Calculates the confidence limits for STATISTIC. Returns the lower confidence limit, the statistic, and the upper confidence limit. STATISTIC_SAMPLES is the output of, for example, resample_means(\@sample). CONFIDENCE indicates the fraction of data within the confidence limits (assuming a normal, symmetric distribution of the statistic => simple confidence limits).

For example, to get the 90% confidence (~2 sigma) interval for the mean of your sample, you can do the following:

  my $sample = [<numbers>];
  my $means = resample_means($sample, $n_resamples);
  my ($lower_cl, $mean, $upper_cl)
    = simple_confidence_limits_from_samples(mean($sample), $means, 0.90);

If you want to apply this logic to other statistics such as the first or third quartile, you have to manually resample and lose the advantage of doing it in C:

  my $sample = [<numbers>];
  my $quartiles = [];
  foreach (1..1000) {
    push @$quartiles, first_quartile(resample($sample));
  }
  my ($lower_cl, $firstq, $upper_cl)
    = simple_confidence_limits_from_samples(
        first_quartile($sample), $quartiles, 0.90
      );

For a reliable calculation of the confidence limits, you should use at least 1000 samples if not more. Yes. This whole procedure is expensive.

For medians, however, use the following function:

median_simple_confidence_limits(SAMPLE, CONFIDENCE, [NSAMPLES])

Calculates the confidence limits for the CONFIDENCE level for the median of SAMPLE. Returns the lower confidence limit, the median, and the upper confidence limit.

In order to calculate the limits, a lot of resampling has to be done. NSAMPLES defaults to 1000. Try running this a couple of times on your data interactively to see how the limits still vary a little bit at this setting.

OTHER FUNCTIONS

approx_erf($x)

Calculates an approximatation of the error function of x. Implemented after

  Winitzki, Sergei (6 February 2008).
  "A handy approximation for the error function and its inverse" (PDF). 
  http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf

Quoting: Relative precision better than 1.3e-4.

approx_erf_inv($x)

Calculates an approximation of the inverse of the error function of x.

Algorithm from the same source as approx_erf.

Quoting: Relative precision better than 2e-3.

nsigma_to_alpha($nsigma)

Calculates the probability that a measurement from a normal distribution is further away from the mean than $nsigma standard deviations.

The confidence level (what you pass as the CONFIDENCE parameter to some functions in this module) is 1 - nsigma_to_alpha($nsigma).

alpha_to_nsigma($alpha)

Inverse of nsigma_to_alpha().

mean(ARRAYREF)

Calculates the mean of a sample.

median(ARRAYREF)

Calculates the median (second quartile) of a sample. Works in linear time thanks to using a selection instead of a sort.

Unfortunately, the way this is implemented, the median of an even number of parameters is, here, defined as the n/2-1th largest number and not the average of the n/2-1th and the n/2th number. This shouldn't matter for nontrivial sample sizes!

median_absolute_deviation(ARRAYREF)

Calculates the median absolute deviation (MAD) in what I believe is O(n). Take care to rescale the MAD before using it in place of a standard deviation.

first_quartile(ARRAYREF)

Calculates the first quartile of the sample.

third_quartile(ARRAYREF)

Calculates the third quartile of the sample.

select_kth(ARRAYREF, K)

Selects the kth smallest element from the sample.

This is the general function that implements the median/quartile calculation. You get the median with this definition of K:

  my $k = int(@$sample/2) + (@$sample & 1);
  my $median = select_kth($sample, $k);

sample_standard_deviation

Given the sample mean and an anonymous array of numbers (the sample), calculates the sample standard deviation.

population_standard_deviation

Same as sample_standard_deviation, but without the correction to N.

TODO

One could calculate more statistics in C for performance.

SEE ALSO

Math::Random::MT

On the approximation of the error function:

  Winitzki, Sergei (6 February 2008).
  "A handy approximation for the error function and its inverse" (PDF). 
  http://homepages.physik.uni-muenchen.de/~Winitzki/erf-approx.pdf

The ~O(n) median implementation is based on C.A.R. Hoare's quickselect algorithm. See http://en.wikipedia.org/wiki/Selection_algorithm#Partition-based_general_selection_algorithm. Right now, it does not implement the Median of Medians algorithm that would guarantee linearity.

AUTHOR

Steffen Mueller, <smueller@cpan.org>

Daniel Dragan, <bulk88@hotmail.com>, who supplied MSVC compatibility patches.

COPYRIGHT AND LICENSE

Copyright (C) 2010, 2011, 2012, 2013 by Steffen Mueller

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.0 or, at your option, any later version of Perl 5 you may have available.