Statistics::Sequences::Runs - descriptives, deviation and combinatorial tests of Wald-Wolfowitz runs
This is documentation for Version 0.21 of Statistics::Sequences::Runs.
use strict;
use Statistics::Sequences::Runs 0.21; # not compatible with versions < .10
my $runs = Statistics::Sequences::Runs->new();
# Data make up a dichotomous sequence:
my @data = (qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/);
my $val;
# - Pre-load data to use for all methods:
$runs->load(\@data);
$val = $runs->observed();
$val = $runs->expected();
# - or send data as "data => $aref" to each method:
$val = $runs->observed(data => \@data);
# - or send frequencies of each of the 2 elements:
$val = $runs->expected(freqs => [11, 9]); # works with other methods except observed()
# Deviation ratio:
$val = $runs->z_value(ccorr => 1);
# Probability:
my ($z, $p) = $runs->z_value(ccorr => 1, tails => 1); # dev. ratio with p-value
$val = $runs->p_value(tails => 1); # normal dist. p-value itself
$val = $runs->p_value(exact => 1, tails => 1); # by combinatorics
# Keyed list of descriptives etc.:
my $href = $runs->stats_hash(values => {observed => 1, p_value => 1}, exact => 1);
# Print descriptives etc. in the same way:
$runs->dump(
    values => {observed => 1, expected => 1, p_value => 1},
    exact => 1,
    flag => 1,
    precision_s => 3,
    precision_p => 7
);
# prints: observed = 11.000, expected = 10.900, p_value = 0.5700167
The module returns statistical information re Wald-type runs: a sequence of events on 1 or more consecutive trials. For example, in a signal-detection test composed of match (H) and miss (M) events over time like H-H-M-H-M-M-M-M-H, there are 5 runs: 3 Hs, 2 Ms. This number of runs between two events can be compared with the number expected to occur by chance over the number of trials, relative to the expected variance (see REFERENCES). More runs than expected ("negative serial dependence") can denote irregularity, instability, mixing up of alternatives. Fewer runs than expected ("positive serial dependence") can denote cohesion, insulation, isolation of alternatives. Both can indicate sequential dependency: either negative (a bias to produce too many alternations), or positive (a bias to produce too many repetitions).
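The definition of a run can be illustrated with a minimal pure-Perl sketch that counts runs in the match/miss sequence above. The function name count_runs is hypothetical and is not part of the module; it only assumes that a new run starts wherever an element differs from the one before it.

```perl
use strict;
use warnings;

# Count runs in a sequence: a new run starts at the first element and
# wherever an element differs from its predecessor.
# Returns the total run count and a hashref of run counts per state.
sub count_runs {
    my @seq = @_;
    my $total = 0;
    my %per_state;
    my $prev;
    for my $event (@seq) {
        if (!defined $prev or $event ne $prev) {
            $total++;
            $per_state{$event}++;
        }
        $prev = $event;
    }
    return ($total, \%per_state);
}

my ($n_runs, $by_state) = count_runs(qw/H H M H M M M M H/);
printf "runs = %d (H: %d, M: %d)\n", $n_runs, $by_state->{H}, $by_state->{M};
# prints: runs = 5 (H: 3, M: 2)
```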
The distribution of runs is asymptotically normal, and a deviation-based test of extra-chance occurrence is conventionally considered reliable when at least one alternative has more than 20 occurrences (Siegel's rule), or when both event occurrences exceed 10 (Kelly, 1982); otherwise, the module provides an "exact test" option, based on combinatorics.
Have non-dichotomous, continuous or multinomial data? See Statistics::Data::Dichotomize for how to prepare them for test of runs.
$runs = Statistics::Sequences::Runs->new();
Returns a new Runs object. Expects/accepts no arguments but the classname.
$runs->load(@data);
$runs->load(\@data);
$runs->load(foodat => \@data); # labelled whatever
Loads data anonymously or by name - see load in the Statistics::Data manpage for details on the various ways data can be loaded and then retrieved (more than shown here).
After the load, the data are read to ensure that they contain only two unique elements - if not, carp occurs and 0 rather than 1 is returned.
Alternatively, skip this action; data don't always have to be loaded to use the stats methods here. To get the observed number of runs, data of course have to be loaded, but other stats can be got if given the observed count - otherwise, they too depend on data having been loaded.
Every load unloads all previous loads and any additions to them.
See Statistics::Data for these additional operations on data that have been loaded.
$v = $runs->observed(); # use the first data loaded anonymously
$v = $runs->observed(index => 1); # ... or give the required "index" for the loaded data
$v = $runs->observed(label => 'foodat'); # ... or its "label" value
$v = $runs->observed(data => [1, 0, 1, 1]); # ... or just give the data now
Returns the total observed number of runs in the loaded or given data. For example,
$v = $runs->observed(data => [qw/H H H T T H H/]);
returns 3.
Aliases: runcount_observed, rco
@freq = $runs->observed_per_state(data => \@data);
$href = $runs->observed_per_state(data => \@data);
Returns the number of runs per state - as an array where the first element gives the count for the first state in the data, and so for the second. A keyed hashref is returned if not called in array context. For example:
@ari = $runs->observed_per_state(data => [qw/H H H T T H H/]); # returns (2, 1)
$ref = $runs->observed_per_state(data => [qw/H H H T T H H/]); # returns { H => 2, T => 1}
$v = $runs->expected(); # or specify loaded data by "index" or "label", or give it as "data" - see observed()
$v = $runs->expected(data => [qw/blah bing blah blah blah/]); # use these data
$v = $runs->expected(freqs => [12, 7]); # don't use actual data; calculate from these two Ns
Returns the expected number of runs across the loaded data. Expectation is given as follows:
E[R] = ( (2n_{1}n_{2}) / (n_{1} + n_{2}) ) + 1
where n_{i} is the number of observations of element i in the data.
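As a worked check of this expectation, take the two frequencies used in the synopsis, 11 and 9:

```latex
E[R] = \frac{2 n_{1} n_{2}}{n_{1} + n_{2}} + 1
     = \frac{2 \cdot 11 \cdot 9}{11 + 9} + 1
     = \frac{198}{20} + 1
     = 10.9
```

This agrees with the expected value of 10.900 printed in the synopsis.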
Aliases: runcount_expected, rce
$v = $runs->variance(); # use data already loaded - anonymously; or specify its "label" or "index" - see observed()
$v = $runs->variance(data => [qw/blah bing blah blah blah/]); # use these data
$v = $runs->variance(freqs => [5, 12]); # use these trial numbers - not any particular sequence of data
Returns the variance in the number of runs for the given data.
V[R] = ( (2n_{1}n_{2})([2n_{1}n_{2}] - [n_{1} + n_{2}]) ) / ( ((n_{1} + n_{2})^{2})((n_{1} + n_{2}) - 1) )
defined as above for runcount_expected.
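As a worked check of the variance, again with the frequencies 11 and 9 used in the synopsis:

```latex
V[R] = \frac{2 n_{1} n_{2} \, (2 n_{1} n_{2} - n_{1} - n_{2})}{(n_{1} + n_{2})^{2} \, (n_{1} + n_{2} - 1)}
     = \frac{198 \cdot (198 - 20)}{20^{2} \cdot 19}
     = \frac{35244}{7600}
     \approx 4.637
```

The standard deviation for these frequencies is the square root of this value, approximately 2.153.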
The data to test can already have been loaded, or you send it directly as a flat referenced array keyed as data.
Aliases: runcount_variance, rcv
$v = $runs->obsdev(); # use data already loaded - anonymously; or specify its "label" or "index" - see observed()
$v = $runs->obsdev(data => [qw/blah bing blah blah blah/]); # use these data
Returns the deviation of (difference between) observed and expected runs for the loaded/given sequence (O - E).
Alias: obsdev
$v = $runs->stdev(); # use data already loaded - anonymously; or specify its "label" or "index" - see observed()
$v = $runs->stdev(data => [qw/blah bing blah blah blah/]);
Returns square-root of the variance.
Alias: stdev
Returns run skewness as given by Barton & David (1958) based on the frequencies of the two different elements in the sequence.
Returns run kurtosis as given by Barton & David (1958) based on the frequencies of the two different elements in the sequence.
$p = $runs->pmf(data => \@data); # or no args to use last pre-loaded data
$p = $runs->pmf(observed => 5, freqs => [5, 20]);
Implements the runs probability mass function, returning the probability for a particular number of runs given so many dichotomous events (e.g., as in Swed & Eisenhart, 1943, p. 66); i.e., for u' the observed number of runs, P{u = u'}. The required function parameters are the observed number of runs, and the frequencies (counts) of each state in the sequence, which can be given directly, as above, in the arguments observed and freqs, respectively, or these will be worked out from a given data sequence itself (given here or as pre-loaded). For derivation, see its public internal methods n_max_seq and m_seq_k.
$p = $runs->cdf(data => \@data); # or no args to use last pre-loaded data
$p = $runs->cdf(observed => 5, freqs => [5, 20]);
Implements the cumulative distribution function for runs, returning the probability of obtaining the observed number of runs or fewer, down to the minimum possible number of 2 (assuming that the two possible events are actually represented in the data), as per Swed & Eisenhart (1943), p. 66; i.e., for u' the observed number of runs, P{u <= u'}. The summation is over the probability mass function pmf. The function parameters are the observed number of runs, and the frequencies (counts) of the two events, which can be given directly, as above, in the arguments observed and freqs, respectively, or these will be worked out from a given data sequence itself (given here or as pre-loaded).
$p = $runs->cdfi(data => \@data); # or no args for last pre-loaded data
$p = $runs->cdfi(observed => 11, freqs => [5, 11]);
Implements the (inverse) cumulative distribution function for runs, returning the probability of obtaining the observed number of runs or more (assuming that the two possible events are actually represented in the data), as per Swed & Eisenhart (1943), p. 66; i.e., for u' the observed number of runs, P = 1 - P{u <= u' - 1}, where the cumulative probability is summed up from the minimum possible number of 2 runs. The summation is over the probability mass function pmf. The function parameters are the observed number of runs, and the frequencies (counts) of the two events, which can be given directly, as above, in the arguments observed and freqs, respectively, or these will be worked out from a given data sequence itself (given here or as pre-loaded).
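How pmf, cdf and cdfi relate can be sketched in pure Perl, assuming the standard closed forms for even and odd run counts as tabulated by Swed & Eisenhart (1943). The helper names choose, runs_pmf, runs_cdf and runs_cdfi are hypothetical; this is an illustration under those assumptions, not the module's own implementation.

```perl
use strict;
use warnings;

# Binomial coefficient "n choose k"; 0 when k is out of range.
sub choose {
    my ($n, $k) = @_;
    return 0 if $k < 0 or $k > $n;
    my $c = 1;
    $c = $c * ($n - $k + $_) / $_ for 1 .. $k;
    return $c;
}

# P{u = u'} for u' runs, given m events of one kind and n of the other.
sub runs_pmf {
    my ($u, $m, $n) = @_;
    my $k = int($u / 2);
    my $ways = $u % 2
        # odd u' = 2k + 1: one kind forms k + 1 runs, the other k runs
        ? choose($m - 1, $k) * choose($n - 1, $k - 1)
          + choose($m - 1, $k - 1) * choose($n - 1, $k)
        # even u' = 2k: both kinds form k runs; either kind can come first
        : 2 * choose($m - 1, $k - 1) * choose($n - 1, $k - 1);
    return $ways / choose($m + $n, $m);
}

# P{u <= u'}: sum of the pmf from the minimum of 2 runs up to u'.
sub runs_cdf {
    my ($u, $m, $n) = @_;
    my $p = 0;
    $p += runs_pmf($_, $m, $n) for 2 .. $u;
    return $p;
}

# P{u >= u'} = 1 - P{u <= u' - 1}
sub runs_cdfi {
    my ($u, $m, $n) = @_;
    return 1 - runs_cdf($u - 1, $m, $n);
}

printf "%.7f\n", runs_pmf(11, 5, 11);     # prints 0.0576923
printf "%.7f\n", runs_cdfi(11, 5, 11);    # same value: 11 is the maximum possible here
```

With frequencies of 5 and 11, at most 11 runs are possible, so the upper-tail probability for 11 runs equals the probability mass at 11.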
$v = $runs->z_value(ccorr => 1); # use data already loaded - anonymously; or specify its "label" or "index" - see observed()
$v = $runs->z_value(data => $aref, ccorr => 1);
($zvalue, $pvalue) = $runs->z_value(data => $aref, ccorr => 1, tails => 2); # same but wanting an array, get the p-value too
Returns the zscore from a test of runcount deviation, taking the runcount expected away from that observed and dividing by the root expected runcount variance, by default with a continuity correction to expectation. Called wanting an array, returns the z-value with its p-value for the tails (1 or 2) given.
The data to test can already have been loaded, or sent directly as an aref keyed as data.
Other options are precision_s (for the z_value) and precision_p (for the p_value).
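The deviation ratio can be sketched in pure Perl from the expectation and variance formulas given earlier. The function name runs_z is hypothetical, and the form of the continuity correction (reducing the absolute deviation by 0.5, never past zero) is an assumption made for illustration rather than a statement of the module's internals.

```perl
use strict;
use warnings;

# z = (O - E) / sqrt(V), optionally with a continuity correction.
sub runs_z {
    my ($observed, $n1, $n2, $ccorr) = @_;
    my $n        = $n1 + $n2;
    my $expected = 2 * $n1 * $n2 / $n + 1;
    my $variance = 2 * $n1 * $n2 * (2 * $n1 * $n2 - $n) / ($n**2 * ($n - 1));
    my $dev      = $observed - $expected;
    if ($ccorr) {
        # Assumed correction: shrink |O - E| by 0.5, but not below zero.
        my $shrunk = abs($dev) - 0.5;
        $shrunk = 0 if $shrunk < 0;
        $dev = ($dev <=> 0) * $shrunk;
    }
    return $dev / sqrt($variance);
}

# 11 observed runs with state frequencies of 5 and 11:
printf "%.4f\n", runs_z(11, 5, 11, 1);    # prints 1.5997
```

A z-value of about 1.60 corresponds to a one-tailed normal probability of about .055.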
Aliases: runcount_zscore, rzs, zscore
$p = $runs->p_value(); # using loaded data and default args
$p = $runs->p_value(ccorr => 0|1, tails => 1|2); # normal-approx. for last-loaded data
$p = $runs->p_value(exact => 1); # calc combinatorially for observed >= or < than expectation
$p = $runs->p_value(data => [1, 0, 1, 1, 0], exact => 1); # given data
$p = $runs->p_value(freqs => [12, 12], observed => 8); # no data sequence, specify known params
Returns the probability of getting the observed number of runs or a smaller number given the number of each of the two events. By default, a large sample is assumed, and the probability is obtained from the normalized deviation, as given by the zscore method.
If the option exact is defined and not zero, then the probability is worked out combinatorially, as per Swed & Eisenhart (1943), Eq. 1, p. 66 (see also Siegel, 1956, Eqs. 6.12a and 6.12b, p. 138). By default, this is a one-tailed test, testing the hypotheses that there are either too many or too few runs relative to chance expectation; the "correct" hypothesis is tested based on the expected value returned by the expected method. Setting tails => 2 simply doubles the one-tailed p-value from any of these tests. Output from these tests has been checked against the tables and examples in Swed & Eisenhart (given to 7 decimal places), and found to agree.
The option precision_p gives the returned p-value to so many decimal places.
Aliases: test, runs_test, rct
Likelihood ratio chi-square test for runs by length.
Returns true for the loaded sequence if its constituent sample numbers are sufficient for their expected runs to be normally approximated, using Siegel's (1956, p. 140) rule: ok if either of the two Ns is greater than 20.
Methods used internally, or for returning/printing descriptives, etc., in a bunch.
@freq = $runs->bi_frequency(data => \@data); # or no args if using last pre-loaded data
Returns frequency of the two elements - or croaks if there are more than 2, and gives zero for any absent.
$n = $runs->n_max_seq(); # loaded data
$n = $runs->n_max_seq(data => \@data); # this sequence
$n = $runs->n_max_seq(observed => int, freqs => [int, int]); # these specs
Returns the number of possible sequences for the two given state frequencies. Suppose the proverbial urn contains N1 black balls and N2 white balls, well mixed, and N1 + N2 drawings are taken from it without replacement, so that any sequence has the same probability of occurring; how many different sequences of black and white balls are possible? For the two counts, this is "N1 + N2 choose N1", or:
N_{max} = ( N_{1} + N_{2} )! / ( N_{1}! N_{2}! )
With the usual definition of a probability as M / N, this is the denominator term in the runs probability mass function (pmf). This does not take into account the probability of obtaining so many of each event, of the proportion of black and white balls in the urn. (That's work and play for another day.)
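As a worked example of this denominator, take the lunch-counter data used in the examples, with 5 occupied and 11 empty seats:

```latex
N_{max} = \frac{(5 + 11)!}{5! \, 11!} = \binom{16}{5} = 4368
```

Of these 4368 equally likely arrangements, 252 produce the maximum of 11 runs, so the probability of 11 runs is 252 / 4368, or about .0576923, which is the exact p-value reported for those data.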
$n = $runs->m_seq_k(); # loaded data
$n = $runs->m_seq_k(data => \@data); # this sequence
$n = $runs->m_seq_k(observed => int, freqs => [int, int]); # these specs
Returns the number of sequences that can produce k runs from m elements of a single kind, with all other kinds of elements in the sequence assumed to be of a single kind, under the conditions of n_max_seq. See Swed and Eisenhart (1943), or Barton and David (1958, p. 253). With the usual definition of a probability as M / N, this is the numerator term in the runs probability mass function (pmf).
$href = $runs->stats_hash(values => {observed => 1, expected => 1, variance => 1, z_value => 1, p_value => 1}, exact => 0, ccorr => 1);
Returns a hashref for the counts and stats as specified in its "values" argument, and with any options for calculating them (e.g., exact for p_value). See "stats_hash" in Statistics::Sequences for details. If calling via a "runs" object, the option "stat => 'runs'" is not needed (unlike when using the parent "sequences" object).
$runs->dump(values => { observed => 1, variance => 1, p_value => 1}, exact => 1, flag => 1, precision_s => 3); # among other options
Print Runs-test results to STDOUT. See "dump" in Statistics::Sequences for details of what stats to dump (default is observed() and p_value()). Optionally also give the data directly.
$runs->dump_data(delim => "\n"); # print whatever's loaded (or specify by label, index, or as "data")
See "dump_data" in Statistics::Sequences for details.
Swed and Eisenhart (1943) list the occupied (O) and empty (E) seats in a row at a lunch counter. Have people taken up their seats on a random basis?
use Statistics::Sequences::Runs;
my $runs = Statistics::Sequences::Runs->new();
my @seating = (qw/E O E E O E E E O E E E O E O E/); # data already form a single sequence with dichotomous observations
$runs->dump(data => \@seating, exact => 0, tails => 1); # normal approximation
Suggesting some non-random basis for people taking their seats, this prints:
observed = 11, p_value = 0.054834
But these data would fail Siegel's rule (ztest_ok = 0), as neither state has more than 20 observations. So check the exact probability of the hypothesis that the observed deviation is greater than zero (1-tailed):
$runs->dump(data => \@seating, values => {p_value => 1}, exact => 1, tails => 1);
This prints a p-value of .0576923 (so the normal approximation seems good in any case).
These data are also used in an example of testing for Vnomes.
In a single run of a classic ESP test, there are 25 trials, each composed of a randomly generated event (typically, one of 5 possible geometric figures), and a human-generated event arbitrarily drawn from the same pool of alternatives. Tests of the match between the random and human data are typically for number of matches observed versus expected. The runs of matches and misses can be tested by dichotomizing the data on the basis of the match of the random "targets" with the human "responses", as described by Kelly (1982):
use Statistics::Sequences::Runs;
use Statistics::Data::Dichotomize;
my @targets = (qw/p c p w s p r w p c r c r s s s s r w p r w c w c/);
my @responses = (qw/p c s c s s p r w r w c c s s r w s w p c r w p r/);
# Test for runs of matches between targets and responses:
my $runs = Statistics::Sequences::Runs->new();
my $ddat = Statistics::Data::Dichotomize->new();
$runs->load($ddat->match(data => [\@targets, \@responses]));
$runs->dump_data(delim => ' '); # have a look at the match sequence; prints "1 1 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0\n"
print "Probability of these many runs vs expectation: ", $runs->test(), "\n"; # 0.51436
# or test for runs in matching when responses are matched to targets one trial behind:
print $runs->test(data => $ddat->match(data => [\@targets, \@responses], lag => -1)), "\n"; # 0.73766
These papers provide the implemented algorithms and/or the sample data used in examples and tests.
Barton, D. E., & David, F. N. (1958). Non-randomness in a sequence of two alternatives: II. Runs test. Biometrika, 45, 253-256. doi: 10.2307/2333062
Kelly, E. F. (1982). On grouping of hits in some exceptional psi performers. Journal of the American Society for Psychical Research, 76, 101-142.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York, NY, US: McGraw-Hill.
Swed, F., & Eisenhart, C. (1943). Tables for testing randomness of grouping in a sequence of alternatives. Annals of Mathematical Statistics, 14, 66-87. doi: 10.1214/aoms/1177731494
Wald, A., & Wolfowitz, J. (1940). On a test whether two samples are from the same population. Annals of Mathematical Statistics, 11, 147-162. doi: 10.1214/aoms/1177731909
Wolfowitz, J. (1943). On the theory of runs with some applications to quality control. Annals of Mathematical Statistics, 14, 280-288. doi: 10.1214/aoms/1177731421
Statistics::Sequences for other tests of sequences, such as ...
Statistics::Sequences::Pot, and for sharing data between these tests.
Roderick Garton, <rgarton at cpan.org>
This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).
To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.