Module Version: 0.04

# NAME

Statistics::Sequences::Vnomes - The Serial Test (psi-square) and Generalized Serial Test (delta psi-square) for equiprobability of v-nomes (or v-plets/bits) (Good's and Kendall-Babington Smith's tests)

# SYNOPSIS

```
use Statistics::Sequences::Vnomes;
my $vnomes = Statistics::Sequences::Vnomes->new();
$vnomes->load(qw/H T T H H T H T/); # load a series before testing
$vnomes->test(length => 2)->dump();
```

# DESCRIPTION

This module implements tests of the independence of successive elements of a sequence/series of data (list, vector, etc.) - specifically, "serial tests" for v-nomes (a.k.a. v-plets or, for binary data, v-bits) - what are called singlets/monobits, dinomes/doublets, trinomes/triplets, etc.

Serial tests tell us whether all the variations of the states, of a certain sub-sequence length, v, that would be possible in the population from which the series has been sampled, are equally represented in the sample. For example, a sequence sampled from a "heads'n'tails" (H and T) distribution can be tested for its equal representation of the trinomes HTH, HTT, TTT, THT, and so on. Counting up these v-nomes at all points in the series, permitting overlaps, yields the Kendall-Babington Smith statistic, psi-square, which is approximately distributed as chi-square.
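The counting step can be sketched in a few lines of core Perl. This is only an illustration, not the module's code: `count_vnomes` is a hypothetical helper, and wrapping around the end of the series is an assumption chosen to match the `circularize` default described under Options.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module's API): count overlapping
# v-nomes of length $v, wrapping around the end of the series.
sub count_vnomes {
    my ($v, @series) = @_;
    my $n = scalar @series;
    my %counts;
    for my $i (0 .. $n - 1) {
        # Join the $v elements starting at position $i, wrapping past the end
        my $vnome = join '', map { $series[($i + $_) % $n] } 0 .. $v - 1;
        $counts{$vnome}++;
    }
    return \%counts;
}

my @series = qw/H T H H T T H T/;
my $counts = count_vnomes(2, @series);
printf "%s\t%d\n", $_, $counts->{$_} for sort keys %$counts;
```

With wrap-around, the number of v-nomes counted always equals the number of elements in the series, whatever the value of v.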

However, because these counts are not independent (given the overlaps), Good's Generalized Serial Test is the default test-statistic returned by this module's `test` routine. It computes psi-square by differencing, viz., in relation to not only the specified `length`, or value of v, but also the two preceding lengths (v - 1 and v - 2), yielding a statistic, delta-square-psi-square (the "second backward difference" measure), that is asymptotically distributed as chi-square.

The test is suitable for multi-state data, not only the binary, dichotomous series suitable for the Runs and Joins tests in this package. It can also be used to test that the individual elements in the list are uniformly distributed, that the states are equally represented, i.e., as a chi-square-based frequency test (a.k.a. test of uniformity, equiprobability, equidistribution). This is done by giving a `length` of 1, i.e., testing for mononomes.

Note that this is not the so-called serial test described by Knuth (1998, Ch. 3), which involves non-overlapping pairs of successive elements. Given this variety of definitions of what is a "serial test," this module - like that for Runs, Pot, etc. - is named after the basic construct tested - i.e., v-nomes - rather than after any property of v-nomes (seriality, successive independence, etc.).

# METHODS

## new

`$vnomes = Statistics::Sequences::Vnomes->new();`

Returns a new Vnomes object. Expects/accepts no arguments but the classname.

## load

```
$vnomes->load(@data);
$vnomes->load('dist1' => \@data1, 'dist2' => \@data2);
$vnomes->load({'dist1' => \@data1, 'dist2' => \@data2});
```

Loads data anonymously or by name. See load in the Statistics::Sequences manpage.

## test

`$vnomes->test(length => ?integer?, delta => '1|0', circularize => '1|0', states => [qw/A C G T/]);`

Performs the Kendall-Babington Smith or (by default) Good's Generalized Serial Test of v-nomes on the given or named distribution, yielding a psi-square statistic.

To test for the significance of the psi-square statistic, the raw psi-square value for sub-sequences of length v is, by default, not used, because, unless length (v) = 1, psi-square is not asymptotically distributed as chi-square. However, the differences between psi-square values for backwardly adjacent values of length (v) are asymptotically distributed as chi-square. By default, then, a "second backward differences" psi-square value is calculated, named (as per Good, 1953) `del2psi2`, which makes use of the psi-square values for sub-sequences of length v, v - 1, and v - 2. This statistic has been shown, logically and mathematically, not only to be asymptotically chi-square distributed, but also to offer statistically independent counts of all the possible variations of sequences of length v for the series in question.
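As a rough sketch of the differencing, the following core-Perl code computes psi-square for a circularized series in one common algebraic form, (t^v / n) * sum(count^2) - n, with psi-square for lengths below 1 taken as zero, and then takes the second backward difference. The helpers `psi_square` and `del2psi2` are hypothetical names for illustration, and the module's own implementation may be organized differently.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of the module's API.
# Raw psi-square for overlapping, circularized v-nomes over $t states:
#   psi2(v) = (t**v / n) * sum(count**2) - n, with psi2(v) = 0 for v < 1
sub psi_square {
    my ($v, $t, $series) = @_;
    return 0 if $v < 1;
    my $n = scalar @$series;
    my %counts;
    for my $i (0 .. $n - 1) {
        $counts{ join ',', map { $series->[($i + $_) % $n] } 0 .. $v - 1 }++;
    }
    my $sum_sq = 0;
    $sum_sq += $_ ** 2 for values %counts;   # unobserved v-nomes add zero
    return ($t ** $v / $n) * $sum_sq - $n;
}

# Second backward difference: psi2(v) - 2 * psi2(v - 1) + psi2(v - 2)
sub del2psi2 {
    my ($v, $t, $series) = @_;
    return psi_square($v, $t, $series)
         - 2 * psi_square($v - 1, $t, $series)
         + psi_square($v - 2, $t, $series);
}

my @seating = qw/E O E E O E E E O E E E O E O E/;
printf "psi2(1)     = %.2f\n", psi_square(1, 2, \@seating);
printf "del2psi2(2) = %.2f\n", del2psi2(2, 2, \@seating);
printf "del2psi2(3) = %.2f\n", del2psi2(3, 2, \@seating);
```

For this 16-element binary series the sketch gives 2.25, 1.00 and 6.25 respectively.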

Note that the "first backward difference" of psi-square, which is the difference between the psi-square values for sub-sequences of length v and length v - 1, is not ordinarily returned. While it is chi-square distributed, counts of such first-differences are not statistically independent (Good, 1953; Good & Gover, 1967). This value can, however, be returned in place of the default by specifying `delta` => 1. But note: "the sequence of second differences forms a much better set of statistics for testing the hypothesis of flat-randomness" (Good & Gover, 1967, p. 104) [compared to the first differences].
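Each of these statistics is referred to chi-square with its own degrees of freedom. A minimal sketch, assuming the degrees of freedom usually stated for them with t states: t^v - 1 for raw psi-square, t^v - t^(v-1) for the first difference, and t^(v-2) * (t - 1)^2 for the second difference. The helper `vnome_df` is hypothetical, and using 2 to denote the second difference is an assumption of this sketch rather than a documented option value.

```perl
use strict;
use warnings;

# Hypothetical helper: degrees of freedom for the psi-square statistics
# over $t states and v-nome length $v, by order of differencing:
# 0 = raw, 1 = first difference, 2 = second difference (assumed numbering).
sub vnome_df {
    my ($order, $v, $t) = @_;
    return $t ** $v - 1 if $order == 0;               # raw psi-square
    return $t - 1 if $v == 1;                         # no shorter length to difference against
    return $t ** $v - $t ** ($v - 1) if $order == 1;  # first backward difference
    return $t ** ($v - 2) * ($t - 1) ** 2;            # second backward difference
}

printf "binary trinomes, second difference: df = %d\n", vnome_df(2, 3, 2);
printf "binary trinomes, first difference:  df = %d\n", vnome_df(1, 3, 2);
```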

For consistency with other modules in the Statistics::Sequences package, the Z-value associated with psi-square's probability is also computed, and this is what is printed in dumps. You can get the psi-square statistic itself, however, as `$vnomes->{'psisq'}`, noting that, according to your value of `delta`, it is properly called del2psi2, delpsi2 or psi2.

The algorithm implemented for psi-square is that given by Good (1953, Eq. 1); benchmarking shows no reliable speed difference from alternative forms of the equation, as given by Good (1957, Eq. 2) and in the NIST test suite (Rukhin et al., 2001). Good's original algorithm can also be found in individual papers describing the application of the Serial Test (e.g., Davis & Akers, 1974).

By default, the p-value associated with the test-statistic is 2-tailed. See the Statistics::Sequences manpage for generic options other than the following Vnome test-specific ones. At the end of the test, the class object is loaded with the usual statistics; this time, however, the value of `observed` is the average of the observed frequencies of each v-nome, and an additional statistic, `observed_stdev`, the standard deviation of the observed frequencies, is also stored.

### Options

length

The length of the v-nome, i.e., the value of v. Must be an integer greater than or equal to 1, and smaller than the sample-size.

What is a meaningful maximal value of `length`? As a chi-square test, it could be held that there should be an expected frequency of at least 5 for each v-nome. This is "conventional wisdom" recommended by Knuth (1998) but can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2001) recommends that length should be less than the floored value of log2 of the sample-size, minus 2. No tests are here made of these recommendations, but if you choose to "dump" your results with verbosity (see `dump`), you will get a note if the NIST warning would apply.
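Both rules of thumb are easy to check before calling `test`. A minimal sketch, assuming a circularized series (so that each v-nome's expected frequency is n / t^v); `expected_freq` and `nist_length_ok` are hypothetical helpers, not functions provided by the module.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of the module's API.

# Expected frequency of each v-nome under equiprobability: n / t**v
sub expected_freq {
    my ($n, $t, $v) = @_;
    return $n / $t ** $v;
}

# NIST-style bound: length should be less than floor(log2(n)) - 2
sub nist_length_ok {
    my ($n, $v) = @_;
    my $floor_log2 = 0;
    $floor_log2++ while 2 ** ($floor_log2 + 1) <= $n;   # integer floor(log2 n)
    return $v < $floor_log2 - 2 ? 1 : 0;
}

my ($n, $t) = (1024, 2);
for my $v (1 .. 9) {
    printf "v = %d: expected = %.1f, NIST ok = %d\n",
        $v, expected_freq($n, $t, $v), nist_length_ok($n, $v);
}
```

The integer loop is used for floor(log2 n) so that the bound does not depend on floating-point rounding of a ratio of logarithms.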

circularize

By default, circularizes the data series; i.e., the datum after the last element is the first element. This affects (and slightly simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) offer, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103).

delta

By default, the statistics are based on the second backward difference of psi-squares, i.e., the Generalized Serial Test as described by Good (see REFERENCES). If `delta` => 0, the original Kendall-Babington Smith statistic (psi-square) is used. If `delta` => 1, the value of `psisq` is the first backward difference (delpsi2), as described above.

states

A referenced array listing the unique states (or 'events' or 'letters') in the population from which the series was sampled. This is useful to specify if the series itself is likely not to include all the possible states; it might even include only one of them. If this array is not specified, the unique states are identified from the series itself - in which case there ought to be at least two states in the series. Having only one state specified is not permissible. If giving a list of states, a check in each test is made to ensure that the data series contains only those elements in the list.
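The kind of check described for `states` can be sketched as follows. This is an illustration only: `unknown_elements` is a hypothetical helper, and the module's own check and error messages may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the distinct elements of @$series that are
# not among the declared @$states; dies if fewer than two states are given.
sub unknown_elements {
    my ($states, $series) = @_;
    die "At least two states are required\n" if @$states < 2;
    my %known = map { $_ => 1 } @$states;
    my %seen;
    return grep { !$known{$_} && !$seen{$_}++ } @$series;
}

my @states = qw/A C G T/;
my @series = qw/A A C G N A T N/;
my @bad = unknown_elements(\@states, \@series);
print "Not in states list: @bad\n" if @bad;
```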

## dump

`$vnomes->dump(flag => '1|0', text => '0|1|2');`

Print Vnome-test results to STDOUT. See dump in the Statistics::Sequences manpage for details. For comparability with other modules in the Statistics::Sequences package, the Z-value associated (by Math::Cephes::ndtri) with the obtained p-value is reported. If `text` => 2, then you get a verbose dump, including (1) the actual test-statistic depending on the value of `delta` tested (`del2psi2` for the second difference measure (default), `delpsi2` for the first difference measure, and `psi2` for the raw measure), followed by degrees-of-freedom in parentheses; and (2) a warning, if relevant, that your `length` value might be too large with respect to the sample size (see NIST reference, above, in discussing `length`). If `text` => 1, you just get the average observed and expected frequencies for each v-nome, the Z-value, and its associated p-value.

If you want to access the actual psi-square value that was tested against the chi-square distribution, you can retrieve it as `$vnomes->{'psisq'}`.

After testing, parameters named 'nstates' (the number of states), 'samplings' (the size of the sample), 'length' (what you requested) can be retrieved from the class object. You can retrieve the counts for each of the Vnomes in the series as a hash-reference named 'counts' in the class object, e.g.:

```
print "No. of $vnomes->{'length'}-nome variations of $vnomes->{'nstates'} states among $vnomes->{'samplings'} samplings:\n";
printf("\t%s\t%d\n", $_, $vnomes->{'counts'}->{$_}) foreach sort keys %{$vnomes->{'counts'}};
```

# EXAMPLE

## Seating at the diner

This is the data from Swed and Eisenhart (1943) also given as an example for the Runs test and Turns test. It lists the occupied (O) and empty (E) seats in a row at a lunch counter. Have people taken up their seats on a random basis? The Runs test suggested some non-random basis for people to take their seats, outputting (per dump with text => 1):

`  Runs: observed = 11.00, expected = 7.88, z = 1.60, 1p = 0.054834`

So there were more runs, or greater serial discontinuity, than expected. What does the test of Vnomes tell us?

```
use Statistics::Sequences::Vnomes;
my $vnomes = Statistics::Sequences::Vnomes->new();
my @seating = (qw/E O E E O E E E O E E E O E O E/);
$vnomes->load(@seating); # the data must be loaded before testing
$vnomes->test(length => 2)->dump();
```

This outputs, as returned by `string`:

` Z = -0.475232849247084, 2p = 0.317310507862914`

That is, the observed frequency of each possible pair of seating arrangements (the dinomes OO, OE, EE, EO) did not differ significantly from that expected. Widening our perspective, though, by changing the value of `length` to 3 so as to assess the equal representation of all possible trinomes, yields:

` Z = -1.70672129474387, 2p = 0.0439369336234074`
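As a hand check on the trinome p-value, a minimal sketch: for 2 degrees of freedom, the chi-square upper-tail probability has the closed form exp(-x / 2). The statistic of 6.25 used here is an assumption taken from a hand calculation of the second-difference psi-square on the circularized seating data (it is not printed in the output above), and `chisq_p_df2` is a hypothetical helper, not a module function.

```perl
use strict;
use warnings;

# Hypothetical helper: upper-tail chi-square probability, valid for df = 2 only
sub chisq_p_df2 {
    my ($x) = @_;
    return exp(-$x / 2);
}

# Assumed statistic: del2psi2 = 6.25 for length => 3 (hand-calculated)
printf "p = %.6f\n", chisq_p_df2(6.25);
```

This prints `p = 0.043937`, in agreement with the p-value reported for `length` => 3.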

# REFERENCES

Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. Journal of Parapsychology, 38, 393-407.

Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. Psychological Bulletin, 94, 166-176.

Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. Proceedings of the Cambridge Philosophical Society, 49, 276-284.

Good, I. J. (1957). On the serial test for random sequences. Annals of Mathematical Statistics, 28, 262-264.

Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of [square-root]2. Journal of the Royal Statistical Society A, 130, 102-107.

Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. Journal of the Royal Statistical Society, 101, 147-166.

Knuth, D. E. (1998). The art of computer programming (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.

Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2001). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22b.pdf.

See also Statistics::Sequences for other tests of sequences, and for sharing data between these tests.

# TO DO/BUGS

Implementation of the serial test for non-overlapping v-nomes.

# REVISION HISTORY

v.051: Default test-statistic reported in "dumps" is now the Z-value associated with the psi-square p-value; whatever the manner of calculating psi-square (by differencing or not), it can itself be retrieved as `$vnomes->{'psisq'}`, and is also dumped, by name, if `text` => 2 in a call to `dump`.

See CHANGES in installation dist for revisions.

rgarton AT cpan DOT org

This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).

# DISCLAIMER

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.

# END

This ends documentation of the Perl implementation of the chi-square, Kendall-Babington Smith, and Good's Generalized Serial Test for randomness in a sequence.
