The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
package Statistics::Sequences::Vnomes;
use strict;
use warnings FATAL => 'all';
use Carp 'croak';
use Algorithm::Combinatorics qw(variations_with_repetition);
use Math::Cephes 0.43;
use Statistics::Sequences 0.13;
use Statistics::Lite qw(sum);
use base qw(Statistics::Sequences);
$Statistics::Sequences::Vnomes::VERSION = '0.20';

=pod

=head1 NAME

Statistics::Sequences::Vnomes - Serial Test psi-square for equiprobability of v-nomes (or Ngrams) (Good's and Kendall-Babington-Smith's tests)

=head1 SYNOPSIS

 use Statistics::Sequences::Vnomes 0.20;
 my $vnomes = Statistics::Sequences::Vnomes->new();
 $vnomes->load(qw/1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 1/); # ordered data, numerical or other, binary or what
 my $freq_href = $vnomes->observed(length => 3); # returns hashref of frequencies for trinomes in the sequence
 my @freq = $vnomes->observed(length => 3); # returns only observed trinome frequencies (not keyed by trinomes themselves)
 $val = $vnomes->expected(length => 3); # mean chance expectation for the frequencies (2.5)
 $val = $vnomes->psisq(length => 3); # Good's "second backward differences" psi-square (3.161); see option 'delta'
 $val = $vnomes->p_value(length => 3); # for psi-square (0.206)
 my $href = $vnomes->stats_hash(length => 3, values => {psisq => 1, p_value => 1}); # include any method (& their options)
 $vnomes->dump(length => 3,
  values => {observed => 1, expected => 1, psisq => 1, p_value => 1}, # what stats to show (or not if => 0)
  format => 'table', flag => 1, precision_s => 3, precision_p => 7, verbose => 1);
  # prints:
   # Vnomes (3) statistics
   #.-----------+-----------+-----------+-----------.
   #| observed  | expected  | p_value   | psisq     |
   #|           |           |           |           |
   #+-----------+-----------+-----------+-----------+
   #| '011' = 4 | 2.500     | 0.2058681 | 3.161     |
   #| '010' = 1 |           |           |           |
   #| '111' = 2 |           |           |           |
   #| '000' = 1 |           |           |           |
   #| '101' = 2 |           |           |           |
   #| '001' = 3 |           |           |           |
   #| '100' = 3 |           |           |           |
   #| '110' = 4 |           |           |           |
   #'-----------+-----------+-----------+-----------'
   # psisq is Good's delta^2-psi^2, calculated with second backward differences, and has 2 degrees of freedom.

 # following and other methods inherited from Statistics::Sequences and its parent, Statistics::Data:
 $vnomes->dump_data(delim => ','); # comma-separated single line of the loaded data
 $vnomes->save_to_file(path => 'seq.csv'); # for retrieval by load_from_file() method

=head1 DESCRIPTION

This module implements the Kendall-Babington-Smith and Good's serial tests of the independence of successive elements within a sequence. At the least, it can be used to test that the individual elements in the sequence are uniformly distributed--when giving a B<length> of 1 to the function L<psisq|Statistics::Sequences::Vnomes/psisq>. More generally, an array of data is given as an ordered categorical ("stringy") series of events, and a description of this array is wanted in terms of the frequency of its constituent "blocks" of length I<v>, and a test is made of how these counts differ from the counts expected for a sequence generated by the same process, for as many events, of the same length.

Successive elements are defined as I<v>-nomes, a.k.a I<N>-grams, I<v>-plets or I<v>-bits (for binary data); that is, as mononomes, monograms, singletons, or monobits of discrete events; or dinomes, bigrams, doublets, or so on for immediate repetitions and alternations; or as trinomes, trigrams, triplets, etc., and so on, for higher-order sequences. 

The test implemented here is an alternative to the monobit "frequency" test, and the I<chi>-square and likelihood ratio (L<G-square|Statistics::Gtest>) tests of independence. These tests generally produce values that are only asymptotically distributed as I<chi>-square; a problem solved by Good in offering a I<chi>-square informed by backward differencing. 

Take the following, for example. A sequence of heads and tails (H and T) is generated by coin-flipping. A serial test of the randomness (informativeness, predictability, instability, etc.) of this sequence might involve testing for equal representation of, say, the trinomes HTH, HTT, TTT, THT, and so on. Counting up these trinomes at overlapping points in the sequence (with a moving time window) yields a statistic--I<psi-square>--that is approximately distributed as I<chi>-square; the Kendall-Babington Smith statistic. However, these counts are not independent (given overlaps). Good's Generalized Serial Test--the default test-statistic returned by this module's L<psisq|Statistics::Sequences::Vnomes/psisq> routine--computes I<psi>-square by differencing, taking into account not only the specified B<length> of I<v> but also its value for the first two prior lengths (dinomes and mononomes). This yields the statistic I<delta-square-psi-square> (the "second backward difference" measure) that is I<exactly> distributed as I<chi>-square. (Anything less than a test for trinomes naturally has to fall back on first backward differencing or no differencing at all--which, again, is the Kendall-Babington-Smith test).

Note also that (1) Good's serial test, as here implemented, is suitable for multi-state, multinomial data--not only binary, dichotomous sequences (and to which, say, the L<runs|Statistics::Sequences::Runs> or L<joins|Statistics::Sequences::Joins> tests are limited); that (2) Good's serial test is I<not> the same as the so-called "serial test" described by Knuth (1998, Ch. 2)--which involves I<non>-overlapping pairs of events within a sequence; and that (3) Good's original paper for this test (i.e., Good, 1953) misprinted the crucial calculation--i.e., without squaring the observation less expectation difference--so that versions of this module pre-0.20 (which implemented the former) are not compatible (to say the least) with earlier versions.

=head1 METHODS

This module is a "child" of L<Statistics::Sequences|Statistics::Sequences>, and so of L<Statistics:Data|Statistics:Data>. So it offers generic methods, as follows, aside from the specific methods for describing a sequence as per Good's test.

=head2 new

 $vnomes = Statistics::Sequences::Vnomes->new();

Returns a new Vnomes object. Expects/accepts no arguments but the classname.

=head2 load

 $vnomes->load(@data); # anonymously
 $vnomes->load(\@data);
 $vnomes->load('sample1' => \@data); # labelled whatever

Loads data anonymously or by name - see L<load|Statistics::Data/load, add, unload> in L<Statistics::Data> for ways data can be loaded and retrieved (more than shown here). Every load unloads all previous loads and any additions to them.

Data for this test of sequences can be categorical or numerical, all being treated categorically. Also, the data do not have to be dichotomous (unlike in tests of L<runs|Statistics::Sequences::Runs> and L<joins|Statistics::Sequences::Joins>.

=head2 add, access, unload

See L<Statistics::Data> for these and other operations on loaded data.

=head2 observed

 $href = $vnomes->observed(length => INT > 0, circularize => BOOL); # returns keyed distribution; assumes data have already been loaded
 @ari = $vnomes->observed(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL); # returns frequencies only

Returns the frequency distribution for the I<observed> number of v-nomes of the given B<length> in a sequence. These are counted up as overlapping. Called in array context, returns an array of the counts for each possible pattern; otherwise, where these are keyed by the pattern in a hash reference. For descriptives of the observed frequencies, try calling L<Statistics::Lite|Statistics::Lite> methods with this array.

A value for B<length> greater than zero is required, and must be no more than the sample-size. So for the sequence

 1 0 0 0 1 0 0 1 0 1 1 0 

there are 12 v-nomes of length 1 (mononomes, the number of elements) from the two (0 and 1) that are possible; 11 dinomes from the four (10, 00, 00, 01) that are possible; and 10 trinomes from eight (100, 000, 001, 010, etc.) that are possible.

By default, the sequence is counted up for v-nomes as a cyclic sequence, treating the first element of the sequence as following the last one. So the count loops to the beginning of the sequence until all elements from the end are included, and instead of ending the count for trinomes in this sequence at '110', the count includes '101' and '010', increasing the observed sum of trinomes to 12. Set B<circularize> => 0 if the count is I<not> to be made cyclically.

The data to test can already have been L<load|Statistics::Sequences/load, add, unload>ed, or you send it directly keyed as B<data>.

In the code for this and/or other methods, following the work of Good (1953, 1957; Good & Gover, 1967), I<v> is used to define variables referring to the length of the subsequence (I<mono>nome, I<di>nome, I<tri>nome, etc.), I<t> defines the number of states (events, letters, etc.) in the sequence, and I<r> identifies the number of each possible subsequence of length I<v>.

=cut

sub observed {
    my $self = shift;
    my $args = ref $_[0] ? shift : {@_};
    my $data = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    my $v    = $args->{'length'} or croak __PACKAGE__,
      '::observed v-nome length needs to be defined and greater than zero';
    my $circularize =
      defined $args->{'circularize'} ? $args->{'circularize'} : 1;
    my $states_aref = _get_stateslist( $data, $args->{'states'} ); # list unique states as given or from sequence itself
    my $v_aref = _get_v_ari($v);
    my ( @data_i, %psisq_v ) = ();

    foreach my $v_i (@{$v_aref} )  { # loop through v, v-1 and v-2 lengths ...
        @data_i = @{$data}; # append first $v_i - 1 elements to end
        push( @data_i, @data_i[ 0 .. $v_i - 2 ] ) if $circularize; # e.g., if v = 3, append elements [0] & [1]; if v = 2, append only 1st value
        $psisq_v{$v_i} = _frequencies( \@data_i, $v_i, $states_aref ); # count of each sequence of length $v_i
        last if $v_i == $v;
    }
    return wantarray ? values %{ $psisq_v{$v} } : $psisq_v{$v};
}
*vnomes_observed = \&observed;
*vno             = \&observed;

=head2 expected

 $count = $vnomes->expected(length => INT > 0 > 0, circularize => BOOL); # assumes data have already been loaded
 $count = $vnomes->expected(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL, states => [0, 1]);

Returns the I<expected> number of observations for each v-nome of the given B<length> in a sequence; i.e., the mean chance expectation, assuming that each event is generated by a random uniform process. Options are as for L<observed|Statistics::Sequences::Vnomes/observed>. This expected frequency is given by:

=for html <p>&nbsp;&nbsp;<i>E[V]</i> = <i>N</i><i>t</i><sup>&ndash;<i>v</i></sup>

where I<t> is the number of possible states (alternatives; as read from the data or as explicitly given as an array in B<states>), I<v> is the v-nome length, and I<N> is the length of the sequence, less I<v> + 1 if the count is not to be circularized (Good, 1953, Eq. 12).

Another way to think of this is as the number of mononome observations in the sequence divided by the number of possible permutations of its states for the given length. So, for a sequence made up of 0s and 1s, there are four possible variations of length 2 (00, 10, 01 and 11), so that the expected frequency for each of these variations in a sequence of 20 values is  20 / 4, i.e., 5.

This is a theoretical (deductive) estimate based on the given states - not an empirical (inductive) estimate based on the given data. So all the expected frequencies for each combination are equal, unlike what you might get by way of conditional probabilities for combinations of states from the given data.

=cut

sub expected {
    my $self = shift;
    my $args = ref $_[0] ? shift : {@_};
    my $num  = _get_n( $self, $args );
    my $v    = $args->{'length'} or croak __PACKAGE__,
      '::expected v-nome length needs to be defined and greater than zero';

    # Sanitize length in proportion to size:
    croak __PACKAGE__,
"::expected Sequence length v ($v) for testing should be no more than the sample-size ($num) - 2"
      if $v >= $num - 1; # or permit $v to extend to limit of circularized data?
     #my $lim =  Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
    my $circularize =
      defined $args->{'circularize'}
      ? $args->{'circularize'}
      : 1;    # circularize by default
              # Compute expected freq of variations of $t of length $v:
    $num = _count_lim( $num, $v ) if !$circularize;
    my $p = $self->prob_r( length => $v );
    return $num * $p;
}
*vnomes_expected = \&expected;
*vne             = \&expected;

=head2 variance

 $var = $vnomes->variance(length => INT > 0, circularize => BOOL); # assumes data have already been loaded
 $var = $vnomes->variance(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL, states => [0, 1]);

Returns the variance in the expected frequency of each v-nome, as per Good (1953, Eqs. 11 and 14).

=cut

sub variance {

    #croak 'variance() method is not defined for ' . __PACKAGE__;
    my $self = shift;
    my $args = ref $_[0] ? shift : {@_};
    my $num  = _get_n( $self, $args );
    my $v    = $args->{'length'} or croak __PACKAGE__,
      '::variance v-nome length needs to be defined and greater than zero';

    # Sanitize length in proportion to size:
    croak __PACKAGE__,
"::variance Sequence length v ($v) for testing should be no more than the sample-size ($num) - 2"
      if $v >= $num - 1; # or permit $v to extend to limit of circularized data?
     #my $lim =  Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
     # Compute expected freq of variations of $t of length $v:
    my $pr = $self->prob_r( length => $v );
    my $var = $pr;
    my ( $sum_1, $sum_2, $m, $pm ) = ( 0, 0 );
    foreach $m ( 1 .. $v - 1 ) {
        $pm = $self->prob_r( length => $m );
        $sum_1 += $pm;
        $sum_2 += ( $m + $v - 1 ) * $pm;
    }
    $var += 2 * $pr * $sum_1;
    $var -= ( 2 * $v - 1 ) * $pr**2;
    my $circularize =
      defined $args->{'circularize'} ? $args->{'circularize'} : 1;
    $var +=
      $num**-1 * $pr *
      ( ( ( $v - 1 ) * ( 3 * $v - 1 ) * $pr ) - ( $v - 1 ) - 2 * $sum_2 )
      if !$circularize;
    return $var;
}

sub obsdev {
    croak 'obsdev() method is not defined for ' . __PACKAGE__;
}
*observed_deviation = \&obsdev;

=head2 stdev

 $var = $vnomes->stdev(length => INT > 0, circularize => BOOL); # assumes data have already been loaded
 $var = $vnomes->stdev(data => [1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1], length => INT > 0, circularize => BOOL, states => [0, 1]);

Returns the standard deviation in the expected frequency of each v-nome, as per Good (1953, Eqs. 11 and 14).

=cut

sub stdev {
    return sqrt( variance(@_) );
}
*standard_deviation = \&stdev;

=head2 psisq

 $psisq = $vnomes->psisq(length => INT > 0, delta => '2|1|0', circularize => '1|0', states => [qw/A C G T/]);
 ($psisq, $df, $p_value) = $vnomes->psisq(length => INT > 0, delta => '2|1|0', circularize => '1|0', states => [qw/A C G T/]);

Performs Good's Generalized Serial Test (by default), of v-nomes on the given or named distribution, yielding a I<psi-square> statistic. Returns the statistic itself, and, if called in array context, also the degrees of freedom. The option B<delta> specifies backward differencing.

=over 4

=item delta =E<gt> 0

The raw psi-square value for sub-sequences of length I<v> (Kendall-Babington Smith statistic), i.e., without backward differencing.

=for html <p>&nbsp;&nbsp;&nbsp; &Psi;&sup2;<sub>v</sub> = &sum;{<i>r</i>} ( <i>n<sub>r</sub></i>&ndash; (<i>N</i> &ndash; <i>v</i> + 1) / <i>t<sup>v</sup></i> )&sup2; ) / ( (<i>N</i> &ndash; <i>v</i> + 1) / <i>t<sup>v</sup></i> )</p>

for uncircularized sequences (Good, 1953, Eq. 1), and

=for html <p>&nbsp;&nbsp;&nbsp; <span style="text-decoration:overline;">&Psi;</span>&sup2;<sub>v</sub> = &sum;{<i>r</i>} ( <i><span style="text-decoration:overline;">n</style><sub>r</sub></i>&ndash; <i>Nt<sup>&ndash;v</sup></i> ) / <i>Nt<sup>&ndash;v</sup></i></p>

for circularized sequences (Good, 1953, Eq. 2), where an overline indicates circularization, and where I<N> is the length of the sequence, I<v> is the v-nome length, I<t> is the number of unique states that each element of the sequence can take, and I<r> is the number of variations of length I<v> for the given number of unique states.

This statistic is only asymptotically distributed as I<chi>-square if I<length> (I<v>) = 1, and is therefore not used as the default.

=item delta =E<gt> 1

The "I<first> backward differences" of psi-square, which is the difference between the psi-square values for sub-sequences of length I<v> and length I<v> - 1.

=for html <p>&nbsp;&nbsp;&nbsp; &Delta;&Psi;&sup2;<sub>v</sub> = &Psi;&sup2;<sub>v</sub> &ndash; &Psi;&sup2;<sub>v&ndash;1</sub> (<i>v</i> &ge; 1)</p>

While it is I<chi>-square distributed, counts of first-differences are not statistically independent (Good, 1953; Good & Gover, 1967), and "the sequence of second differences forms a much better set of statistics for testing the hypothesis of flat-randomness" (Good & Gover, 1967, p. 104).

=item delta =E<gt> 2 (or undef)

This method returns by default (without specifying a value for B<delta>), or if B<delta> is not 1 or 0, the "second backward differences" psi-square (I<delta^2-psi^2>). This incorporates psi-square values for backwardly adjacent values of I<length> (I<v>), i.e., for sub-sequences of length I<v>, I<v> - 1, and I<v> - 2.

=for html <p>&nbsp;&nbsp;&nbsp; &Delta;&sup2;&Psi;&sup2;<sub>v</sub> = &Psi;&sup2;<sub>v</sub> &ndash; 2&Psi;&sup2;<sub>v&ndash;1</sub> &ndash; &Psi;&sup2;<sub>v&ndash;2</sub> (<i>v</i> &ge; 2)</p>

It is not only asymptotically I<chi>-square distributed, but uses statistically independent counts of all the possible variations of sequences of the given B<length> (Good, 1953).

=back

See also Good's algorithm in individual papers describing application of the Serial Test (e.g., Davis & Akers, 1974).

=cut

sub psisq {
    my $self = shift;
    my $args = ref $_[0] ? shift : {@_};
    my $data = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    my $num  = _get_n( $self, $args );
    my $v    = $args->{'length'} or croak __PACKAGE__,
      '::psisq v-nome length needs to be defined and greater than zero';

    # Sanitize length in proportion to size:
    croak __PACKAGE__,
"::psisq Sequence 'length' ($v) for testing should be no more than the length of the sequence ($num) minus 2"
      if $v >= $num - 1;

#my $lim =  Math::Cephes::floor(Math::Cephes::log2($num)) - 2; # re NIST recommmendation that $v is too small for $num and $t if $v >= $lim
    my $circularize =
      defined $args->{'circularize'} ? $args->{'circularize'} : 1;
    my $delta = defined $args->{'delta'} ? $args->{'delta'} : 2;
    my $t = _get_t( $self, $args );
    my $states_aref = _get_stateslist( $data, $args->{'states'} )
      ;    # Get list of unique states as given or from sequence itself
    my $v_aref = _get_v_ari($v);

   # While looping through tests of v, v-1 and v-2 (= $v_i) sequence lengths ...
    my ( $v_i, $r_freq, @data_i, %psisq_v ) = ();
    foreach $v_i (@$v_aref) {

# counting vnome frequencies: handle circularization by firstly appending elements from the start to the sequence's end:
        if ($circularize) {
            @data_i = @{$data};    # copy the original sequence
            push( @data_i, @data_i[ 0 .. $v_i - 2 ] )
              ; # e.g., if v = 3, append elements [0] & [1]; if v = 2, append only 1st value
            $r_freq = _frequencies( \@data_i, $v_i, $states_aref )
              ;    # frequencies of each sequence of length $v_i
            $psisq_v{$v_i} = _psisq_circ( scalar @data_i, $t, $v_i, [values %{$r_freq}] );
        }
        else {
            $r_freq = _frequencies( $data, $v_i, $states_aref );
            $psisq_v{$v_i} = _psisq_uncirc( scalar @$data, $t, $v_i, [values %{$r_freq}] );
        }
    }
    my ( $psisq, $df ) = _sel_psisq_and_df( $v, $t, \%psisq_v, $delta );
    if (wantarray) {
        return ( $psisq, $df, Math::Cephes::igamc( $df / 2, $psisq / 2 ) );
    }
    else {
        return $psisq;
    }
}

=head2 p_value

 $p = $vnomes->p_value(length => INT > 0); # using loaded data and default args
 $p = $vnomes->p_value(length => INT > 0, data => [1, 0, 1, 1, 0], exact => 1); #  using given data (by-passing load and access)
 $p = $vnomes->p_value(length => INT > 0, trials => 20, observed => 10); # without using data

Returns probability of obtaining the psisq value for data already L<load|Statistics::Sequences/load, add, unload>ed, or directly keyed as B<data>. The I<p>-value is read off the complemented chi-square distribution (incomplete gamma integral) using L<Math::Cephes|Math::Cephes> C<igamc>.

=cut

sub p_value {
    my $self = shift;
    my $args = ref $_[0] ? shift : {@_};
    my ( $psisq, $df ) = $self->psisq($args);
    return Math::Cephes::igamc( $df / 2, $psisq / 2 );
}
*test        = \&p_value;
*vnomes_test = \&p_value;
*vnt         = \&p_value;

=head2 dump

 $vnomes->dump(length => 3, values => {psisq => 1, p_value => 1}, format => 'table|labline|csv', flag => 1, precision_s => 3, precision_p => 7, verbose => 1);

Print Vnome-test results to STDOUT. See L<dump|Statistics::Sequences/dump> in the Statistics::Sequences manpage for details. If B<verbose> => 1, then you get (1) the actual test-statistic depending on the value of B<delta> tested (I<delta^2-psi^2> for the second difference measure (default), I<delta-psi^2> for the first difference measure, and I<psi2> for the raw measure), followed by degrees-of-freedom in parentheses; and (2) a warning, if relevant, that your B<length> value might be too large with respect to the sample size (see NIST reference, above, in discussing B<length>). If B<text> => 1, you just get the average observed and expected frequencies for each v-nome, the I<Z>-value, and its associated I<p>-value.

=cut

sub dump {
    my $self = shift;
    my $args = ref $_[0] ? $_[0] : {@_};
    $args->{'stat'} = 'vnomes';
    $args->{'stat'} .= " ($args->{'length'})" if defined $args->{'length'};
    $self->SUPER::dump($args);
    if ( $args->{'verbose'} && $args->{'format'} eq 'table' ) {
        if ( $args->{'values'}->{'psisq'} ) {
            my $delta = defined $args->{'delta'} ? $args->{'delta'} : 2;
            my ( $psisq, $df ) = $self->psisq($args);
            my $df_str = 'degree';
            $df_str .= 's' if $df != 1;
            my ($statname, $diff_type) = ();
            if ( $delta == 0 ) {    # Raw psisq:
                $statname = 'the Kendall-Babington Smith statistic';
                $diff_type = 'without backward differencing';
            }
            elsif ( $delta == 1 ) {    # First backward difference:
                $statname = 'Good\'s delta-psi^2';
                $diff_type = 'with first backward differences';
            }
            else {
                $statname = 'Good\'s delta^2-psi^2';
                $diff_type = 'with second backward differences';
            }
            printf("psisq is %s, calculated with %s, and has %s %s of freedom.\n", $statname, $diff_type, $df, $df_str);
            
        }
    }
    return $self;
}

=head2 nnomes

 $r = $vnomes->nnomes(length => INT > 0, data => AREF); # supply the data directly, and assume all possible states are in the given sequence
 $r = $vnomes->nnomes(length => INT > 0); # assuming data have been "loaded" and all possible states are in the loaded sequence
 $r = $vnomes->nnomes(length => INT > 0, states => AREF); # ... or specify the states in case not all in the sequence

Returns the number of possible subsequences of the given length (I<v>) for the given number of states (I<t>). This is the quantity denoted as I<r> in Good's (1953, 1957) papers; i.e.,

=for html <p>&nbsp;&nbsp;<i>r(v)</i> = <i>t</i><sup><i>v</i></sup>

The method needs to have, of course, a sequence a test. This can be directly given to the method (as the referenced array for the named argument B<data>) or it can be pre-loaded, as by using $vnomes->load(AREF).

Then, the method needs to know two things: the "v" value itself, i.e., the length of the possible subsequences to test (I<mono>nomes, I<di>nomes, I<tri>nomes, etc.), and the number of states (events, letters, etc.) that the process generating the data could take (from 1 to whatever). The "v" value is I<always> required to be specified by the named argument B<length>, and it should be a positive integer value no greater than the length of the (loaded/given) sequence. The states can be directly given in the named argument B<states> (recommended), or from the states that "empirically" exist in the loaded/given B<data>. 

=cut

sub nnomes {
    my $self = shift;
    my $args = ref $_[0] ? $_[0] : {@_};
    my $t    = _get_t( $self, $args );
    my $v    = $args->{'length'} or croak __PACKAGE__,
      '::nnomes v-nome length needs to be defined and greater than zero';
    return $t**$v;
}

=head2 prob_r

 $Pr = $vnomes->prob_r(length => $v); # length is 1 (mononomes) by default

Returns the probability of the occurrence of any of the individual elements ("digits") in the sequence (I<v> = 1), or of the given B<length>, assuming they are equally likely and independent.

=for html <p>&nbsp;&nbsp;<i>P<sub>r</sub></i> = <i>t</i><sup><i>&ndash;v</i></sup>

=cut

sub prob_r {
    my $self = shift;
    my $args = ref $_[0] ? $_[0] : {@_};
    my $t    = _get_t( $self, $args );
    my $v    = $args->{'length'} || 1;
    return $t**( -1 * $v );
}

=head1 OPTIONS

Options common to the above stats methods.

=head2 length

This is currently a I<required> "option", giving the length of the I<v>-nome of interest, i.e., the value of I<v> - an integer greater than or equal to 1, and smaller than than the sample-size.

What is a meaningful maximal value of B<length>? As a I<chi>-square test, it is conventionally required that the expected frequency is at least 5 for each v-nome (Knuth, 1988). This can be judged to be too conservative (Delucchi, 1993). The NIST documentation on the serial test (Rukhin et al., 2010) recommends that length should be less than the floored value of log2 of the sample-size, minus 2. No tests are here made of these recommendations.

=head2 circularize

By default, L<observed|Statistics::Sequences::Vnomes/observed, vnomes_observed, vno> and L<expected|Statistics::Sequences::Vnomes/expected, vnomes_expected, vne> counts, and the value of L<psisq|Statistics::Sequences::Vnomes/psisq>, are made by treating the sequence as a cyclic one, where the first element of the sequence follows the last one. This affects (and simplifies) the calculation of the expected frequency of each v-nome, and so the value of each psi-square. Also, circularizing ensures that the expected frequencies are accurate; otherwise, they might only be approximate. As Good and Gover (1967) state, "It is convenient to circularize in order to get exact checks of the arithmetic and also in order to simplify some of the theoretical formulae" (p. 103). These methods, however, can also treat the sequence non-cyclically by calling them with B<circularize> => 0.

=head2 states

Optionally send a referenced array listing the unique states (or 'events', 'letters') in the population from which the sequence was sampled, e.g., B<states> => [qw/A C G T/]. This is useful if the sequence itself might not include all the possible states. If this is not specified, the states are identified from the sequence itself. If giving a list of states, a check in each test is made to ensure that the sequence contains I<only> those elements in the list.

=cut

# PRIVATMETHODEN
sub _count_lim
{ # N - v + 1; if N = 5 and v = 3, starting the count-up for v-nomes must stop from the first 3 elements of the data
    return $_[0] - $_[1] + 1;
}

sub _get_n {    # the size of the sequence
    my ( $self, $args ) = @_;
    my $data = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
    return scalar @$data;
}

sub _get_t
{ # the number of unique states; e.g., 2 for a binary sequences; 10 for a typical random digit sequence (0 .. 9)
    my ( $self, $args ) = @_;
    my $t;
    if ( ref $args->{'states'} and ref $args->{'states'} eq 'ARRAY' ) {
        $t = scalar( @{ $args->{'states'} } );
    }
    else {
        my $data = ref $args->{'data'} ? $args->{'data'} : $self->access($args);
        my %hash = map { $_, 1 } @{$data};
        $t = scalar keys %hash;
    }
    return $t;
}

sub _get_stateslist {
    my ( $data, $states ) = @_;
    if ( !ref $states ) {    # Get states from the data themselves:
        my %hash = map { $_, 1 } @{$data};
        $states = [ keys %hash ];
    }
    else {    # Ensure that the data only contain states in the given list:
      DATA:
        foreach my $g ( @{$data} ) {
            foreach my $h ( @{$states} ) {
                next DATA if $h eq $g;
            }
            croak __PACKAGE__,
" The element $g in the data is not represented in the given states";
        }
    }

#croak __PACKAGE__, ' At least two different values must be in the sequence to test its sub-sequences' if $t <= 1;
    return $states;
}

sub _frequencies {
    my ( $data_i, $v_i, $states_aref ) = @_
      ; # list all possible combinations of states at the current length ($v_i):
        # Count up the frequency of each variation in the data:
    my $num = scalar( @{$data_i} );
    my ( $i, $probe_str, $test_str, %r_freq ) = ();
    foreach ( variations_with_repetition( $states_aref, $v_i ) ) {
        $probe_str = join '', @{$_};
        $r_freq{$probe_str} = 0;
        for ( $i = 0 ; $i < _count_lim( $num, $v_i ) ; $i++ ) {
            $test_str = join '', @{$data_i}[ $i .. ( $i + $v_i - 1 ) ];
            $r_freq{$probe_str}++ if $probe_str eq $test_str;
        }
    } #    print "FREQ:\n"; while (my($key, $val) = each %freq) { print "\t$key = $val\n"; } print "\n";
    return \%r_freq;
}

sub _sel_psisq_and_df {    # get psisq and its df
    my ( $v, $t, $psisq_v, $delta ) = @_;
    my ( $psisq, $df ) = ();
    if ( $v == 1 )
    { # psisq is asymptotically distributed chisq, can use psisq for chisq distribution:
        $psisq = $psisq_v->{1};
        $df    = $t - 1;
    }
    else {
        if ( $delta == 0 ) {    # Raw psisq:
            $psisq = $psisq_v->{$v};
            $df = $t**$v - 1;    # Good (1957, Eq. 6) - if circularized or not
        }
        elsif ( $delta == 1 ) {    # First backward difference:
            $psisq =
              $psisq_v->{$v} - ( $v - 1 <= 0 ? 0 : $psisq_v->{ $v - 1 } );
            $df = $t**$v - $t**( $v - 1 );
        }
        else {    # $delta == 2 # Second backward difference (default):
            $psisq =
              $psisq_v->{$v} -
              ( 2 * ( $v - 1 <= 0 ? 0 : $psisq_v->{ $v - 1 } ) ) +
              ( $v - 2 <= 0 ? 0 : $psisq_v->{ $v - 2 } );
            $df = ( $t**( $v - 2 ) ) * ( $t - 1 )**2;
        }
    }
    return ( $psisq, $df );
}

sub _get_v_ari {
    my $v = shift
      ; # Init a hash to keep the psi-square values for the v, v-1, and v-2 ( = $v_i) sequence lengths, where relevant:
    my @ari = ();
    foreach ( 0 .. 2 ) {
        $v > $_ ? push @ari, $v - $_ : last;
    } # print "v ari = ", join(' ', @ari), "\n";  #push @ari, $v - 1 if $v >= 2; #push @ari, $v - 2 if $v >= 3;
    return \@ari;
}

# Compute psi^2 for uncircularized sequence from freq of each variation of length $v for given states (Good, 1953, Eq. 1):
sub _psisq_uncirc {
    my ( $n, $t, $v, $obs ) = @_;
    my $exp = ( $n - $v + 1 ) / $t**$v;
    return _sum_sq($obs, $exp) / $exp;
}

# Compute psi^2 for circularized sequence from freq of each variation of length $v for given states
# - Good (1953, Eq. 2) is in error by not squaring the (obs - exp) difference in the summation:
sub _psisq_circ {
    my ( $n, $t, $v, $obs ) = @_; 
    my $exp = $n * $t**( -1 * $v );
    return _sum_sq($obs, $exp) / $exp; # same as ($t**$v * $sum )/ $n;
}

sub _sum_sq {
    my ($obs, $exp) = @_;
    return sum( map { ( $_ - $exp )**2 } @{$obs} ); # Good & Gover (1967, p. 103)
}

__END__

=head1 EXAMPLE

=head2 Seating at the diner

This is the data from Swed and Eisenhart (1943) also given as an example for the L<Runs test|Statistics::Sequences::Runs/EXAMPLE> and L<Turns test|Statistics::Sequences::Turns/EXAMPLE>. It lists the occupied (O) and empty (E) seats in a row at a lunch counter.
Have people taken up their seats on a random basis - or do they show some social phobia (more sparesly seated than "chance"), or are they trying to pick up (more compactly seated than "chance")? What does Good's test of Vnomes reveal?

 use Statistics::Sequences::Vnomes;
 my $vnomes = Statistics::Sequences::Vnomes->new();
 my @seating = (qw/E O E E O E E E O E E E O E O E/); # data from Swed & Eisenhart (1943)
 $vnomes->load(\@seating); # as per Statistics::Data
 $vnomes->dump_vals(delim => q{,}); # via Statistics::Data - prints E,O,E,E,O,E,E,E,O,E,E,E,O,E,O,E
 $vnomes->dump(
    length => 3,
    values => { psisq => 1, p_value => 1 },
    format => 'labline',
    flag => 1,
    precision_s => 3,
    precision_p => 3,
    circularize => 0,
    verbose => 1,
 );

This prints: 
 
 Vnomes (3): p_value = 0.044*, psisq = 6.250

That is, the observed frequency of each possible trio of seating arrangements (the trinomes OOO, OOE, OEE, EEE, etc.) differed significantly from that expected. Look up the observed frequencies for each possible trinome to see if this is because there are more empty or occupied neighbouring seats ("phobia" or "philia"):

 $vnomes->dump(length => 3, values => {observed => 1}, format => 'labline');

This prints:

 observed = ('EEO' = 4,'EOO' = 0,'OEO' = 1,'OOO' = 0,'EOE' = 5,'EEE' = 2,'OEE' = 4,'OOE' = 0)

As the chance-expected frequency is 2.5 (from the L<expected|Statistics::Sequences::Vnomes/expected> method), there are clearly more than expected trinomes involving empty seats than occupied seats - suggesting a non-random factor like social phobia (or body odour?) is at work in sequencing people's seating here. Noting that the sequencing isn't significant for dinomes (with B<length> => 2) might also tell us something about what's going on. What happens for v-nomes of 4 or more in length? Maybe the L<runs|Statistics::Sequences::Runs> or L<pot|Statistics::Sequences::Pot> test might be a better summary of what's going on.

=head1 DIAGNOSTICS

=over 2

=item v-nome length needs to be defined and greater than zero

C<croak>ed from various methods (including observed, expected, variance and psisq) if the argument B<length> is not defined or if it defined but equals zero.

=back

=head1 REFERENCES

Davis, J. W., & Akers, C. (1974). Randomization and tests for randomness. I<Journal of Parapsychology>, I<38>, 393-407.

Delucchi, K. L. (1993). The use and misuse of chi-square: Lewis and Burke revisited. I<Psychological Bulletin>, I<94>, 166-176. doi: L<10.1037/0033-2909.94.1.166|http://dx.doi.org/10.1037/0033-2909.94.1.166>

Gatlin, L. L. (1979). A new measure of bias in finite sequences with applications to ESP data. I<Journal of the American Society for Psychical Research>, I<73>, 29-43. (Used in reference tests.)

Good, I. J. (1953). The serial test for sampling numbers and other tests for randomness. I<Mathematical Proceedings of the Cambridge Philosophical Society>, I<49>, 276-284.

Good, I. J. (1957). On the serial test for random sequences. I<Annals of Mathematical Statistics>, I<28>, 262-264. doi: L<10.1214/aoms/1177707053|http://dx.doi.org/10.1214/aoms/1177707053>

Good, I. J., & Gover, T. N. (1967). The generalized serial test and the binary expansion of [square-root]2. I<Journal of the Royal Statistical Society A>, I<130>, 102-107.

Kendall, M. G., & Babington Smith, B. (1938). Randomness and random sampling numbers. I<Journal of the Royal Statistical Society>, I<101>, 147-166.

Knuth, D. E. (1998). I<The art of computer programming> (3rd ed., Vol. 2 Seminumerical algorithms). Reading, MA, US: Addison-Wesley.

Rukhin, A., Soto, J., Nechvatal, J., Smid, M., Barker, E., Leigh, S., et al. (2010). A statistical test suite for random and pseudorandom number generators for cryptographic applications. Retrieved September 4 2010, from L<http://csrc.nist.gov/groups/ST/toolkit/rng/documents/SP800-22b.pdf>, and July 17, 2013, from L<http://csrc.nist.gov/publications/nistpubs/800-22-rev1a/SP800-22rev1a.pdf|SP800-22rev1a.pdf> (revised).

=head1 SEE ALSO (RELATION to NGRAMS)

I<V>-nomes are much the same as Ngrams (or N-grams), except that Ngram analysis is usually defined for whole strings, i.e., where all elements in the string are "chunked" as a single unit of analysis, and then often without respect to how they occur within a broader sequence of events.

Conversely, I<v>-nomes are based on respecting the frequency of each unit of analysis (each element in the "chunk"), and these units are analysed with respect to the sequence from which they're derived, as an ordered list of events.

There are several Perl modules for working with I<N>-gram strings, and others that work with I<v>-nome lists, for one or more orders of sequence (i.e., I<N> or I<v> lengths). For example ... 

L<Algorithm::NGram|Algorithm::NGram> works with space-delimited strings to produce a I<n>-gram frequency table, among other things.

L<Lingua::EN::Ngram|Lingua::EN::Ngram> extracts I<n>-grams from texts, and lists them according to frequency and/or I<T>-Score.

L<Text::Ngram|Text::Ngram>

L<Text::Ngrams|Text::Ngrams>

L<Text::NSP|Text::NSP>

L<Statistics::Frequency|Statistics::Frequency> provides frequencies for mononomes (only) within a list.

L<Statistics::Gtest|Statistics::Gtest> calculates the likelihood ratio (I<G>-square) statistic (which is I<asymptotically> distributed chi-square, whereas Good's delta^2psi^2 is calculated (by circularization and differencing) to be precisely chi^2 distributed).

L<Statistics::Lite|Statistics::Lite> provides frequencies for mononomes (only) via its C<frequencies> method.

L<Statistics::Sequences|Statistics::Sequences> has as its sub-modules other tests of sequences--e.g., L<Wald's Runs test|Statistics::Sequences::Runs>, L<Schmidt's Pot test|Statistics::Sequences::Pot> (of clustering, bunching of events in a sequence), L<Kendall's Turns test|Statistics::Sequences::Turns>--and supports sharing sequences as data between these tests.

=head1 AUTHOR/LICENSE

=over 4

=item Copyright (c) 2006-2016 Roderick Garton

rgarton AT cpan DOT org

This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see L<http://www.perl.com/perl/misc/Artistic.html>).

=back

=head1 DISCLAIMER

To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.

=cut