Statistics::ANOVA - Parametric and nonparametric 1-way analyses of variance for means-comparison and clustering per differences/trends over independent or repeated measures of variables or levels
This is documentation for Version 0.13 of Statistics::ANOVA.
use Statistics::ANOVA 0.13; my $aov = Statistics::ANOVA->new(); # Some data: my @gp1 = (qw/8 7 11 14 9/); my @gp2 = (qw/11 9 8 11 13/); my $res; # each anova method returns hash of F-value, p-value, ss_b, ss_w, etc., where relevant # Load the data: $aov->load_data({1 => \@gp1, 2 => \@gp2}); # NB: hashref # or $aov->load_data([ [1, \@gp1], [2, \@gp2] ]); # or $aov->load_data([ [1, @gp1], [2, @gp2] ]); my @gp3 = (qw/7 13 12 8 10/); $aov->add_data(3 => \@gp3); # Test equality of variances before omnibus comparison: %res = $aov->obrien()->dump(title => 'O\'Brien\'s test of equality of variances'); %res = $aov->levene()->dump(title => 'Levene\'s test of equality of variances'); # 1.10 Independent nominal variables ANOVA - parametric testing: %res = $aov->anova(independent => 1, parametric => 1)->dump(title => 'Indep. variables parametric ANOVA', eta_squared => 1, omega_squared => 1); # 1.11 Independent nominal variables (groups) ANOVA - NON-parametric: %res = $aov->anova(independent => 1, parametric => 0)->dump(title => 'Kruskal-Wallis test'); # or if independent AND ordered variables (levels): test linear/non-linear trend: # 1.20 Independent ordinal variables ANOVA - parametric testing: %res = $aov->anova(independent => 1, parametric => 1, ordinal => 1)->dump(title => 'Indep. variables parametric ANOVA: Linear trend'); %res = $aov->anova(independent => 1, parametric => 1, ordinal => 2)->dump(title => 'Indep. variables parametric ANOVA: Non-linear trend'); # 1.21 Independent ordinal variables ANOVA - NONparametric testing: %res = $aov->anova(independent => 1, parametric => 0, ordinal => 1)->dump(title => 'Jonckheere-Terpstra test'); # If they are repeated measures: # 2.10 Dependent nominal variables ANOVA - parametric testing: %res = $aov->anova(independent => 0, parametric => 1)->dump(title => 'Dependent variables ANOVA'); # 2.11 Dependent nominal variables ANOVA - NONparametric testing: %res = $aov->anova(independent => 0, parametric => 0, f_equiv => 0)->dump(title => 'Friedman test'); # or if repeated AND ordinal measures: # 2.20 Dependent ordinal variables ANOVA - parametric testing: NOT IMPLEMENTED #$aov->anova(independent => 0, parametric => 1)->dump(title => ''); # 2.21 Dependent ordinal variables test - NONparametric testing: %res = $aov->anova(independent => 0, parametric => 0, ordinal => 1, f_equiv => 0)->dump(title => 'Page test'); # Get pairwise comparisons (nominality of the factor assumed): $aov->compare(independent => 1, parametric => 1, flag => 1, alpha => .05, dump => 1); # Indep. obs. F- (or t-)tests $aov->compare(independent => 0, parametric => 1, flag => 1, alpha => .05, dump => 1); # Paired obs. F (or t-)tests $aov->compare(independent => 1, parametric => 0, flag => 1, alpha => .05, dump => 1); # Wilcoxon (between-variables) sum-of-ranks (Dwass Procedure) $aov->compare(independent => 0, parametric => 0, flag => 1, alpha => .05, dump => 1); # Friedman-type (within-variables) sum-of-ranks print $aov->table(precision_p => 3, precision_s => 3); $aov->unload('g3'); # back to 2 datasets (g1 and g2)
"If your predictor variables are categorical (ordered or unordered) and your response variables are continuous, your design is called an ANOVA (for ANalysis Of VAriance"—Gotelli & Ellison (2004, p. 171).
With that idea in mind, in order to actually perform an ANOVA, you really only need to define an analysis as based on (1) ordered or unordered predictors, (2) independent or repeated measurement of their effects on response variables (i.e., from different or the same data-sources), and then (3) whether parametric or nonparametric assumptions can be made about how the factors impact on the response variables. This module facilitates selecting the right type of ANOVA, by a mere true/false setting of three arguments based on the three latter concepts-- attempting to meet just about every possible combination of these analysis specs. More specifically ...
By setting the Boolean (0, 1) value of three parameters (independent, parametric and ordinal), this module returns and memorizes results from oneway parametric or non-parametric analyses-of-variance (ANOVAs) for either nominal groups or ordinal levels of an independent factor, and for either independent or dependent (repeated measures) observations within each group/level of that factor.
Parametric tests are of the traditional Fisher-type. Non-parametric tests comprise the Kruskal-Wallis, Jonckheere-Terpstra, Friedman and Page tests; all rank-based tests (with default accounting for ties in ranks).
Other, related routines are offered: for parametrically testing equality of variances (O'Brien and Levene tests); for estimating proportion of variance accounted for (eta-squared) and effect-size (omega-squared); and for making some rudimentary a priori pairwise comparisons by independent/dependent t-tests.
Reliability of the implemented methods has been tested against at least two different published exemplars of the methods; and by comparing output with one or another open-source or commercial statistics package. That this module's stats and tests match these examplars is tested during installation (at least via CPAN, or when making a "manual" installation).
The API has been stable over all versions, but, ahead of versioning to 1.0, it might well be expected to change. News of method unreliabilities and/or limitations are welcome. Ones from Cathal Seoghie re version 0.01, and Patrick H. Degnan re version 0.07, have already helped this module's development.
Object-oriented. No subs are explicitly exported, no arguments are set for cross-method application. The class-object holds the myriad of statistics produced by the last ANOVA run.
$aov = Statistics::ANOVA->new()
Create a new Statistics::ANOVA object for accessing the subs.
$aov->load('aname', @data1) $aov->load('aname', \@data1) $aov->load(['aname', @data1], ['another_name', @data2]) $aov->load(['aname', \@data1], ['another_name', \@data2]) $aov->load({'aname' => \@data1, 'another_name' => \@data2})
Alias: load_data
Accepts data for analysis in any of the above-shown forms, but always with the requirement that:
The reason for these options is that there are as many as it is practically and intuitively possible to make in Perl's Statistics modules that it's a cost and pain to traverse them; so multiple structures are permitted.
provide name => value
pairs of data keyed by a stringy name, each with referenced array of values.
a reference to an array of referenced arrays, where each of the latter arrays consists of a sample name occupying the first index, and then its sample data, as an array or yet another referenced array; e.g., [ ['group A', 20, 22, 18], ['group B', 18, 20, 16] ]
a hash reference of named array references of data. This is the preferred method - the one that is first checked in the elongated if
clause that parses all this variety.
The data are loaded into the class object by name, within a hash named data
, as flat arrays. So it's all up to you then what statistics and how follow from using this package.
The names of the data are up to you, the user; whatever can be set as the key in a hash. But if you intend to do trend analysis, you should, as a rule, give only numerical names to your groups/levels, defining their ordinality (with respect to the limitations on algorithms presently offered for trend analysis).
Each call unloads any previous loads.
Returns the Statistics::ANOVA object - nothing but its blessed self.
$aov->add('another_name', \@data2); $aov->add(['another_name', \@data2]); $aov->add({'another_name' => \@data2});
Same as load except that any previous loads are not unloaded. Again, the hash-referenced list is given preferential treatment.
$aov->unload() # bye to everything $aov->unload('g1') # so long data named "g1"
Alias: delete_data
With nil or no known arguments, empties all cached data and calculations upon them, ensuring these will not be used for testing. This will be automatically called with each new load, but, to take care of any development, it could be good practice to call it yourself whenever switching from one dataset for testing to another.
Alternatively, supply one or more names of already loaded data to clobber just them out of existence; preserving any other loads.
Any data-points/observations sent to load or add that are undefined or not-a-number are marked for purging before being anova-tested or tested pairwise. The data arrays accessed as above, will still show the original values. When, however, you call one of the anova or pairwise methods, the data must and will be purged of these invalid values before testing.
When the independent
parameter equals 1 when sent to anova or compare, each list is simply purged of any undefined or invalid values. This also occurs for the equality of variances tests.
When independent
parameter equals 0 when sent to anova and compare, each list is purged of any value at all indices that, in any list, contain invalid values. So if two lists are (1, 4, 2) and (2, ' ', 3), the lists will have to become (1, 2) and (2, 3) to account for the bung value in the second list, and to keep all the observations appropriately paired.
The number of indices that were subject to purging is cached thus: $aov->{'purged'}. The dump method can also reveal this value.
The looks_like_number
method in Scalar::Util is used for checking validity of values. (Although Params::Classify::is_number might be stricter, looks_like_number benchmarks at least a few thousand %s faster.)
One generic method anova (a.k.a. aov, test) is used to access the possible combitinations of parametric or nonparametric tests, for independent or dependent/related observations, and for categorical or ordinal analysis. Accessing the different statistical tests depends on setting three parameters on a true/false basis: independent, parametric and ordinal.
The attribute independent
refers to whether or not each level of the variable was yielded by independent or related sources of data; e.g., If the same people provided you with responses under the various factors, or if the factors were tested by different participants apiece; when respectively independent
=> 0 or 1.
The following describes the particular tests you get upon each possible combination of these alternatives.
%res = $aov->anova(independent => 1, parametric => 1, ordinal => 0)
Offers the standard Fisher-type ANOVA for independent measures of the different levels of a factor.
$aov->anova(independent => 1, parametric => 1, ordinal => 1) # test linear trend $aov->anova(independent => 1, parametric => 1, ordinal => -1) # test non-linear trend
If the independent/treatment/between groups variable is actually measured on a continuous scale/is a quantitative factor, assess their linear trend: Instead of asking "How sure can we be that the means-per-group are equal?", ask "How sure can we be that there is a departure from flatness of the means-per-level?".
The essential difference is that in place of the the between (treatment) mean sum-of-squares in the numerator is the linear sum of squares in which each "group" mean is weighted by the deviation of the level-value (the name of the "group") from the mean of the levels (and divided by the sum of the squares of these deviations).
If the number of observations per level is unequal, the module applies the simple unweighted approach. This is recommended as a general rule by Maxwell and Delaney (1990), given that the weighted approach might erroneously suggest a linear trend (unequal means) when, in fact, the trend is curvilinear (and by which the means balance out to equality); unless "there are strong theoretical reasons to believe that the only true population trend is linear" (p. 234). (But then you might be theoretically open to either. While remaining as the default, a future option might access the hierarchical, weighted approach.)
To test if there is the possibility of a non-linear trend, give the value of -1 to the ordinal
argument.
Note that the contrast coefficients are calculated directly from the values of the independent variable, rather than using a look-up table. This respects the actual distance between values, but requires that the names of the sample data, of the groups (or levels), are numerical names when loaded - i.e., such that the data-keys can be summed and averaged.
%res = $aov->anova(independent => 1, parametric => 0, ordinal => 0)
Performs a one-way independent groups ANOVA using the non-parametric Kruskal-Wallis sum-of-ranks method for a single factor with 2 or more levels. By default, instead of an F-value, there is an H-value. The p-value is read off the chi-square distribution. The test is generally considered to be unreliable if there are no more than 3 groups and all groups comprise 5 or fewer observations. An estimate of F can, alternatively be returned, if the optional argument f_equiv => 1.
By default, this method accounts for and corrects for ties in ranks across the levels, but if correct_ties
= 0, H is uncorrected. The correction involves giving each tied score the mean of the ranks for which it is tied (see Siegal, 1956, p. 188ff).
$aov->anova(independent => 1, parametric => 0, ordinal => 1)
Performs the Jonckheere-Terpstra nonparametric test for independent but ordered levels. The method returns:
$res{'j_value'} : the observed value of J $res{'j_exp'} : the expected value of J $res{'j_var'} : the variance of J $res{'z_value'} : the normalized value of J $res{'p_value'} : the one-tailed probability of observing a value as great as or greater than z_value.
%res = $aov->anova(independent => 0, parametric => 1, ordinal => 0, multivariate => 0|1)
Performs parametric repeated measures analysis of variance. This uses the traditional univariate, or "mixed-model," approach, with sphericity assumed (i.e., equal variances of all factor differences, within each factor and all possible pairs of factors). The assumption is met when there are only two levels of the repeated measures factor; but unequal variances might be a problem when there are more than two levels. No methods are presently applied to account for the possibility of non-sphericity.
[Not implemented.]
%res = $aov->anova(independent => 0, parametric => 0, ordinal => 0)
Performs the Friedman nonparametric analysis of variance - for two or more dependent (matched, related) groups. The statistical attributes now within the class object (see anova) pertain to this test, e.g., $aov->{'chi_value'} gives the chi-square statistic from the Friedman test; and $aov->{'p_value'} gives the associated p-value (area under the right-side, upper tail of the distribution). If f_equiv => 1, then, instead of the chi-value, and p-value read off the chi-square distribution, you get the F-value equivalent, with the p-value read off the F-distribution.
%res = $aov->anova(independent => 0, parametric => 0, ordinal => 1, tails => 1|2)
This implements the Page (1963) analysis of variance by ranks for repeated measures of ordinally scaled variables; so requires - numerically named variables. The statistical attributes now within the class object (see anova) pertain to this test, and are chiefly:
$res{'l_value'} : the observed test statistic (sum of ordered and weighted ranks) $res{'l_exp'} : expected value of the test statistic $res{'l_var'} : variance of the test statistic (given so many groups and observations) $res{'z_value'} : the standardized l_value $res{'p_value'} : the 2-tailed probability associated with the z_value (or 1-tailed if tails => 1). $res{'r_value'} : estimate of the Spearman rank-order correlation coefficient based on the observed and predicted order of each associated variable per observation.
$aov->anova(independent => 1|0, parametric => 1|0, ordinal => 0|1)
Aliases: aov, test
Generic method to access all anova functions by specifying TRUE/FALSE values for independent
, parametric
and ordinal
.
Independent Parametric Ordinal What you get 1 1 0 Fisher-type independent groups ANOVA 1 1 1 Fisher-type independent groups ANOVA with trend analysis 1 0 0 Kruskal-Wallis independent groups ANOVA 1 0 1 Jonckheere-Terpstra independent groups trend analysis 0 1 0 Fisher-type dependent groups ANOVA (univariate or multivariate) 0 1 1 (Fisher-type dependent groups ANOVA with trend analysis; not implemented) 0 0 0 Friedman's dependent groups ANOVA 0 0 1 Page's dependent groups trend analysis
All methods return nothing but the class object after feeding it with the relevant statistics, which you can access by name, as follows:
$res{'f_value'} (or $res{'chi_value'}, $res{'h_value'}, $res{'j_value'}, $res{'l_value'} and/or $res{'z_value'}) $res{'p_value'} : associated with the test statistic $res{'df_b'} : between-groups/treatment/numerator degree(s) of freedom $res{'df_w'} : within-groups/error/denominator degree(s) of freedom (also given with F-equivalent Friedman test) $res{'ss_b'} : between-groups/treatment sum of squares $res{'ss_w'} : within-groups/error sum of squares $res{'ms_b'} : between-groups/treatment mean squares $res{'ms_w'} : within-groups/error mean squares
$aov->obrien()
Alias: obrien_test
Performs O'Brien's (1981) test for equality of variances within each variable: based on transforming each observation in relation to its variance and its deviation from its mean; and performing an ANOVA on these values (for which the mean is equal to the variance of the original observations). The procedure is recognised to be robust against violations of normality (unlike F-max) (Maxwell & Delaney, 1990).
The statistical attributes now within the class object (see anova) pertain to this test, e.g., $aov->{'f_value'} gives the F-statistic for O'Brien's Test; and $aov->{'p_value'} gives the p-value associated with the F-statistic for O'Brien's Test.
$aov->levene()
Alias: levene_test
Performs Levene's (1960) test for equality of variances within each variable: an ANOVA of the absolute deviations, i.e., absolute value of each observation less its mean.
The statistical attributes now within the class object (see anova) pertain to this test, e.g., $aov->{'f_value'} gives the F-statistic for Levene's Test; and $aov->{'p_value'} gives the p-value associated with the F-statistic for Levene's Test.
Follow-up parametric ANOVAs. Note that for the one-way ANOVAs here tested, eta-squared is the same as partial eta-squared.
$etasq = $aov->eta_squared(independent => BOOL, parametric => BOOL, ordinal => BOOL);
Returns the effect size estimate (partial) eta-squared, calculated using sums-of-squares via Statistics::ANOVA::EffectSize. Also feeds $aov with the value, named 'eta_sq'.
Returns the effect size estimate (partial) omega-squared, calculated using mean sums-of-squares via Statistics::ANOVA::EffectSize. Also feeds $aov with the value, named 'omega_sq'.
$aov->compare(independent => 1|0, parametric => 1|0, tails => 2|1, flag => 0|1, alpha => .05, adjust_p => 0|1, adjust_e => 1|0|2, use_t => 0|1, dump => 0|1, str => 0|1)
Performs all possible pairwise comparisons, with the Bonferroni approach to control experiment-wise error-rate. The particular tests depend on whether or not you want parametric (default) or nonparametric tests, and if the observations have been made independently (between groups, the default) or by repeated measures. See Statistics::ANOVA::Compare.
$itv_str = $aov->(independent => 1|0, alpha => .05, name => 'aname', limits => 0) # get interval for single variable as string $lim_aref = $aov->(independent => 1|0, alpha => .05, name => 'aname', limits => 1) # get upper & lower limits for single variable as aref $itv_href = $aov->(independent => 1|0, alpha => .05, name => ['aname', 'bname'], limits => 0) # get interval for 2 variables as hashref keyed by variable names $lim_href = $aov->(independent => 1|0, alpha => .05, name => ['aname','bname'], limits => 1) # get upper & lower limits for 2 variables as hashref of variable-named arefs $itv_href = $aov->(independent => 1|0, alpha => .05, name => undef, limits => 0) # get intervals for all variables as hashref keyed by variable names $lim_href = $aov->(independent => 1|0, alpha => .05, name => undef, limits => 1) # upper & lower limits for all variables as hashref
Computes confidence intervals using (by default) the pooled estimate of variability over groups/levels, rather than the standard error within each group/level, as described by Masson and Loftus (2003). For a between groups design, the confidence interval (as usual) indicates that, at a certain level of probability, the true population mean is likely to be within the interval returned. For a within-subjects design, as any effect of the variability between subjects is eliminated, the confidence interval (alternatively) indicates the reliability of the how the sample means are distributed as an estimate of the how the population means are distributed.
In either case, there is an assumption that the variances within each condition are the same between the conditions (homogeneity of variances assumption).
Actual algorithm depends on whether the measures are obtained from indepedently (between-groups) (independent => 1) or by repeated measures (independent => 0) (i.e., whether between-groups or within-groups design). Default is between-groups.
The option use_mse
can be set to equal 0 so that the (typical) standard error of the mean is used in place of the mean-square error. This is one option to use when the variances are unequal.
The option conditions
can, optionally, include a referenced array naming the particular conditions that should be included when calculating MSe. By default, this is all the conditions, using MSe from the omnibus ANOVA. This is one option to handle the case of unequal variances between conditions.
$str = $aov->string(mse => 1, eta_squared => 1, omega_squared => 1, precision_p => integer, precision_s => integer)
Returns a statement of result, in the form of F(df_b, df_w) = f_value, p = p_value
; or, for Friedman test chi^2(df_b) = chi_value, p = p_value
(to the value of precision_p, if any); and so on for other test statistics. Optionally also get MSe, eta_squared and omega_squared values appended to the string, where relevant. These and the test statistic are "sprintf"'d to the precision_s specified (or, by default, not at all).
$table = $aov->table(precision_p => integer, precision_s => integer);
Returns a table listing the degrees of freedom, sums of squares, and mean squares for the tested "factor" and "error" (between/within variables), and the F- and p-values. The test statistics are "sprintf"'d to the precision_s specified (or, by default, not at all); the p value's precision can be specified by precision_p.
Up to this version, if calculating any of these values was not essential to calculation of the test statistic, the value will simply appear as a blank in the table. If the omnibus test last made was non-parametric, and no F-value was calculated, then the table returned is entirely an empty string.
Formatting with right-justification where appropriate is left for user-joy.
$aov->dump(title => 'ANOVA test', precision_p => integer, precision_s => integer, mse => 1, eta_squared => 1, omega_squared => 1, verbose => 1)
Prints the string returned by string, or, if specified with the attribute table => 1, the table returned by table; and the string as well if string => 1. A newline - "\n" - is appended at the end of the print of the string. Above this string or table, a title can also be printed, by giving a value to the optional title
attribute.
If verbose => 1, then any curiosities arising in the calculations are noted at the end of other dumps. At the moment, this is only the number of observations that might have been purged were they identified as undefined or not-a-number upon loading/adding.
$ss_tot = $aov(independent => BOOL, ordinal => BOOL); ($ss_tot, $s_b, $ss_w) = $aov(independent => BOOL, ordinal => BOOL);
Returns the total sum-of-squares, being the sum of the between- and within-groups sums-of-squares, and so definable as the "corrected" total sum-of-squares. Called in array context, also returns the between- and within-groups sums-of-squares themselves.
$ss_b = $anova->ss_b(independent => BOOL, ordinal => BOOL);
Returns the between-groups (aka treatment, effect, factor) sum-of-squares for the given data and the independence of the groups, and whether or not they have an ordinal relationship.
$ss_w = $anova->ss_w(independent => BOOL);
Returns the within-groups (aka error) sum-of-squares for the given data and according to whether the data per group are independent or dependent.
$mean = $anova->grand_mean();
Returns the mean of all observations.
$sum = $anova->grand_sum($data);
Returns the sum of all observations.
$count = $anova->grand_n();
Returns the number of all observations.
Initialising an alpha-value for significance-testing was done but not with a valid value; it must be a probablity, but less than 1 and greater than 0.
Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York, US: Academic.
Hollander, M., & Wolfe, D. A. (1999). Nonparametric statistical methods. New York, NY, US: Wiley.
Levene, H. (1960). Robust tests for equality of variances. In I. Olkins (Ed.), Contributions to probability and statistics. Stanford, CA, US: Stanford University Press.
Masson, M. E. J., & Loftus, G. R. (2003). Using confidence intervals for graphically based data interpretation. Canadian Journal of Experimental Psychology, 57, 203-220.
Maxwell, S. E., & Delaney, H. D. (1990). Designing experiments and analyzing data: A model comparison perspective. Belmont, CA, US: Wadsworth.
O'Brien, R. G. (1981). A simple test for variance effects in experimental designs. Psychological Bulletin, 89, 570-574.
Siegal, S. (1956). Nonparametric statistics for the behavioral sciences. New York, NY, US: McGraw-Hill
Math::Cephes Probabilities for all tests are computed using this module's functions, rather than the "in-house" Statistics::Distributions module, as the former appears to be more accurate for larger values of F.
Statistics::ANOVA::Cluster for determining parametric and nonparametric variable clusters by ANOVA.
Statistics::ANOVA::EffectSize for returning eta- and omega-squared.
Statistics::Data : used as base
.
Statistics::FisherPitman For an alternative to independent groups ANOVA when the variances are unequal.
Statistics::KruskalWallis Offers Newman-Keuls for pairwise comparison by ranks. Also offers non-parametric independent groups ANOVA, but note it does not handle ties in rank occurring between two or more observations, nor correct for them; an erroneous H-value is calculated if ties exist in your data. Also does not handle missing/invalid values. Present module adapts its _grouped method.
Statistics::Table::F Simply returns an F value. Does not handle missing values, treating them as zero and thus returning an erroneous F-value in these cases.
Please report any bugs or feature requests to bug-statistics-anova-0.13 at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Statistics-ANOVA-0.13. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Statistics::Data
You can also look for information at:
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Statistics-ANOVA-0.13
Roderick Garton, <rgarton at cpan.org>
Copyright 2006-2015 Roderick Garton
This program is free software. It may be used, redistributed and/or modified under the same terms as Perl-5.6.1 (or later) (see http://www.perl.com/perl/misc/Artistic.html).
To the maximum extent permitted by applicable law, the author of this module disclaims all warranties, either express or implied, including but not limited to implied warranties of merchantability and fitness for a particular purpose, with regard to the software and the accompanying documentation.