Statistics::OLS - perform ordinary least squares and associated statistics, v 0.07.
use Statistics::OLS; my $ls = Statistics::OLS->new(); $ls->setData (\@xydataset) or die( $ls->error() ); $ls->setData (\@xdataset, \@ydataset); $ls->regress(); my ($intercept, $slope) = $ls->coefficients(); my $R_squared = $ls->rsq(); my ($tstat_intercept, $tstat_slope) = $ls->tstats(); my $sigma = $ls->sigma(); my $durbin_watson = $ls->dw(); my $sample_size = $ls->size(); my ($avX, $avY) = $ls->av(); my ($varX, $varY, $covXY) = $ls->var(); my ($xmin, $xmax, $ymin, $ymax) = $ls->minMax(); # returned arrays are x-y or y-only data # depending on initial call to setData() my @predictedYs = $ls->predicted(); my @residuals = $ls->residuals();
I wrote Statistics::OLS to perform Ordinary Least Squares (linear curve fitting) on two dimensional data: y = a + bx. The other simple statistical module I found on CPAN (Statistics::Descriptive) is designed for univariate analysis. It accomodates OLS, but somewhat inflexibly and without rich bivariate statistics. Nevertheless, it might make sense to fold OLS into that module or a supermodule someday.
Statistics::OLS computes the estimated slope and intercept of the regression line, their T-statistics, R squared, standard error of the regression and the Durbin-Watson statistic. It can also return the residuals.
It is pretty simple to do two dimensional least squares, but much harder to do multiple regression, so OLS is unlikely ever to work with multiple independent variables.
This is a beta code and has not been extensively tested. It has worked on a few published datasets. Feedback is welcome, particularly if you notice an error or try it with known results that are not reproduced correctly.
use Statistics::OLS; my $ls = Statistics::OLS->new;
$ls->setData (\@xydata); $ls->setData (\@xdata, \@ydata);
The setData() method registers a two-dimensional dataset with the regression object. You can pass the dataset either as a reference to one flat array containing the paired x,y data or as references to two arrays, one each for the x and y data. [In either case, the data arrays in your script are not cached (copied into the object). If you alter your data, you may optionally call setData() again (if you want error checking--see below) but you should at least call the regress() method (see below) to recompute statistics for the new data. Or more simply, do not alter your data.]
As a single array, in your script, construct a flat array of the form (x0, y0, ..., xn, yn) containing n+1 x,y data points. Then pass a reference to the data array to the setData() method. (If you do not know what a reference is, just put a backslash (\) in front of the name of your data array when you pass it as an argument to setData().) Like this:
my @xydata = qw( -3 9 -2 4 -1 1 0 0 1 1 2 4 3 9); $ls->setData (\@xydata);
Or, you may find it more convenient to construct two equal length arrays, one for the horizontal and one for the corresponding vertical data. Then pass references to both arrays (horizontal first) to setData():
my @xdata = qw( -3 -2 -1 0 1 2 3 ); my @ydata = qw( 9 4 1 0 1 4 9 ); $ls->setData (\@xdata, \@ydata);
Error checking: The setData() method returns a postive integer on success and 0 on failure. If setData() fails, you can recover an error message about the failure with the error() method. The error string returned will either be one of
The data set does not contain an equal number of x and y values. The data element ... is non-numeric. The data set must contain at least three points.
In your script, you could test for errors like:
$ls->setData (\@data) or die( $ls->error() );
In the current version, only numerals, decimal points (apologies to Europeans), scientific notation (1.7E10) and minus signs are permitted. Currencies ($), time (11:23am) or dates (23/05/98) are not yet supported and will generate errors. I may figure these out someday.
$ls->regress() or die ( $ls->error() );
This performs most of the calculations. Call this method after setting the data, but before asking for any regressions results. If you change your data, previous calculations will generallly be inaccurate, so you should call this method again. The regress() method returns 1 on success, The only error message is
No datset has been registered.
although a number of undef results (due to divide by zero errors) may be returned in specific statistics below.
my ($intercept, $slope) = $ls->coefficients(); my $R_squared = $ls->rsq(); my ($tstat_intercept, $tstat_slope) = $ls->tstats(); my $sigma = $ls->sigma(); my $durbin_watson = $ls->dw(); my $sample_size = $ls->size(); my ($avX, $avY) = $ls->av(); my ($varX, $varY, $covXY) = $ls->var(); my ($xmin, $xmax, $ymin, $ymax) = $ls->minMax();
Call these methods only after you have called regress(). Most of these should be familiar from any econometrics text. If the slope is infinite (variance of X is zero) it is set to undef. R-squared is 1.0 if the sample variances of either X or Y are zero (or the data are colinear). If the variance of X is zero, both T statistics are set to undef. sigma is an estimate of the homoscedastic standard deviation of the error term, also known as the standard error of the estimate. The variances use n-1. Durbin-Watson returns undef if the data are colinear.
my @predictedYs = $ls->predicted(); my @residuals = $ls->residuals();
Call these methods only after you have called regress(). Both methods return data arrays, in the same format you used in setData(). If the data was passed to setData() as a reference to an @xydata array of the form (x0, y0, ..., xn, yn), then the results of these methods will be of this same form, except that the y values will either be the predicted y based on the coefficient estimates, or the residual error of that predicted y from the observed value of y.
If the data was passed as references to two arrays, @xdata = (x0 ... xn) and @ydata = (y0 ... yn), then the results of these two methods will be a single array of y type data, either the predicted y or residual error. The original x data array will still correspond to these result arrays.
This module is beta code, so it is not guaranteed to work right. I have not exhaustively tested it.
Possible future work includes support for other data formats, such as date, time and currency.
Generalization to multiple regression is probably not in the cards, since it is more than an order of magnitude more difficult. Better to use something Fortran based or maybe the Perl Data Language.
It would make sense to fold this into Statistics::Descriptive as a more comprehensive library, perhaps called
libstats. But that might not happen soon, since it sounds like a big project.
Comments and bug reports are welcome.
Copyright (c) 1998 by Sanford Morton, firstname.lastname@example.org. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This work is dedicated to the memory of Dr. Andrew Morton, who requested it. Requiescat in pace, my friend.
The Statistics::Descriptive(1) module performs useful univariate statistics and is well tested. The Perl Data Language (see CPAN) may also prove useful for statistics.
Simple linear regression is discussed in all econometrics and most probablility and statistics texts. I used E<Basic Econometrics> 2nd ed., by Gujaratii, New York: McGraw-Hill,1988, for most of the formulas and the test example (appendix 3A.6, page 87).