
Name

Microarray::DataMatrix - abstraction to matrices of microarray data

Abstract and Overall Logic

Note: This documentation is for developers only. Clients of concrete subclasses of this package should have no need to consult it, as the API for those subclasses should be fully documented as part of those subclasses.

dataMatrix provides an abstract superclass for a collection of abstract classes that deal with matrices. Only in the context of those other classes is dataMatrix useful and meaningful. dataMatrix itself provides protected methods for certain primitive operations that can be used by its subclasses, and public methods that require its immediate subclasses to share the same underlying structure for tracking the state of their matrix, such as which rows and columns have not yet been filtered out.

The collection of classes is structured like this:

                    dataMatrix

                        /\
                       /  \
                      /    \
                ISA  /      \ ISA
                    /        \
                   /          \
                  /            \
        smallDataMatrix  bigDataMatrix
                \               /
                 \             /
                  \           /
                   \         /
           CanBeA   \       /  CanBeA
                     \     /
                      \   /
                       \ /
                 anySizeDataMatrix

                        |
                        | ISA
                        |
                ------------------  -  -  -  -  -  -
                |               |                  |
                |               |                  |
                |               |                  |
        concreteClassA  concreteClassB        concreteClassX

anySizeDataMatrix provides an abstraction to a dataMatrix whose contents may or may not fit into memory. An object will inherit dynamically, at construction time, from either smallDataMatrix or bigDataMatrix, each of which knows how to deal with a matrix of a particular size. anySizeDataMatrix itself is an abstract class, and will be subclassed by concrete classes dealing with a particular file type of data, which they know how to parse, for example a pclFile.

Because dataMatrix, smallDataMatrix, bigDataMatrix and anySizeDataMatrix were developed together as a collection of classes, they are somewhat more intimate with each other than, say, a concrete subclass of anySizeDataMatrix would be with anySizeDataMatrix itself. While the subclasses do stick to the API, and respect the privacy of attributes and methods, the API was developed simultaneously with the subclasses that were using it. Thus it may not be the cleanest API in the world.

This collection of classes tries to follow the rule that all attribute names are preceded by "$PACKAGE::" in the object's hash. Private attribute names and private methods are preceded by two underscores; protected attributes and protected methods (which can be accessed by subclasses, as well as in $PACKAGE itself) are preceded by a single underscore. Public attributes and methods (which can be accessed anywhere) have no preceding underscores. In actuality, all object attributes are (and should be) private. If there is a need for either subclasses or clients to manipulate or access them, then protected and public methods, respectively, are provided for setting or getting the values of the attributes. Disobey this interface at your peril!
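
The naming rules above can be illustrated with a minimal sketch. The class name Hypothetical::Matrix and the exact key name are assumptions made for illustration; only the "$PACKAGE::" prefix and the underscore conventions come from the text.

```perl
package Hypothetical::Matrix;   # hypothetical class name, for illustration only

use strict;
use warnings;

my $PACKAGE = __PACKAGE__;

# Attribute keys carry the package name as a prefix, so that classes sharing
# one object hash cannot collide. The leading "__" marks the attribute private.
my $kAutoDump = $PACKAGE . '::__autoDump';

sub new {
    my ($class) = @_;
    my $self = bless {}, $class;
    $self->_setAutoDump(0);
    return $self;
}

# Protected setter (single underscore): for this class and its subclasses.
sub _setAutoDump {
    my ($self, $flag) = @_;
    $self->{$kAutoDump} = $flag ? 1 : 0;
}

# Protected getter: the only sanctioned way to read the attribute.
sub _autoDump {
    my ($self) = @_;
    return $self->{$kAutoDump};
}

package main;

my $matrix = Hypothetical::Matrix->new;
$matrix->_setAutoDump(1);   # called from outside only to exercise the sketch
print join(", ", keys %{$matrix}), "\n";   # prints Hypothetical::Matrix::__autoDump
```

Because the key is held in a lexical variable, code outside the file cannot easily name the attribute, which nudges subclasses towards the accessor methods.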

Protected Setter/Mutator Methods

_setAutoDump

This protected method is used to set the autodump flag, which can be either 1 or 0. This should only be utilized by subclasses, not clients.

Usage:

        $self->_setAutoDump(1);

_setValidColumns

This protected setter method receives a reference to a hash, which has as its keys the indexes of the columns in the matrix which are valid. This method MUST be used when the matrix is first read, to set up all the columns which are initially valid (this call will actually occur in the _init methods, or methods called by them, of big- and smallDataMatrix). The values of the hash will usually be undef, simply to save space. There is no expectation for them to be otherwise.

Usage:

       $self->_setValidColumns(\%validColumns);

_setValidRows

This protected setter method receives a reference to a hash, which has as its keys the indexes of the rows in the matrix which are valid. This method is expected to be used only when the matrix is first read, to set up all the rows which are initially valid (this call will actually occur in the _init methods, or methods called by them, of big- and smallDataMatrix). The values of the hash will usually be undef, simply to save space. There is no expectation for them to be otherwise.

Usage:

        $self->_setValidRows(\%validRows);
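
A hash of valid indexes with undef values, as expected by _setValidRows and _setValidColumns, might be built as in the following sketch. The helper name allValid and the zero-based indexing are assumptions; the real _init methods may construct the hashes differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a reference to a hash whose keys are the
# indexes 0 .. $count - 1 and whose values are all undef.
sub allValid {
    my ($count) = @_;
    my %valid;
    # Assigning an empty list to a hash slice creates every key with an
    # undef value, so nothing is stored for the values.
    @valid{ 0 .. $count - 1 } = ();
    return \%valid;
}

my $validRows    = allValid(4);
my $validColumns = allValid(3);

# A subclass's _init method would then hand these over, for example:
#   $self->_setValidRows($validRows);
#   $self->_setValidColumns($validColumns);

print join(",", sort { $a <=> $b } keys %{$validRows}), "\n";   # prints 0,1,2,3
```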

_setErrstr

This protected setter method accepts a scalar describing an error that has occurred, and stores it within the object.

Usage:

        $self->_setErrstr($error);

_invalidateMatrixRow

This protected mutator method makes a row invalid. This method is not undoable, because the invalidation also deletes the data for the row. Note that the row index MUST correspond to the index of that row in the original file, not whatever row it may currently be (i.e. if rows 1 and 2 were filtered out, row 3 should still be called row 3 when being invalidated, not row 1).

Usage:

        $self->_invalidateMatrixRow($row);

_invalidateMatrixColumn

This protected mutator method makes a column invalid. Note that the column index MUST correspond to the index of that column in the original file, not whatever column it may currently be (i.e. if columns 1 and 2 were filtered out, column 3 should still be called column 3 when being invalidated, not column 1).

Usage:

      $self->_invalidateMatrixColumn($column);
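
The requirement to use original-file indexes can be seen in a small sketch. The function invalidateRow and the layout of the state hash are hypothetical stand-ins, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _invalidateMatrixRow. $state holds two hashes
# keyed by the ORIGINAL row index: {valid} (index => undef) and {data}.
sub invalidateRow {
    my ($state, $row) = @_;
    delete $state->{valid}{$row};   # the row is no longer valid
    delete $state->{data}{$row};    # its data is discarded: not undoable
    return;
}

my $state = {
    valid => { map { $_ => undef } 0 .. 4 },
    data  => { map { $_ => [ $_, $_ * 2 ] } 0 .. 4 },
};

invalidateRow($state, 1);
invalidateRow($state, 2);
invalidateRow($state, 3);   # still addressed as row 3, not renumbered

print join(",", sort { $a <=> $b } keys %{ $state->{valid} }), "\n";   # prints 0,4
```

Keying everything by the original index means no renumbering is ever needed after a filter has run.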

PROTECTED GETTER/ACCESSOR METHODS

_autoDump

This protected method returns a boolean to indicate whether autodumping is enabled.

Usage:

        if ($self->_autoDump){
            # blah
        }

_validRowsArrayRef

This protected accessor returns a reference to an array that contains the indexes of all the valid rows.

Usage:

        foreach my $row (@{$self->_validRowsArrayRef}){
                # do something useful
        }

_validColumnsArrayRef

This protected accessor returns a reference to an array that contains the indexes of all the valid columns.

Usage:

        foreach my $column (@{$self->_validColumnsArrayRef}){
                # do something useful
        }

_matrixRowIsValid

This protected accessor returns a boolean to indicate whether a given row in the data matrix is still valid (i.e. has not been filtered out). The row index is with respect to its index in the original file that was used to construct the object.

Usage:

      if ($self->_matrixRowIsValid($row)){
          # blah
      }

_matrixColumnIsValid

This protected accessor returns a boolean to indicate whether a given column in the data matrix is still valid (i.e. has not been filtered out). The column index is with respect to its index in the original file that was used to construct the object.

Usage:

      if ($self->_matrixColumnIsValid($column)){
          # blah
      }

_numColumnsToReport

This protected method returns the number of columns to process after which reporting should be done, if verbose reporting has been indicated. If no value has been set, then the default of 50 is returned.

Usage:

      my $numColumnsToReport = $self->_numColumnsToReport;

_numRowsToReport

This protected method returns the number of rows to process after which reporting should be done, if verbose reporting has been indicated. If no value has been set, then the default of 5000 is returned.

Usage:

      my $numRowsToReport = $self->_numRowsToReport;
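
One plausible reading of these reporting intervals is that a progress message is due each time the count of processed rows (or columns) reaches a multiple of the interval. The sketch below assumes exactly that; the function name reportPoints is hypothetical, and the module's actual reporting logic may differ.

```perl
use strict;
use warnings;

# Returns the 1-based row counts at which a progress message would be due.
sub reportPoints {
    my ($numRows, $numRowsToReport) = @_;
    $numRowsToReport = 5000 unless defined $numRowsToReport;   # documented default
    return grep { $_ % $numRowsToReport == 0 } 1 .. $numRows;
}

print "Processed $_ rows\n" for reportPoints(12000);
# prints "Processed 5000 rows" and then "Processed 10000 rows"
```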

_lineEnding

This protected method returns the appropriate line ending, for text or html reporting. It expects a string, either 'html' or 'text' and will return the appropriate line ending.

Usage:

        my $lineEnding = $self->_lineEnding("text");

_centeringMethodIsAllowed

This protected method returns a boolean to indicate whether a centering method is allowed. Allowed methods are 'mean' and 'median'.

Usage:

      if ($self->_centeringMethodIsAllowed($method)){
          # blah
      }

_operatorIsAllowed

This protected method returns a boolean to indicate whether a particular operator is allowed. For each operator, there exists a corresponding method that uses that operator. Such operators are used when filtering rows by their values, e.g. > or <.

Usage:

      if ($self->_operatorIsAllowed($operator)){
          # blah
      }

_methodForOperator

This protected method returns the name of the method that is used to compare two values, based on the operator that was passed in.

Usage:

      my $method = $self->_methodForOperator($operator);
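
The relationship between _operatorIsAllowed and _methodForOperator resembles a dispatch table. The following is a minimal sketch of that idea; the class Hypothetical::Comparer, the operator set, the comparison method names and the filter method are all assumptions made for illustration.

```perl
package Hypothetical::Comparer;   # hypothetical class, for illustration only

use strict;
use warnings;

# Operator => name of the method that applies it (assumed names).
my %kMethodForOperator = (
    '>'  => '_greaterThan',
    '<'  => '_lessThan',
    '>=' => '_greaterThanOrEqualTo',
    '<=' => '_lessThanOrEqualTo',
);

sub new                { return bless {}, shift }
sub allowedOperators   { return sort keys %kMethodForOperator }
sub _operatorIsAllowed { return exists $kMethodForOperator{ $_[1] } }
sub _methodForOperator { return $kMethodForOperator{ $_[1] } }

sub _greaterThan          { return $_[1] >  $_[2] }
sub _lessThan             { return $_[1] <  $_[2] }
sub _greaterThanOrEqualTo { return $_[1] >= $_[2] }
sub _lessThanOrEqualTo    { return $_[1] <= $_[2] }

# Keep only the values that satisfy "<value> <operator> <threshold>".
sub filter {
    my ($self, $operator, $threshold, @values) = @_;
    die "Operator '$operator' is not allowed\n"
        unless $self->_operatorIsAllowed($operator);
    my $method = $self->_methodForOperator($operator);
    return grep { $self->$method($_, $threshold) } @values;
}

package main;

my @kept = Hypothetical::Comparer->new->filter('>=', 2, 1, 2, 3);
print "@kept\n";   # prints 2 3
```

Looking the method name up once and calling it through `$self->$method(...)` avoids a chain of string comparisons inside the filtering loop.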

PROTECTED UTILITY METHODS

_rowAverage

This method returns the average of the valid entries in a row, using either the mean or the median, depending on the requested method. The row is passed in as a reference to an array containing the values for the row. If no mean/median could be calculated, then the method returns undef. Only values at valid column indexes within the passed-in array are used in the calculation.

Usage:

        my $average = $self->_rowAverage(\@row, "mean");

_average

This method calculates either the mean or the median of a set of data. It receives the averaging method, a reference to an array of all the data values, the sum total of all the datapoints, and the total number of datapoints. The array is required to calculate the median (and is not assumed to be sorted); the sum total is required to calculate the mean. The number of datapoints must be non-zero.

Usage:

        my $average = $self->_average("mean", \@data, $total, $numDatapoints);
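
A minimal sketch of such a calculation follows, written as a plain function rather than a method. The argument order follows the Usage line above; the body, including how an even number of datapoints is handled for the median, is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _average.
sub average {
    my ($method, $dataRef, $total, $numDatapoints) = @_;
    die "The number of datapoints must be non-zero\n" unless $numDatapoints;

    return $total / $numDatapoints if $method eq 'mean';

    # Median: sort a copy numerically, because the input is not assumed sorted.
    my @sorted = sort { $a <=> $b } @{$dataRef};
    my $middle = int($numDatapoints / 2);
    return $numDatapoints % 2
        ? $sorted[$middle]                                # odd count
        : ($sorted[$middle - 1] + $sorted[$middle]) / 2;  # even count
}

my @data = (5, 1, 4, 2);
print average('mean',   \@data, 12, 4), "\n";   # prints 3
print average('median', \@data, 12, 4), "\n";   # prints 3
```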

_centerRow

This protected method takes an array reference to a row, and the average (either mean or median, depending on what was requested), and subtracts that value from every valid value (i.e. for the valid column indexes) in the row.

Usage:

        $self->_centerRow(\@row, $average);
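
Centering amounts to an in-place subtraction over the valid column indexes. In this sketch the valid indexes are passed in explicitly and undefined (missing) values are skipped; both choices are assumptions, since the real method obtains the valid columns from the object.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _centerRow: subtracts $average in place from
# the entries of the row at the given valid column indexes.
sub centerRow {
    my ($rowRef, $average, $validColumnsRef) = @_;
    foreach my $column (@{$validColumnsRef}) {
        next unless defined $rowRef->[$column];   # leave missing values alone
        $rowRef->[$column] -= $average;
    }
    return;
}

my @row = (1, 2, 99, 6);          # column 2 has been filtered out
centerRow(\@row, 3, [0, 1, 3]);   # 3 is the mean of 1, 2 and 6
print "@row\n";                   # prints -2 -1 99 3
```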

_calculateMeansAndStdDeviations

This method expects to receive hashes of the sums of X, the sums of X squared and the number of datapoints, where the keys for each hash are the unique identifiers for a series of numbers, whose mean and standard deviations are to be calculated. It returns references to hashes that hash the same ids to the means and standard deviations. It uses the n-1 version of standard deviation. If a standard deviation cannot be calculated, it will be stored as undef.

Usage:

        my ($stddevHashRef, $meansHashRef) = $self->_calculateMeansAndStdDeviations(\%sumOfX, \%sumX2, \%numDataPoints);
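
The n-1 standard deviation can be computed from the running sums alone, as variance = (sum(x^2) - (sum(x))^2 / n) / (n - 1). The following sketch assumes that formula and returns its results in the same order as the Usage line above; the function name and the handling of ids with no datapoints are assumptions.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _calculateMeansAndStdDeviations.
sub meansAndStdDeviations {
    my ($sumXRef, $sumX2Ref, $numRef) = @_;
    my (%means, %stddevs);
    foreach my $id (keys %{$sumXRef}) {
        my $n = $numRef->{$id};
        next unless $n;   # no datapoints, so nothing can be calculated
        $means{$id} = $sumXRef->{$id} / $n;

        # Sample (n-1) variance from the sums; undefined for a single point.
        my $variance = $n > 1
            ? ($sumX2Ref->{$id} - $sumXRef->{$id} ** 2 / $n) / ($n - 1)
            : undef;

        # Rounding can leave a tiny negative variance; treat that as undef too.
        $stddevs{$id} = (defined $variance && $variance >= 0)
            ? sqrt($variance)
            : undef;
    }
    return (\%stddevs, \%means);
}

# Series "a" is 1, 2, 3: sum 6, sum of squares 14, three datapoints.
my ($stddevs, $means) = meansAndStdDeviations({ a => 6 }, { a => 14 }, { a => 3 });
print "mean $means->{a}, stddev $stddevs->{a}\n";   # prints mean 2, stddev 1
```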

_calculateBounds

This method receives two hashes by reference. One is a hash of means, the other a hash of standard deviations. It also receives a multiplier. It then calculates, and returns as hash references, the upper and lower bounds, i.e. the mean plus or minus that number of standard deviations. It also receives the line ending it should use if being verbose in its reporting.

Usage:

        my ($upperHashRef, $lowerHashRef) = $self->_calculateBounds($stddevHashRef, $meansHashRef, $deviations, $lineEnding);
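
The bounds calculation is simple arithmetic per identifier. This sketch takes its first three arguments in the order shown in the Usage line above and omits the line ending, since it does no reporting; skipping ids whose standard deviation is undef is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for _calculateBounds.
sub bounds {
    my ($stddevsRef, $meansRef, $deviations) = @_;
    my (%upper, %lower);
    foreach my $id (keys %{$meansRef}) {
        next unless defined $stddevsRef->{$id};   # no deviation, no bounds
        my $spread = $deviations * $stddevsRef->{$id};
        $upper{$id} = $meansRef->{$id} + $spread;
        $lower{$id} = $meansRef->{$id} - $spread;
    }
    return (\%upper, \%lower);
}

my ($upper, $lower) = bounds({ a => 1 }, { a => 2 }, 3);
print "$lower->{a} .. $upper->{a}\n";   # prints -1 .. 5
```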

_giveOverrideMessage

This protected utility method can be used by any subclass that expects its own subclasses to implement certain methods. It can have stub methods, that simply call this method, which will give a standard error message saying that the class 'X' must override method Y.

Usage:

        $self->_giveOverrideMessage();
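
A stub-and-override arrangement of this kind might look like the sketch below. The class names, the parse stub and the message wording are hypothetical, and whether the real method dies or merely prints its message is not stated here; this sketch dies.

```perl
package Hypothetical::Base;   # hypothetical abstract class

use strict;
use warnings;

sub new { return bless {}, shift }

# caller(1) describes the stub that called this method; element 3 of the
# list it returns is that stub's fully qualified name.
sub _giveOverrideMessage {
    my ($self) = @_;
    my $method = (caller(1))[3];
    die "The class '" . ref($self) . "' must override the method $method.\n";
}

# Stub that concrete subclasses are expected to replace.
sub parse { $_[0]->_giveOverrideMessage }

package Hypothetical::Concrete;   # forgets to override parse

our @ISA = ('Hypothetical::Base');

package main;

eval { Hypothetical::Concrete->new->parse };
print $@;
# prints: The class 'Hypothetical::Concrete' must override the method Hypothetical::Base::parse.
```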

PUBLIC ACCESSOR METHODS

allowedOperators

This public method returns a sorted array of all the allowed operators that may be used by methods (in subclasses) that employ the operators for whatever reason (their interface should indicate that they employ such operators).

Usage:

    my @operators = $matrix->allowedOperators;

PUBLIC SETTER METHODS

setNumColumnsToReport

This method accepts a positive integer giving the number of columns that must be processed, during a filtering/transformation method that is carried out on a column basis, before progress is indicated. If a client has not set this value, then it defaults to 50.

Usage:

    $matrix->setNumColumnsToReport(50);

setNumRowsToReport

This method accepts a positive integer giving the number of rows that must be processed, during a filtering/transformation method that is carried out on a row basis, before progress is indicated. If a client has not set this value, then it defaults to 5000.

Usage:

    $matrix->setNumRowsToReport(5000);

AUTHOR

Gavin Sherlock

sherlock@genome.stanford.edu