The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

DataCube - An Object Oriented Perl Module for Data Mining, Data Warehousing, and creating OLAP cubes.

SYNOPSIS

  use strict;
  use warnings;
  
  use DataCube;
  use DataCube::Schema;
  
  
  
  # first, make yourself a schema
  
  my $schema = DataCube::Schema->new;
  
  $schema->add_dimension('country');
  $schema->add_dimension('product');
  $schema->add_dimension('salesperson');
  
  $schema->add_hierarchy('year','quarter','month','day');
  
  $schema->add_measure('sum','units_sold');
  $schema->add_measure('sum','dollar_volume');
  $schema->add_measure('average','price_per_unit');
  
  
  
  # make a cube from a schema
  
  my $cube = DataCube->new($schema);
  
  
  
  # get your hands on some data

  my @data = example_sales_data();
  
  
  
  # insert your data into the cube, one hashref at a time
  
  for(my $i = 0; $i < @data; $i++){  
    
    my $date  = $data[$i][0];
    
    my($month,$day,$year) = split/\//,$date;
    
    my $quarter = $month > 0 && $month < 4  ? 'Q1' :
                  $month > 3 && $month < 7  ? 'Q2' :
                  $month > 6 && $month < 10 ? 'Q3' :
                  $month > 9 && $month < 13 ? 'Q4' : '';
    
    my %data = ();
    
    $data{year}           = $year;
    $data{quarter}        = $quarter;
    $data{month}          = sprintf("%02d",$month);
    $data{day}            = sprintf("%02d",$day);
    $data{country}        = $data[$i][1];
    $data{salesperson}    = $data[$i][2];
    $data{product}        = $data[$i][3];
    $data{units_sold}     = $data[$i][4];
    $data{price_per_unit} = $data[$i][5];
    $data{dollar_volume}  = $data[$i][6];
    
    $cube->insert(\%data);
    
  }
  
  # generate all the rollups
  
  $cube->rollup;
  
  
  # make all your reports
  
  my $target = '/home/david/data_warehouse/reports/sales_data';
  
  $cube->report($target);
  
  
  # alternatively, save your work to disk for later 
  
  my $cube_store = '/home/david/data_warehouse/cubes/sales_cube';
  
  $cube->commit($cube_store);
  
  
  # congratulations, you now have a data warehouse 
  
  
  sub example_sales_data {
    my $data = '
        Date         Country   SalesPerson     Product     Units   Unit_Cost       Total
        3/15/2005         US       Sorvino      Pencil        56        2.99      167.44
        3/7/2006          US       Sorvino      Binder         7       19.99      139.93
        8/24/2006         US       Sorvino        Desk         3      275.00      825.00
        9/27/2006         US       Sorvino         Pen        76        1.99      151.24
        5/22/2005         US      Thompson      Pencil        32        1.99       63.68
        10/14/2006        US      Thompson      Binder        57       19.99     1139.43
        4/18/2005         US       Andrews      Pencil        75        1.99      149.25
        4/10/2006         US       Andrews      Pencil        66        1.99      131.34
        10/31/2006        US       Andrews      Pencil       114        1.29      147.06
        12/21/2006        US       Andrews      Binder        28        4.99      139.72
        2/26/2005         CA          Gill         Pen        51       19.99     1019.49
        1/15/2006         CA          Gill      Binder        46        8.99      413.54
        5/14/2006         CA          Gill      Pencil        94        1.29      121.26
        5/31/2006         CA          Gill      Binder       102        8.99      916.98
        9/10/2006         CA          Gill      Pencil        98        1.29      126.42
        2/9/2005          UK       Jardine      Pencil       125        4.99      623.75
        5/5/2005          UK       Jardine      Pencil        90        4.99      449.10
        3/24/2006         UK       Jardine      PenSet        76        4.99      379.24
        11/17/2006        UK       Jardine      Binder        39        4.99      194.61
        12/4/2006         UK       Jardine      Binder        94       19.99     1879.06
        1/23/2005         US        Kivell      Binder        50       19.99      999.50
        11/25/2005        US        Kivell      PenSet        96        4.99      479.04
        6/17/2006         US        Kivell        Desk         5      125.00      625.00
        8/7/2006          US        Kivell      PenSet        42       23.95     1005.90
        6/25/2005         UK        Morgan      Pencil        90        4.99      449.10
        10/5/2005         UK        Morgan      Binder        28        8.99      251.72
        7/21/2006         UK        Morgan      PenSet        55       12.49      686.95
        9/1/2005          US         Smith        Desk         2      125.00      250.00
        12/12/2005        US         Smith      Pencil        67        1.29       86.43
        2/1/2006          US         Smith      Binder        87       15.00     1305.00
        7/12/2005         US        Howard      Binder        29        1.99       57.71
        4/27/2006         US        Howard         Pen        96        4.99      479.04
        1/6/2005          CA         Jones      Pencil        95        1.99      189.05
        4/1/2005          CA         Jones      Binder        76        4.99      379.24
        6/8/2005          CA         Jones      Binder        60        8.99      539.40
        8/15/2005         US         Jones      Pencil        35        4.99      174.65
        9/18/2005         US         Jones      PenSet        16       15.99      255.84
        10/22/2005        US         Jones         Pen        64        8.99      575.36
        2/18/2006         CA         Jones      Binder         4        4.99       19.96
        7/4/2006          CA         Jones      PenSet        61        4.99      304.39
        7/29/2005         UK         Hogan      Binder        81       19.99     1619.19
        11/8/2005         UK         Hogan         Pen        12       19.99      239.88
        12/29/2005        UK         Hogan      PenSet        74       15.99     1183.26
    ';
    
        my @data = map { [ grep { /\S/ } split/\s+/] } grep { /\S/ } split/\n+/,$data;
        shift @data;
        return @data;
    }

DESCRIPTION

This module provides a pure perl, object oriented, embeddable data cubing engine. It is self contained and ready to use in data mining and data warehousing applications.

OBJECT METHODS

This module provides several methods to create, store, modify and access DataCubes.

Core Methods

    These methods expose the core DataCube API. They are presented in the order in which you will probably want to use them.

    All examples below follow the example from the new method.

new

    The new constructor should be used with a schema like so:

      use DataCube;
      use DataCube::Schema;
      
      my $schema = DataCube::Schema->new;
      
      $schema->add_dimension('foo');
      $schema->add_dimension('bar');
      
      $schema->add_hierarchy('goo','gaz','waka_waka');
      
      $schema->add_measure('count');
      $schema->add_measure('count','stuff');
      $schema->add_measure('average','blah_blah_blah');
      
      my $cube = DataCube->new($schema);

insert

    Now that you have a data cube, insert your data, one hashref at a time:

      foreach $row (@in_some_data_set) {  
        
        my %data = ();
        
        $data{foo}             = $row[0];
        $data{bar}             = $row[1];
        $data{goo}             = $row[2];
        $data{gaz}             = $row[3];
        $data{stuff}           = $row[4];
        $data{waka_waka}       = $row[5];
        $data{blah_blah_blah}  = $row[6];
        
        $cube->insert(\%data);
        
      }

    The insert method returns a unique identifier of the inserted record.

    Notice that each field from the schema is present and populated by name at insertion time.

    DataCube will *not* under any circumstances perform sanity checks for you, ever.

load_data_infile

    The load_data_infile method will batch insert the contents of an entire text file into a cube;

        $cube->load_data_infile( $file );

    The file format is simple:

        1. tab delimited fields
        2. new line delimited records
        3. column names at the top.  the column names must be the same as from your schema.

    Example:

    Let's I have a file called 'revenue.tsv' which looks like this:

        country   product    year    revenue
        us        pens       2009    1856.45
        us        pencils    2008   20495.90
        
        [ ... ]

    Then I can load it into a cube as follows:

        my $schema = DataCube::Schema->new;
        
        $schema->add_dimension('year');
        $schema->add_dimension('country');
        $schema->add_dimension('product');
        
        $schema->add_measure('sum','revenue');
        
        my $cube = DataCube->new( $schema );
        
        $cube->load_data_infile('revenue.tsv');
        

    Note: it's ok if your text file contains more columns than your schema.

rollup

    The rollup method will perform all the aggregations along all the dimensions and measures specified in your schema.

    It will happen in memory and very quickly.

    This method explodes if the cube has already been rolled up when called.

lazy_rollup

    The lazy_rollup method will perform all the aggregations along all the dimensions and measures specified in your schema, iteratively, as opposed to all at once.

    An example use case:

      my $target = '/media/Raptor/reports';
      
      while(my $next_cube = $cube->lazy_rollup) {
          $next_cube->report($target);
      }

    The advantage is that far less ram is used and the time it takes to generate the rollups is increased only slightly.

commit

    The commit method saves your work to disk. It will create its own directory tree and will store files using its own internal logic.

    Emmited files are binary, not human readable, and should not be modified by anything but DataCube.

      my $target = 'path/to/some/folder/that/exists'';
      
      $cube->commit($target);

    If the commit target exists, it will be updated in place and will reflect all commits to date.

sync

    The sync method uses the binary files in a commit target to create flat text file reports.

      $cube->sync($target);

    Flat text files will be located alongside their binary files with the name '.report'; In our example, 16 rollup tables will have been created and commited to disk. They will have 32 letter hex names:

      my $target = 'path/to/some/folder/that/exists';
      
      $cube->commit($target);
    
      #  suppose you now have a folder called
      # 'path/to/some/folder/that/exists/882e5cd3dd0a5a69ea992891621c6715'
      
      #  you can now call sync:
      
      $cube->sync($target);
    
      #  to create a flat text file report at this location:
      # 'path/to/some/folder/that/exists/882e5cd3dd0a5a69ea992891621c6715/.report'

    Before the call to sync is made, the contents of the .report file may not exist or may exist but may not be current. Be careful.

report

    The report method creates flat text file reports, in the specified location.

      $cube->report($target);

    Flat text files will be located in the $target directory, with human readable names.

    If two directories are passed to the report method, then flat text file reports with be generated from a prior commit target like so:

      my $commit_target = 'data_warehouse/cubes/my_cube';
      my $report_target = 'data_warehouse/reports/my_reports';
      
      $cube->commit($commit_target);
      $cube->report($commit_target, $report_target);

    Now, the $report_target directory will contain reports which represent the cumulative work in the $commit_target

report_html

    The report_html method is the same as the report method but produces html reports instead of flat text files. See 'report';

Access Methods

    These methods provide basic read / write access to the internal contents of the cube.

reset

    The reset method resets the internal state of the cube :

        $cube->reset;

    This is equivalent to

        $cube = Datacube->new( $cube->schema );

    In other words, the reset method empties the data from all the cubes internal tables.

unroll

    The unroll method reverts the cube to its pre-rollup state and has no effect on unrolled cubes:

        $cube->unroll;

    It is implemented as follows:

        sub unroll {
            
            my($self) = @_;
            
            for(keys %{$self->cube_store->cubes}){
                
                my $table = $self->cube_store->fetch($_);
                
                $table->reset unless $table->schema->name eq $self->base_cube_name;
                
            }
            
            delete $self->{meta_data}->{system}->{has_been_rolled_up};
            
            return $self;
        }

reset_measures

    The reset_measures method allows you to reset all measure values associated with a specific tuple, to their default values. Sums are set to 0, products to 1, internal count distinct hashes are reset to {}, etc.

    An example:

       use YAML;
       sub dump_yaml { print YAML::Dump \@_ }
     
       my %data = (
            year        => '2005',
            quarter     => 'Q1',
            month       => '03',
            day         => '15',
            country     => 'US',
            salesperson => 'Sorvino',
            product     => 'Pencil',
        );
    
    
        dump_yaml($cube->get_measures(\%data));
    
        $cube->reset_measures(\%data);
    
        dump_yaml($cube->get_measures(\%data));
    
        # --------------------------------------------------------------------------
        # prints
        # --------------------------------------------------------------------------
        # average:
        #   price_per_unit:
        #     observations: 1
        #     sum_total: 2.99
        # sum:
        #   dollar_volume: 167.44
        #   units_sold: 56
        #
        #
        # average:
        #   price_per_unit:
        #     observations: 0
        #     sum_total: 0
        # sum:
        #   dollar_volume: 0
        #   units_sold: 0
        # --------------------------------------------------------------------------

    This method explodes if the cube has already been rolled up when called.

delete

    The delete method removes the index and measures associated with a specific tuple.

    An example:

        my %data = (
            year        => '2005',
            quarter     => 'Q1',
            month       => '03',
            day         => '15',
            country     => 'US',             
            salesperson => 'Sorvino',
            product     => 'Pencil',
        );
        
        $cube->delete(\%data);
        
        # base table no longer contains 'US 15  03  Pencil  Q1  Sorvino 2005'

    This method explodes if the cube has already been rolled up when called.

decrement_key_count

    The decrement_key_count method decreases the unqualified 'count' measure associated with a dimension instance by 1.

    An example:

        $cube->decrement_key_count(\%data);

    This method explodes if the cube has already been rolled up when called.

drop_count / decrement_multi_count

    These methods modify the contents of ('count','field') and ('multi_count','field') measures.

    An example:

        $cube->drop_count('users',\%data);
        
        $cube->decrement_multi_count('field',\%data);

    This warrants a small example:

        use strict;
        use warnings;
            
        use DataCube;
        use DataCube::Schema;
            
        my $schema = DataCube::Schema->new;
        
        $schema->add_dimension('year');
        $schema->add_dimension('country');
        
        $schema->add_measure('count','product');
        $schema->add_measure('multi_count','salesperson');
        
        my $cube = DataCube->new($schema);
        
        $cube->load_data_infile('sales.tsv');
        
        my %data = (
            year        => '2006',
            country     => 'US',
            product     => 'Pencil',
            salesperson => 'Andrews',
        );
        
        my $filter = sub {
            my($data) = @_;
            return 1
                if $data->{datakey}->{year}        eq '2006'
                && $data->{datakey}->{country}     eq 'US'
        };
        
        my @query = ('country','year',$filter);
        
        dmp($cube->query_measures( @query ));
        
        $cube->drop_count('product',\%data);
        
        dmp($cube->query_measures(  @query ));
            
        $cube->decrement_multi_count('salesperson',\%data);
        
        dmp($cube->query_measures(  @query ));
    
        sub dmp {
            use Data::Dumper;
            print Dumper \@_;
        }

    This prints the following (without the comments):

        # first, here is the original row in the base table
        # --------------------------------------------------------------------------------
        
        $VAR1 = [
                  {
                    'datakey' => {
                                   'country' => 'US',
                                   'year' => '2006'
                                 },
                    'measures' => {
                                    'count' => {
                                                 'product' => {
                                                                'Pen' => undef,
                                                                'Desk' => undef,
                                                                'PenSet' => undef,
                                                                'Pencil' => undef,
                                                                'Binder' => undef
                                                              }
                                               },
                                    'multi_count' => {
                                                       'salesperson' => {
                                                                          'Howard' => 1,
                                                                          'Sorvino' => 3,
                                                                          'Smith' => 1,
                                                                          'Andrews' => 3,
                                                                          'Kivell' => 2,
                                                                          'Thompson' => 1
                                                                        }
                                                     }
                                  }
                  }
                ];
        
        # after 'drop_count'
        # note that Pencils is now removed from the count of distinct products
        # --------------------------------------------------------------------------------
        
        $VAR1 = [
                  {
                    'datakey' => {
                                   'country' => 'US',
                                   'year' => '2006'
                                 },
                    'measures' => {
                                    'count' => {
                                                 'product' => {
                                                                'Pen' => undef,
                                                                'Desk' => undef,
                                                                'PenSet' => undef,
                                                                'Binder' => undef
                                                              }
                                               },
                                    'multi_count' => {
                                                       'salesperson' => {
                                                                          'Howard' => 1,
                                                                          'Sorvino' => 3,
                                                                          'Smith' => 1,
                                                                          'Andrews' => 3,
                                                                          'Kivell' => 2,
                                                                          'Thompson' => 1
                                                                        }
                                                     }
                                  }
                  }
                ];
        
        # after 'decrement_multi_count'
        # note that Andrews has made one fewer sale for 'US  2006'
        # --------------------------------------------------------------------------------
        
        $VAR1 = [
                  {
                    'datakey' => {
                                   'country' => 'US',
                                   'year' => '2006'
                                 },
                    'measures' => {
                                    'count' => {
                                                 'product' => {
                                                                'Pen' => undef,
                                                                'Desk' => undef,
                                                                'PenSet' => undef,
                                                                'Binder' => undef
                                                              }
                                               },
                                    'multi_count' => {
                                                       'salesperson' => {
                                                                          'Howard' => 1,
                                                                          'Sorvino' => 3,
                                                                          'Smith' => 1,
                                                                          'Andrews' => 2,
                                                                          'Kivell' => 2,
                                                                          'Thompson' => 1
                                                                        }
                                                     }
                                  }
                  }
                ];
        
        # get it?
        # --------------------------------------------------------------------------------

    Granted, these methods are technical and should be used with great caution.

    These methods explode if the cube has already been rolled up when called.

Convenience Methods

    These methods provide wrappers around slightly complicated operations. They are meant to be sugary and save you development time.

lazy_rc

    The lazy_rc method uses an iterative rollup algorithm to rollup and commit the cube. It will be more memory efficient and slightly slower than first calling rollup and then calling commit.

       my $target = '/some/folder';
       
       $cube->insert( a bunch of data *see above* );
       
       $cube->lazy_rc($target);

    lazy_rc will generally use twice as much virtual memory as your perl process did right after inserting a bunch of data (as opposed to n times where n is the number of internal lattice points (ie different rollups)).

store / retrieve

    The store and retrieve methods provide ways to store your cube to a single file on disk and slurp it back in later.

    Example:

        $cube->store('sales.cube');
        
        # then at the beginning of some other script:
        
        my $cube = DataCube->new->retrieve('sales.cube');
        
        # which is equivalent to
        # ------------------------------------------------
        # $cube = DataCube->new;
        # $cube->retrieve('sales.cube');
        

    The new constructor can also take this file path instead of a schema object:

        $cube->store('sales.cube');
        
        my $clone = DataCube->new('sales.cube');

    which brings us to:

clone

    The clone methods provides a deep copy of a data cube.

    Example:

        my $clone = $cube->clone;
        
        # now anything done to $clone does not affect $cube

Examination Methods

    These methods help you peek inside your cube.

describe

    Its like mysql describe but for data cubes. Kewl huh.

    Produces text output. Try it:

      use DataCube;
      use DataCube::Schema;
      
      my $schema = DataCube::Schema->new;
      
      $schema->add_dimension('foo');
      $schema->add_dimension('bar');
      
      $schema->add_hierarchy('goo','gaz','waka_waka');
      
      $schema->add_measure('count');
      $schema->add_measure('count','stuff');
      $schema->add_measure('average','blah_blah_blah');
      
      my $cube = DataCube->new($schema);
      
      $cube->describe;

get_measures

    The get_measures method provides real-time access to the cube.

    Consider the example in the synopsis section:

      [...]
      
      $cube->rollup;
      
      my %data = (
        country     => 'US',
        salesperson => 'Sorvino'
      );
      
      my $measures = $cube->get_measures(\%data);

    This will query the 'country', 'salesperson' table for the instance 'US, Sorvino' and will return a hashref of the measures and their values.

    You can query the cube's base table at insertion time like so:

      my %data = ( ... );
      
      $cube->insert(\%data);
      
      my $measures = $cube->get_measures(\%data);

    For a richer query interface, see query_measures or DataCube::Query; For information about measures, see DataCube::Schema;

get_measures_by_id

    The get_measures_by_id method fetches measures associated with a unique record identifier from the base table.

        my $record_id = $cube->insert(\%data);
        
        my $measures = $cube->get_measures_by_id( $record_id );
        
        # same as $cube->get_measures(\%data);

    Note: This method only works on the base table.

query_measures

    The query_measures method provides real-time search for data cubes.

    Consider the example in the synopsis section:

        [...]
        
        $cube->rollup;
        
        my $filter = sub {
            my($data) = @_;
            
            my $datakey  = $data->{datakey};
            my $measures = $data->{measures};
            
            return 1 if
                   $datakey->{country} =~ /^(us|ca)$/i
                or $measures->{sum}->{units_sold}    > 200
                or $measures->{sum}->{dollar_volume} > 2500
        };
        
        my @results = $cube->query_measures('country','salesperson', $filter);

    This will query the 'country', 'salesperson' table in a case insensitive fashion for all rows where any of the following conditions apply:

        1. the country is 'us' or 'ca'
        2. more than 200 units have been sold
        2. more than 2500 dollars in revenue have been generated

    Results will be returned as an array of hash references, in the same format expected by the $filter callback above.

Status Methods

    These methods provide infomation about the state of the cube.

has_been_rolled_up

    This method will return true if the cube has been rolled up, 0 otherwise.

    Note: lazy_rc does not set the has_been_rolled_up flag to true.

IMPORTANT

There are 4 rules for DataCube that must never be broken.

    1.  Your data may not contain tab character(s)
    2.  You may not have a dimension called 'overall'  
    3.  All named entries from your schema must exist and be defined at insertion time
    4.  Do not use double underscores in the names of dimensions, hierarchies or measure fields.

Example:

    use DataCube;
    use DataCube::Schema;
  
    my $schema = DataCube::Schema->new;
  
    $schema->add_dimension('foo');
    $schema->add_dimension('bar');
  
    $schema->add_measure('count');
  
  
    # breaks rule 1
    # ---------------------------
    
    $cube->insert({
        foo => 12,
        bar => "\tsome_text",
    });


    # breaks rule 3
    # ---------------------------
    
    $cube->insert({
        bar => "some_text",
    });

Incidentally, DataCube will not check these for you, ever, so consider yourself warned.

EXPORT

This module does not export anything. It is object oriented.

NOTES

ACID Compliance

    The DataCube API provides operations that guarantee reliable transactions. Please see the blackbear cookbook for details.

SEE ALSO

Wikipedia on OLAP Cubes:

http://en.wikipedia.org/wiki/OLAP_cube

Other Data Cubing Engines:

AUTHOR

David Williams, <david@namimedia.com>

COPYRIGHT AND LICENSE

Copyright (C) 2009-Now by David Williams

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.