DataCube - An Object Oriented Perl Module for Data Mining, Data Warehousing, and creating OLAP cubes.
    use strict;
    use warnings;
    use DataCube;
    use DataCube::Schema;

    # first, make yourself a schema
    my $schema = DataCube::Schema->new;

    $schema->add_dimension('country');
    $schema->add_dimension('product');
    $schema->add_dimension('salesperson');

    $schema->add_hierarchy('year','quarter','month','day');

    $schema->add_measure('sum','units_sold');
    $schema->add_measure('sum','dollar_volume');
    $schema->add_measure('average','price_per_unit');

    # make a cube from a schema
    my $cube = DataCube->new($schema);

    # get your hands on some data
    my @data = example_sales_data();

    # insert your data into the cube, one hashref at a time
    for(my $i = 0; $i < @data; $i++){

        my $date = $data[$i][0];
        my($month,$day,$year) = split/\//,$date;

        my $quarter = $month > 0 && $month < 4  ? 'Q1'
                    : $month > 3 && $month < 7  ? 'Q2'
                    : $month > 6 && $month < 10 ? 'Q3'
                    : $month > 9 && $month < 13 ? 'Q4'
                    : '';

        my %data = ();

        $data{year}    = $year;
        $data{quarter} = $quarter;
        $data{month}   = sprintf("%02d",$month);
        $data{day}     = sprintf("%02d",$day);

        $data{country}     = $data[$i][1];
        $data{salesperson} = $data[$i][2];
        $data{product}     = $data[$i][3];

        $data{units_sold}     = $data[$i][4];
        $data{price_per_unit} = $data[$i][5];
        $data{dollar_volume}  = $data[$i][6];

        $cube->insert(\%data);
    }

    # generate all the rollups
    $cube->rollup;

    # make all your reports
    my $target = '/home/david/data_warehouse/reports/sales_data';
    $cube->report($target);

    # alternatively, save your work to disk for later
    my $cube_store = '/home/david/data_warehouse/cubes/sales_cube';
    $cube->commit($cube_store);

    # congratulations, you now have a data warehouse

    sub example_sales_data {
        my $data = '
            Date         Country   SalesPerson   Product   Units   Unit_Cost   Total
            3/15/2005    US        Sorvino       Pencil    56      2.99        167.44
            3/7/2006     US        Sorvino       Binder    7       19.99       139.93
            8/24/2006    US        Sorvino       Desk      3       275.00      825.00
            9/27/2006    US        Sorvino       Pen       76      1.99        151.24
            5/22/2005    US        Thompson      Pencil    32      1.99        63.68
            10/14/2006   US        Thompson      Binder    57      19.99       1139.43
            4/18/2005    US        Andrews       Pencil    75      1.99        149.25
            4/10/2006    US        Andrews       Pencil    66      1.99        131.34
            10/31/2006   US        Andrews       Pencil    114     1.29        147.06
            12/21/2006   US        Andrews       Binder    28      4.99        139.72
            2/26/2005    CA        Gill          Pen       51      19.99       1019.49
            1/15/2006    CA        Gill          Binder    46      8.99        413.54
            5/14/2006    CA        Gill          Pencil    94      1.29        121.26
            5/31/2006    CA        Gill          Binder    102     8.99        916.98
            9/10/2006    CA        Gill          Pencil    98      1.29        126.42
            2/9/2005     UK        Jardine       Pencil    125     4.99        623.75
            5/5/2005     UK        Jardine       Pencil    90      4.99        449.10
            3/24/2006    UK        Jardine       PenSet    76      4.99        379.24
            11/17/2006   UK        Jardine       Binder    39      4.99        194.61
            12/4/2006    UK        Jardine       Binder    94      19.99       1879.06
            1/23/2005    US        Kivell        Binder    50      19.99       999.50
            11/25/2005   US        Kivell        PenSet    96      4.99        479.04
            6/17/2006    US        Kivell        Desk      5       125.00      625.00
            8/7/2006     US        Kivell        PenSet    42      23.95       1005.90
            6/25/2005    UK        Morgan        Pencil    90      4.99        449.10
            10/5/2005    UK        Morgan        Binder    28      8.99        251.72
            7/21/2006    UK        Morgan        PenSet    55      12.49       686.95
            9/1/2005     US        Smith         Desk      2       125.00      250.00
            12/12/2005   US        Smith         Pencil    67      1.29        86.43
            2/1/2006     US        Smith         Binder    87      15.00       1305.00
            7/12/2005    US        Howard        Binder    29      1.99        57.71
            4/27/2006    US        Howard        Pen       96      4.99        479.04
            1/6/2005     CA        Jones         Pencil    95      1.99        189.05
            4/1/2005     CA        Jones         Binder    76      4.99        379.24
            6/8/2005     CA        Jones         Binder    60      8.99        539.40
            8/15/2005    US        Jones         Pencil    35      4.99        174.65
            9/18/2005    US        Jones         PenSet    16      15.99       255.84
            10/22/2005   US        Jones         Pen       64      8.99        575.36
            2/18/2006    CA        Jones         Binder    4       4.99        19.96
            7/4/2006     CA        Jones         PenSet    61      4.99        304.39
            7/29/2005    UK        Hogan         Binder    81      19.99       1619.19
            11/8/2005    UK        Hogan         Pen       12      19.99       239.88
            12/29/2005   UK        Hogan         PenSet    74      15.99       1183.26
        ';
        my @data = map { [ grep { /\S/ } split/\s+/ ] } grep { /\S/ } split/\n+/,$data;
        shift @data;
        return @data;
    }
This module provides a pure Perl, object-oriented, embeddable data cubing engine. It is self-contained and ready to use in data mining and data warehousing applications.
This module provides several methods to create, store, modify and access DataCubes.
These methods expose the core DataCube API. They are presented in the order in which you will probably want to use them.
All examples below follow the example from the new method.
The new constructor should be used with a schema like so:
    use DataCube;
    use DataCube::Schema;

    my $schema = DataCube::Schema->new;

    $schema->add_dimension('foo');
    $schema->add_dimension('bar');
    $schema->add_hierarchy('goo','gaz','waka_waka');

    $schema->add_measure('count');
    $schema->add_measure('count','stuff');
    $schema->add_measure('average','blah_blah_blah');

    my $cube = DataCube->new($schema);
Now that you have a data cube, insert your data, one hashref at a time:
    foreach my $row (@in_some_data_set) {

        my %data = ();

        $data{foo}            = $row->[0];
        $data{bar}            = $row->[1];
        $data{goo}            = $row->[2];
        $data{gaz}            = $row->[3];
        $data{stuff}          = $row->[4];
        $data{waka_waka}      = $row->[5];
        $data{blah_blah_blah} = $row->[6];

        $cube->insert(\%data);
    }
The insert method returns a unique identifier of the inserted record.
Notice that each field from the schema is present and populated by name at insertion time.
DataCube will *not* under any circumstances perform sanity checks for you, ever.
The load_data_infile method will batch insert the contents of an entire text file into a cube:
$cube->load_data_infile( $file );
The file format is simple:
    1. tab delimited fields
    2. newline delimited records
    3. column names at the top; the column names must match those in your schema
Example:
Let's say I have a file called 'revenue.tsv' which looks like this:
    country    product    year    revenue
    us         pens       2009    1856.45
    us         pencils    2008    20495.90
    [ ... ]
Then I can load it into a cube as follows:
    my $schema = DataCube::Schema->new;

    $schema->add_dimension('year');
    $schema->add_dimension('country');
    $schema->add_dimension('product');
    $schema->add_measure('sum','revenue');

    my $cube = DataCube->new( $schema );

    $cube->load_data_infile('revenue.tsv');
Note: it's OK if your text file contains more columns than your schema names.
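As a sketch, here is how you might produce such a file from Perl before loading it. This is plain Perl with no DataCube dependency; the file name and columns simply follow the revenue.tsv example above:

```perl
use strict;
use warnings;

# write a load_data_infile-compatible file:
# tab-delimited fields, newline-delimited records, column names on the first line
my @columns = qw(country product year revenue);
my @rows = (
    [ 'us', 'pens',    2009, 1856.45  ],
    [ 'us', 'pencils', 2008, 20495.90 ],
);

open my $fh, '>', 'revenue.tsv' or die "cannot open revenue.tsv: $!";
print {$fh} join("\t", @columns), "\n";
print {$fh} join("\t", @$_), "\n" for @rows;
close $fh;
```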
The rollup method will perform all the aggregations along all the dimensions and measures specified in your schema.
The aggregation happens in memory and is very fast.
This method dies if the cube has already been rolled up when called.
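If the cube's state is unknown, for example after retrieving it from disk, you can trap the error. This is a sketch, assuming rollup signals the error by dying:

```perl
# calling rollup twice is an error; trap it when the state is unknown
eval { $cube->rollup };
warn "rollup skipped: $@" if $@;
```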
The lazy_rollup method will perform all the aggregations along all the dimensions and measures specified in your schema, iteratively, as opposed to all at once.
An example use case:
    my $target = '/media/Raptor/reports';

    while(my $next_cube = $cube->lazy_rollup) {
        $next_cube->report($target);
    }
The advantage is that far less RAM is used, while the time it takes to generate the rollups increases only slightly.
The commit method saves your work to disk. It will create its own directory tree and will store files using its own internal logic.
Emitted files are binary, not human readable, and should not be modified by anything but DataCube.
    my $target = 'path/to/some/folder/that/exists';
    $cube->commit($target);
If the commit target exists, it will be updated in place and will reflect all commits to date.
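This makes incremental loads straightforward. For example, a nightly job can fold each new day's data into the same commit target. A sketch, assuming the $schema from the synopsis; the file name and paths here are made up:

```perl
my $cube = DataCube->new($schema);

$cube->load_data_infile('sales_today.tsv');    # hypothetical daily extract
$cube->rollup;

# the target already holds earlier commits; this updates it in place
$cube->commit('/data/warehouse/cubes/sales');
```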
The sync method uses the binary files in a commit target to create flat text file reports.
$cube->sync($target);
Flat text files will be located alongside their binary files under the name '.report'. In our example, 16 rollup tables will have been created and committed to disk. They will have 32-character hexadecimal names:
    my $target = 'path/to/some/folder/that/exists';
    $cube->commit($target);

    # suppose you now have a folder called
    # 'path/to/some/folder/that/exists/882e5cd3dd0a5a69ea992891621c6715'

    # you can now call sync:

    $cube->sync($target);

    # to create a flat text file report at this location:
    # 'path/to/some/folder/that/exists/882e5cd3dd0a5a69ea992891621c6715/.report'
Before the call to sync is made, the .report file may not exist, or may exist but be out of date. Be careful.
The report method creates flat text file reports in the specified location.
$cube->report($target);
Flat text files will be located in the $target directory, with human readable names.
If two directories are passed to the report method, flat text file reports will be generated from a prior commit target, like so:
    my $commit_target = 'data_warehouse/cubes/my_cube';
    my $report_target = 'data_warehouse/reports/my_reports';

    $cube->commit($commit_target);
    $cube->report($commit_target, $report_target);
Now the $report_target directory will contain reports which represent the cumulative work in the $commit_target.
The report_html method is the same as the report method but produces HTML reports instead of flat text files. See 'report'.
These methods provide basic read / write access to the internal contents of the cube.
The reset method resets the internal state of the cube:
$cube->reset;
This is equivalent to
    $cube = DataCube->new( $cube->schema );
In other words, the reset method empties the data from all the cube's internal tables.
The unroll method reverts the cube to its pre-rollup state; it has no effect on cubes that have not been rolled up:
$cube->unroll;
It is implemented as follows:
    sub unroll {
        my($self) = @_;
        for(keys %{$self->cube_store->cubes}){
            my $table = $self->cube_store->fetch($_);
            $table->reset unless $table->schema->name eq $self->base_cube_name;
        }
        delete $self->{meta_data}->{system}->{has_been_rolled_up};
        return $self;
    }
The reset_measures method allows you to reset all measure values associated with a specific tuple to their default values. Sums are set to 0, products to 1, internal count-distinct hashes are reset to {}, etc.
An example:
    use YAML;

    sub dump_yaml { print YAML::Dump \@_ }

    my %data = (
        year        => '2005',
        quarter     => 'Q1',
        month       => '03',
        day         => '15',
        country     => 'US',
        salesperson => 'Sorvino',
        product     => 'Pencil',
    );

    dump_yaml($cube->get_measures(\%data));

    $cube->reset_measures(\%data);

    dump_yaml($cube->get_measures(\%data));

    # --------------------------------------------------------------------------
    # prints
    # --------------------------------------------------------------------------
    #
    # average:
    #   price_per_unit:
    #     observations: 1
    #     sum_total: 2.99
    # sum:
    #   dollar_volume: 167.44
    #   units_sold: 56
    #
    #
    # average:
    #   price_per_unit:
    #     observations: 0
    #     sum_total: 0
    # sum:
    #   dollar_volume: 0
    #   units_sold: 0
    # --------------------------------------------------------------------------
The delete method removes the index and measures associated with a specific tuple.
    my %data = (
        year        => '2005',
        quarter     => 'Q1',
        month       => '03',
        day         => '15',
        country     => 'US',
        salesperson => 'Sorvino',
        product     => 'Pencil',
    );

    $cube->delete(\%data);

    # the base table no longer contains 'US 15 03 Pencil Q1 Sorvino 2005'
The decrement_key_count method decreases the unqualified 'count' measure associated with a dimension instance by 1.
$cube->decrement_key_count(\%data);
These methods modify the contents of ('count','field') and ('multi_count','field') measures.
    $cube->drop_count('field',\%data);
    $cube->decrement_multi_count('field',\%data);
This warrants a small example:
    use strict;
    use warnings;
    use DataCube;
    use DataCube::Schema;

    my $schema = DataCube::Schema->new;

    $schema->add_dimension('year');
    $schema->add_dimension('country');

    $schema->add_measure('count','product');
    $schema->add_measure('multi_count','salesperson');

    my $cube = DataCube->new($schema);

    $cube->load_data_infile('sales.tsv');

    my %data = (
        year        => '2006',
        country     => 'US',
        product     => 'Pencil',
        salesperson => 'Andrews',
    );

    my $filter = sub {
        my($data) = @_;
        return 1 if $data->{datakey}->{year}    eq '2006'
                 && $data->{datakey}->{country} eq 'US';
    };

    my @query = ('country','year',$filter);

    dmp($cube->query_measures( @query ));

    $cube->drop_count('product',\%data);

    dmp($cube->query_measures( @query ));

    $cube->decrement_multi_count('salesperson',\%data);

    dmp($cube->query_measures( @query ));

    sub dmp {
        use Data::Dumper;
        print Dumper \@_;
    }
This prints the following (without the comments):
    # first, here is the original row in the base table
    # --------------------------------------------------------------------------------
    $VAR1 = [
        {
            'datakey' => {
                'country' => 'US',
                'year'    => '2006'
            },
            'measures' => {
                'count' => {
                    'product' => {
                        'Pen'    => undef,
                        'Desk'   => undef,
                        'PenSet' => undef,
                        'Pencil' => undef,
                        'Binder' => undef
                    }
                },
                'multi_count' => {
                    'salesperson' => {
                        'Howard'   => 1,
                        'Sorvino'  => 3,
                        'Smith'    => 1,
                        'Andrews'  => 3,
                        'Kivell'   => 2,
                        'Thompson' => 1
                    }
                }
            }
        }
    ];

    # after 'drop_count'
    # note that Pencil is now removed from the count of distinct products
    # --------------------------------------------------------------------------------
    $VAR1 = [
        {
            'datakey' => {
                'country' => 'US',
                'year'    => '2006'
            },
            'measures' => {
                'count' => {
                    'product' => {
                        'Pen'    => undef,
                        'Desk'   => undef,
                        'PenSet' => undef,
                        'Binder' => undef
                    }
                },
                'multi_count' => {
                    'salesperson' => {
                        'Howard'   => 1,
                        'Sorvino'  => 3,
                        'Smith'    => 1,
                        'Andrews'  => 3,
                        'Kivell'   => 2,
                        'Thompson' => 1
                    }
                }
            }
        }
    ];

    # after 'decrement_multi_count'
    # note that Andrews has made one fewer sale for 'US 2006'
    # --------------------------------------------------------------------------------
    $VAR1 = [
        {
            'datakey' => {
                'country' => 'US',
                'year'    => '2006'
            },
            'measures' => {
                'count' => {
                    'product' => {
                        'Pen'    => undef,
                        'Desk'   => undef,
                        'PenSet' => undef,
                        'Binder' => undef
                    }
                },
                'multi_count' => {
                    'salesperson' => {
                        'Howard'   => 1,
                        'Sorvino'  => 3,
                        'Smith'    => 1,
                        'Andrews'  => 2,
                        'Kivell'   => 2,
                        'Thompson' => 1
                    }
                }
            }
        }
    ];

    # get it?
    # --------------------------------------------------------------------------------
Granted, these methods are technical and should be used with great caution.
These methods die if the cube has already been rolled up when called.
These methods provide wrappers around slightly complicated operations. They are meant to be sugary and save you development time.
The lazy_rc method uses an iterative rollup algorithm to rollup and commit the cube. It will be more memory efficient and slightly slower than first calling rollup and then calling commit.
    my $target = '/some/folder';

    # insert a bunch of data (see above)

    $cube->lazy_rc($target);
lazy_rc will generally use about twice as much virtual memory as your Perl process did right after inserting the data (as opposed to n times as much, where n is the number of internal lattice points, i.e. distinct rollups).
The store and retrieve methods provide ways to store your cube to a single file on disk and slurp it back in later.
    $cube->store('sales.cube');

    # then at the beginning of some other script:

    my $cube = DataCube->new->retrieve('sales.cube');

    # which is equivalent to
    # ------------------------------------------------
    # $cube = DataCube->new;
    # $cube->retrieve('sales.cube');
The new constructor can also take this file path instead of a schema object:
    $cube->store('sales.cube');

    my $clone = DataCube->new('sales.cube');
which brings us to:
The clone method provides a deep copy of a data cube.
    my $clone = $cube->clone;

    # now anything done to $clone does not affect $cube
These methods help you peek inside your cube.
It's like MySQL's DESCRIBE, but for data cubes. Kewl, huh?
Produces text output. Try it:
    use DataCube;
    use DataCube::Schema;

    my $schema = DataCube::Schema->new;

    $schema->add_dimension('foo');
    $schema->add_dimension('bar');
    $schema->add_hierarchy('goo','gaz','waka_waka');

    $schema->add_measure('count');
    $schema->add_measure('count','stuff');
    $schema->add_measure('average','blah_blah_blah');

    my $cube = DataCube->new($schema);

    $cube->describe;
The get_measures method provides real-time access to the cube.
Consider the example in the synopsis section:
    [...]

    $cube->rollup;

    my %data = (
        country     => 'US',
        salesperson => 'Sorvino',
    );

    my $measures = $cube->get_measures(\%data);
This will query the 'country', 'salesperson' table for the instance 'US, Sorvino' and will return a hashref of the measures and their values.
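The returned hashref mirrors the measure structure shown under reset_measures. A sketch, following the synopsis schema:

```perl
my $measures = $cube->get_measures({ country => 'US', salesperson => 'Sorvino' });

# 'sum' measures are keyed by measure type, then field name
printf "units sold:    %d\n",   $measures->{sum}->{units_sold};
printf "dollar volume: %.2f\n", $measures->{sum}->{dollar_volume};
```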
You can query the cube's base table at insertion time like so:
    my %data = ( ... );

    $cube->insert(\%data);

    my $measures = $cube->get_measures(\%data);
For a richer query interface, see query_measures or DataCube::Query. For information about measures, see DataCube::Schema.
The get_measures_by_id method fetches measures associated with a unique record identifier from the base table.
    my $record_id = $cube->insert(\%data);

    my $measures = $cube->get_measures_by_id( $record_id );

    # same as $cube->get_measures(\%data);
Note: This method only works on the base table.
The query_measures method provides real-time search for data cubes.
    [...]

    $cube->rollup;

    my $filter = sub {
        my($data)    = @_;
        my $datakey  = $data->{datakey};
        my $measures = $data->{measures};
        return 1 if $datakey->{country} =~ /^(us|ca)$/i
                 or $measures->{sum}->{units_sold}    > 200
                 or $measures->{sum}->{dollar_volume} > 2500;
    };

    my @results = $cube->query_measures('country','salesperson', $filter);
This will query the 'country', 'salesperson' table for all rows where any of the following conditions apply:

    1. the country is 'us' or 'ca' (case insensitive)
    2. more than 200 units have been sold
    3. more than 2500 dollars in revenue have been generated
Results will be returned as an array of hash references, in the same format expected by the $filter callback above.
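For example (a sketch; the field names follow the synopsis schema), you could walk the results like this:

```perl
for my $row (@results) {
    my $key      = $row->{datakey};
    my $measures = $row->{measures};

    # each result carries the dimension values and the aggregated measures
    printf "%s / %s : %d units\n",
        $key->{country},
        $key->{salesperson},
        $measures->{sum}->{units_sold};
}
```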
These methods provide information about the state of the cube.
This method returns true if the cube has been rolled up, and 0 otherwise.
Note: lazy_rc does not set the has_been_rolled_up flag to true.
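When a cube was not produced via lazy_rc, the flag makes a convenient guard (a sketch):

```perl
# roll up only if it has not happened yet
$cube->rollup unless $cube->has_been_rolled_up;
```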
There are 4 rules for DataCube that must never be broken.
    1. Your data may not contain tab characters.
    2. You may not have a dimension called 'overall'.
    3. All named entries from your schema must exist and be defined at insertion time.
    4. Do not use double underscores in the names of dimensions, hierarchies or measure fields.
    use DataCube;
    use DataCube::Schema;

    my $schema = DataCube::Schema->new;

    $schema->add_dimension('foo');
    $schema->add_dimension('bar');
    $schema->add_measure('count');

    my $cube = DataCube->new($schema);

    # breaks rule 1
    # ---------------------------
    $cube->insert({
        foo => 12,
        bar => "\tsome_text",
    });

    # breaks rule 3
    # ---------------------------
    $cube->insert({
        bar => "some_text",
    });
Incidentally, DataCube will not check these for you, ever, so consider yourself warned.
This module does not export anything. It is object oriented.
The DataCube API provides operations that guarantee reliable transactions. Please see the blackbear cookbook for details.
Wikipedia on OLAP Cubes:
http://en.wikipedia.org/wiki/OLAP_cube
Other Data Cubing Engines:
David Williams, <david@namimedia.com>
Copyright (C) 2009-Now by David Williams
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.
To install DataCube, copy and paste the appropriate command in to your terminal.
cpanm
cpanm DataCube
CPAN shell
    perl -MCPAN -e shell
    install DataCube
For more information on module installation, please visit the detailed CPAN module installation guide.