# NAME

Cookbook - Tips and Tricks for Business Intelligence with DataCube

# DESCRIPTION

This document contains a collection of practical techniques for accomplishing large scale Business Intelligence tasks with DataCube. These techniques are what I am using at the moment. Feel free to use them, hack them, turn them inside out and do it again.

I hope they will save you time.

## Emulating SQL functions

### Problem

You want to emulate SQL's

```
count(column_name)
count(distinct column_name)
```

aggregation functions.

### Solution

Use the count measure provided by DataCube::Schema.

Example: Suppose you have a database that stores sales data.
In particular, suppose that the salesperson, country and product
associated with each sale are recorded.  Set up a Schema like so:

```
my $schema = DataCube::Schema->new;

# dimension and measure setup
# ( assuming DataCube::Schema's add_dimension / add_measure interface ):
$schema->add_dimension('country');
$schema->add_dimension('salesperson');
$schema->add_measure('count');                    # count(*)
$schema->add_measure('multi_count','product');    # count(distinct product)
```

### Discussion

Any cube made from this schema will track the number of sales made
by each salesperson in each country, along with the corresponding number
of different products.

The SQL equivalent of the cube's base_table would be as follows:

```
select
    country
  , salesperson
  , count(*)
  , count(distinct products)
from
    sales
group by
    country, salesperson
```

## use less 'ram';

### Problem

You want to reduce the memory used by the rollup process.

### Solution

Use lazy_rollup:

```
while ( my $next_cube = $cube->lazy_rollup ) {

    $next_cube->commit( 'data_warehouse/archive' );

}
```

### Discussion

The rollup method uses:

```
O( m )      space complexity
O( log(m) ) time  complexity
```

whereas the lazy_rollup method uses:

```
O( 1 ) space complexity
O( m ) time  complexity
```

where m is the number of rollup tables (aka reports) to be generated.

These are estimates of the best / worst / average case complexity;
treat them as rough guides until I prove the actual bounds.

## use more 'cpus';

### Problem

You want to fork child processes to launch parallel cube workers.

In particular, you want to commit a cube in parallel with feeding it more data.

### Solution

Use Perl's fork and the reset method:

```
for ( @files ) {

    # ( load the data from $_ into $cube here )

    fork_me( $cube );

    $cube->reset;

}

sub fork_me {
    my ($cube) = @_;

    my $pid = fork;
    die "cant fork: $!" unless defined $pid;
    return if $pid;                      # parent returns immediately

    $cube->rollup;                       # child does the heavy lifting
    $cube->commit( $commit_target );

    exit;
}
```
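One detail the solution leaves out: the parent never waits on its children, so
finished workers linger as zombies.  A minimal reaping sketch using only core
Perl (the helper names here are illustrative, not part of DataCube):

```
use strict;
use warnings;
use POSIX ':sys_wait_h';

# fork a worker to run $work, returning immediately in the parent
sub fork_worker {
    my ($work) = @_;
    my $pid = fork;
    die "cant fork: $!" unless defined $pid;
    if ( $pid == 0 ) {
        $work->();    # child does its job and exits
        exit 0;
    }
    return $pid;
}

# reap any finished children without blocking the parent
sub reap_workers {
    1 while waitpid(-1, WNOHANG) > 0;
}
```

Calling reap_workers at the top of each loop iteration keeps the process table
clean; alternatively, `$SIG{CHLD} = 'IGNORE'` tells the kernel to discard child
exit statuses entirely.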

### Discussion

No discussion.  This just works.

Just be sure you don't start a new commit while the last one is still running.

(That is, develop a locking system.)
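One way to build that locking system with nothing but core Perl is an advisory
flock on a sentinel file.  A sketch, where the lock path and helper name are
illustrative:

```
use strict;
use warnings;
use Fcntl ':flock';

# run $code under an exclusive advisory lock on $lock_path;
# returns 0 immediately if another commit still holds the lock
sub with_commit_lock {
    my ($lock_path, $code) = @_;
    open my $fh, '>', $lock_path or die "cant open $lock_path: $!";
    unless ( flock($fh, LOCK_EX | LOCK_NB) ) {
        close $fh;
        return 0;                # a previous commit is still running
    }
    $code->();
    flock($fh, LOCK_UN);
    close $fh;
    return 1;
}
```

Usage might look like:

```
with_commit_lock( 'data_warehouse/.commit.lock',
    sub { $cube->commit( $commit_target ) } )
        or warn "commit skipped: previous commit still running\n";
```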

## Commit some tables, report others

### Problem

You want to commit some of your tables to disk for long term storage,
but simply report the others.

Example:

```
my $schema = DataCube::Schema->new;

# ( imagine a schema whose dimensions include 'day' and 'month' )
```

Now imagine you wanted to commit the monthly numbers
while only reporting the daily numbers.

### Solution

Use lazy_rollup and dispatch events as you see fit:

```
while ( my $next_cube = $cube->lazy_rollup ) {

    dispatch: {

        $next_cube->report( 'data_warehouse/reports' )
            and last dispatch
                if exists $next_cube->schema->field_names->{day};

        $next_cube->commit( 'data_warehouse/archive' )
            and last dispatch;

    }

}
```

### Discussion

This solution will commit the monthly numbers, but only report the daily numbers.

More complex dispatch mechanisms (including multiple dispatch) are certainly possible.
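When the inline dispatch block grows, a dispatch table of (test, handler) pairs
keeps it manageable.  A sketch, where the field-name tests and the handler
return values are illustrative stand-ins for calls like report and commit:

```
use strict;
use warnings;

# ordered (test, handler) pairs; the first matching test wins
my @dispatch = (
    [ sub { exists $_[0]->{day}   }, sub { 'report' } ],
    [ sub { exists $_[0]->{month} }, sub { 'commit' } ],
);

# decide what to do with a cube, given its schema's field names
sub dispatch_cube {
    my ($field_names) = @_;
    for my $pair (@dispatch) {
        my ($test, $handler) = @$pair;
        return $handler->($field_names) if $test->($field_names);
    }
    return;    # no handler matched
}
```

Inside the lazy_rollup loop you would pass `$next_cube->schema->field_names`
and have the handlers call report or commit on the cube.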

## ACID Compliant Transactions

### Problem

You want to achieve ACID compliance.
In particular, you want to start by ensuring Atomicity during the commit process.

For those not familiar with Atomicity, this means you want to
make sure a commit process happens entirely or not at all.

If your machine loses power in the middle of the commit process,
your data committed to date will not be corrupted.

### Solution

ACID compliant commits can be achieved like so:

```
require DataCube::FileUtils;
require DataCube::FileUtils::CubeMerger;

my @errors;

my $commit_target      = 'data_warehouse/cubes/my_cube';
my $transaction_target = 'data_warehouse/cubes/my_cube.temp';

mkdir($transaction_target)
    or die "cant make dir:\n$transaction_target\n$!\n";

my $utils  = DataCube::FileUtils->new;
my $merger = DataCube::FileUtils::CubeMerger->new;

txn_begin:
{
    eval {
        while ( my $next_cube = $cube->lazy_rollup ) {
            $next_cube->commit($transaction_target);
        }
        $merger->merge(
            source => $commit_target,
            target => $transaction_target,
        );
    };
    push @errors, $@ if $@;

    system('sync');

    unless ( $? == 0 ) {
        push @errors, "sync returned a bad status: $?\n";
    }
}

txn_rollback:
{
    last txn_rollback unless @errors;
    eval {
        # roll back by discarding the partially written transaction target
        # ( File::Path is core Perl; any recursive delete will do )
        require File::Path;
        File::Path::rmtree($transaction_target);
    };
    my $error = $@ ? "there were rollback errors:\n$@\n" : '';
    die "txn_rollback complete\ncause of rollback:\n@errors\n$error\n";
}

txn_finish:
{
    my $pending = "$commit_target.pending";

    eval {

        rename($commit_target, $pending)            or die "final-stage-1: $!\n";
        rename($transaction_target, $commit_target) or die "final-stage-2: $!\n";

        # or you can keep the $pending directory as an incremental 'checkpoint'

    };

    push @errors, $@ if $@;
    die join("\n", @errors) if @errors;
}

# dying at final-stage-2 is the only place that
# requires manual intervention to fix a broken commit target.

return 1;
```

### Discussion

This approach will take slightly longer than simply calling

```
$cube->commit( $commit_target );
```

However, your data will always be in a consistent state.

Notice that for the code in the solution to work, you must have the 'sync'
utility on your system, and it must return 0 on success.

# AUTHOR

David Williams, <david@namimedia.com>