David Williams > DataCube > DataCube::Schema

Download:
DataCube.tgz

Dependencies

Annotate this POD

CPAN RT

New  1
Open  0
View/Report Bugs
Source  

NAME ^

DataCube::Schema - An Object Oriented Perl Module for creating Snowflake Schemas.

SYNOPSIS ^

  use strict;
  use warnings;
  
  use DataCube::Schema;
  

  # the new constructor
  
  my $schema = DataCube::Schema->new;
    
  
  # basic: adding dimensions, hierarchies and measures
  
  $schema->add_dimension('country');
  $schema->add_dimension('product');
  $schema->add_dimension('salesperson');
  
  $schema->add_hierarchy('year','quarter','month','day');
  
  $schema->add_measure('sum','units_sold');
  $schema->add_measure('sum','dollar_volume');
  $schema->add_measure('average','price_per_unit');
  
    
  # advanced: adding strict dimensions / hierarchies
  
  $schema->add_strict_dimension('country');
  $schema->add_strict_hierarchy('year','quarter','month','day');
  
  
  # advanced: suppressing lattice points
    
  $schema->suppress_lattice_point('country','salesperson');

DESCRIPTION ^

This module provides a pure perl, object oriented, embeddable Star and Snowflake Schema engine. It is self contained and ready to use in data mining and data warehousing applications.

All schemas created by this module are Snowflake Schemas.

BACKGROUND

Star and Snowflake Schemas are used to organize dimensions and measurements in Data Warehouses. Snowflake Schemas encompass all the functionality of Star Schemas, and provide direct support for hierarchies.

STAR SCHEMAS

In a Star Schema, many peripheral tables of data are joined to one central table called the "Fact Table". Each peripheral table represents a single dimension.

SNOWFLAKE SCHEMAS

The Snowflake Schema is an extension of the Star Schema, in which each peripheral "Dimension Table" holding hierarchical data is replaced by a group of tables representing that hierarchy.

To illustrate the difference, consider a single table in a Star Schema which contains a date field of the form "year/month/day". In a Snowflake Schema, this table would be replaced by 3 tables: one containg the year, one containing the month, and one containing the day, all linked together by a primary key / foreign key relationship.

FACT TABLES

A special table called the "Fact Table" resides in the middle of both Star and Snowflake Schemas. Fact Tables contain a single row of data for each factual event logged during the course of business.

The Fact Table contains redundant and repetitious data (usually in Second Normal Form) and is therefore subject to update anamolies. For this reason, Snowflake Schemas should rarely be used for high performance Relational Databases. They should be used, however, to design "Dimensional Databases" and Data Warehouses, where such redundancy allows for extreme performance gains for complex sql queries, especially those containing aggregation functions on hierarchical relationships.

BASIC OPERATIONS ^

This module provides several methods to design Snowflake Schemas.

add_dimension

This method adds a single dimension to a schema.

add_measure

This method adds a single measures to the cubes measure table.

Supported measures inlcude:

    count
    min         [field name]
    max         [field name]
    sum         [field name]
    count       [field name]
    average     [field name]
    product     [field name]
    multi_count [field name]

Here is a description of each measure:

    count
    
    init_value     0
    update_rule    ++
    report_format  integer
    additivity     additive
    declaration    $schema->add_measure('count')
    description    the number of times a dimensional tuple has been inserted into the cube



    min
    
    init_value     undef
    update_rule    = if < or undefined
    report_format  decimal
    additivity     additive
    declaration    $schema->add_measure('min','field')
    description    the minimal value of inserted numbers from 'field'



    max
    
    init_value     undef
    update_rule    = if > or undefined
    report_format  decimal
    additivity     additive
    declaration    $schema->add_measure('max','field')
    description    the maximal value of inserted numbers from 'field'



    sum
    
    init_value     0
    update_rule    +=
    report_format  decimal
    additivity     additive
    declaration    $schema->add_measure('sum','field')
    description    the sum of inserted numbers from 'field'



    product
    
    init_value     1
    update_rule    *=
    report_format  decimal
    additivity     additive (ie separable)
    declaration    $schema->add_measure('product','field')
    description    the multiplication of inserted numbers from 'field'



    average
    
    init_value     0
    update_rule    {average}->{$field}->{sum_total}   += $field_value;
                   {average}->{$field}->{observations}++
    report_format  decimal (sum_total / observations)
    additivity     non-additive
    declaration    $schema->add_measure('average','field')
    description    the average of inserted values from 'field'



    count (distinct) 
    
    init_value     {} (empty hashref)
    update_rule    {count}->{$field}->{$field_value} = undef
    report_format  integer (ie scalar(keys(%{{count}->{$field}->{$field_value}})))
    additivity     non-additive
    declaration    $schema->add_measure('count','field')
    description    the count distinct of inserted values from 'field'



    multi_count (distinct with multiplicity) 
    
    init_value     {} (empty hashref)
    update_rule    {count}->{$field}->{$field_value}++
    report_format  integer (same as count)
    additivity     non-additive
    declaration    $schema->add_measure('multi_count','field')
    description    the count distinct of inserted values from 'field', also stores the *number of times* that $field_value was 'uniquefied'

add_hierarchy

This method adds a single hierarchy to a schema. Hierarchies are like Dimensions, except that aggregate measures will be computed on complete Parent - Child chains.

For example, consider the hierarchy "year", "month", "day" and the measure "sum" of "dollars".

The following code:

  $schema->add_measure('sum','dollars');
  $schema->add_hierarchy('year','month','day');
  
  $cube = DataCube->new($schema);

  # a bunch of data is fed to the cube
  #
  # [...]
  #
  # and some time later:

  $cube->rollup;
  $cube->report;

will create the following reports:

  1.  sum_of_dollars
  2.  sum_of_dollars by year
  3.  sum_of_dollars by year, month
  4.  sum_of_dollars by year, month, day 

as it probably should.

ADVANCED OPERATIONS ^

add_strict_dimension

This method adds a single dimension to a schema, over which no aggregation will be performed.

For example, consider the following code:

  $schema->add_dimension('product');
  $schema->add_strict_dimension('country');

  $schema->add_measure('sum','dollars');
  
  $cube = DataCube->new($schema);

  # a bunch of data is fed to the cube
  #
  # [...]
  #
  # and some time later:

  $cube->rollup;
  $cube->report;

will create the following reports:

  1.  sum_of_dollars by country
  2.  sum_of_dollars by country, product

Notice that the datacube did not produce the sum_of_dollars irrespective of country.

add_strict_hierarchy

This method adds a single hierarchy to a schema. No aggregation will be performed over the top-most dimension.

suppress_lattice_point

This method suppresses specific rollups / reports from being created during a call to rollup, which may lead to a saving of both time and space.

For example, consider the following code:

  $schema->add_measure('sum','dollars');
  $schema->add_hierarchy('year','month','day');
  $schema->suppress_lattice_point('year','month');
  
  $cube = DataCube->new($schema);

  # a bunch of data is fed to the cube
  #
  # [...]
  #
  # and some time later:

  $cube->rollup;
  $cube->report;

will create the following reports:

  1.  sum_of_dollars
  2.  sum_of_dollars by year
  3.  sum_of_dollars by year, month, day 

assert_lattice_point

This method restricts a datacube to only the specified list of dimensions during rollup.

This method superscedes all others except for add_strict_dimension and add_strict_hierarchy, and may lead to a saving of both time and space.

For example, consider the following code:

  $schema->add_measure('sum','dollars');
  $schema->add_hierarchy('year','month','day');
  
  $schema->assert_lattice_point('overall');
  $schema->assert_lattice_point('year','month');
  
  $cube = DataCube->new($schema);

  # a bunch of data is fed to the cube
  #
  # [...]
  #
  # and some time later:

  $cube->rollup;
  $cube->report;

will create the following reports:

  1.  sum_of_dollars
  2.  sum_of_dollars by year, month

If you do this:

  $schema->add_measure('sum','dollars');
  $schema->add_strict_hierarchy('year','month','day');
  
  $schema->assert_lattice_point('overall');
  $schema->assert_lattice_point('year','month');

you will not get the report

  1.  sum_of_dollars

because the method call

  $schema->add_strict_hierarchy('year','month','day');

confines 'year' to always be present.

When in doubt, do not use 'assert_lattice_point' in the presence of the other lattice assertions (such as 'strict' and 'suppress').

confine_to

This method restricts a datacube to only the specified list of dimensions and superscedes all other methods.

The base table becomes fixed to the confined point and no rollup occurs even if called.

For example, consider the following code:

    my $schema = DataCube::Schema->new;
    $schema->add_dimension('country');
    $schema->add_dimension('product');
    $schema->add_dimension('salesperson');
    
    $schema->add_hierarchy('year','quarter','month','day');
    
    $schema->add_measure('sum','units_sold');
    $schema->add_measure('sum','dollar_volume');
    $schema->add_measure('average','price_per_unit');
    
    $schema->confine_to('country','product','year');
    
    my $cube = DataCube->new($schema);

will create a cube with only one table (the base table: 'country','product','year') and only one report:

  1.  sum_of_dollars etc. by country, product, year

EXPORT ^

None

SEE ALSO ^

Wikipedia on Snowflake Schema:

http://en.wikipedia.org/wiki/Snowflake_schema

AUTHOR ^

David Williams, <david@namimedia.com>

COPYRIGHT AND LICENSE ^

Copyright (C) 2009 by David Williams

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

syntax highlighting: