The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

EXAMPLE ETL SETUP

In this example, we will unzip, validate and load free five minute price files from the [http://www.electricityinfo.co.nz/comitFta/ftapage.main WITS free to air site]. As noted in the introduction, we try and move as much of the application functionality to the framework as possible. The ETLp framework will execute each phase one after the other.

MODES

The Pipeline has two modes - iterative and serial specified with the type parameter.

Iterative Mode

In the iterative mode, the framework will iterate over a number of files, calling each phase in turn, once for each file.

Serial Mode

With the serial mode, each phase is executed once.

EXAMPLE: LOADING THE WITS FREE TO AIR FIVE MINUTE PRICE FILES

In this example, because we are processing a series of files, we wll be running an iterative pipeline. We start by creating a config file:

    <five_min_prices>
        type               = iterative

        <config>
            filename_format    = (5_minute_prices.*\.csv)(?:\.gz)?$
            incoming_dir       = data/incoming
            archive_dir        = data/archive
            fail_dir           = data/fail
            table_name         = five_minute_prices
            controlfile_dir    = conf/control
            controlfile        = five_minute_prices.ctl
            on_error           = die
            next               = wits price_datamart
        </config>

        <pre_process>
            <item>
                name     = decompress
                type     = gunzip
            </item>
            <item>
                name        = validate price file
                type        = validate
                file_type   = csv
                skip        = 1
            </item>
        </pre_process>
        <process>
            <item>
                name = load price file
                type = csv_loader
                skip = 1
            </item>
            <item>
                name = price_proc
                type = plsql
                <parameters>
                    name  = filename
                    value = %basename(filename)%
                </parameters>
                <parameters>
                    name  = message
                    value = %message%
                </parameters>
            </item>
        </process>
        <post_process>
            <item>
                name=compress price file
                type=gzip
            </item>
        </postprocess>
    </five_min_prices>

The file uses the Apache-style syntax. It is similar to XML in that it allows nested sections, but parameters are specified using name value pairs. Our outmost tags, "five_min_prices" denote a section within the configuration file. If we grabbed more than one file type from the WITS site, we would add each type to its own section within the configuration file

The file is divided into two categories:

  • The configuration section (config) sets up our environment.

  • The processing sections do the actual processing.

The Environment Configuration

The Standard ETLp gzip, gunzip, validation and load functions require specific configuration values:

  • filename_format is a reqular expression that will be used to find the files we wish to load. The file may have a .gz extension.

    • Note the parentheses around (5_minute_prices.*\.csv). The canonical name of the file (the name when all other extraneous information such as the compressed extension) must be specified within parenthesis. This applies even to files whether there is no transformation of the name.

  • incoming_dir specifies where the files will be found.

    • If the path is relative (i.e. no leading /), then the path will be relative to the application root directory - i.e. one directory above the etlp script.

    • If the path has a leading slash, then it is an absolute specification, and the data does not have top reside under the application tree.

  • archive_dir specifies where successfully loaded files will be moved to.

  • fail_dir is where files that fail processing are moved to.

  • table_name defines the table that the data will be loded into.

  • controlfile is the name of the definition file that specifies the definition of the source files to be processed. It can also have rules for each field that will define allowable values.

  • controlfile_dir is the directrory where the controlfile is found.

  • on_error specifes a default action to be taken whenever an error is encountered. This can be over-ridden at the individual processing item level. The allowable values are

    • die: stop all processing.

    • skip: stop futher processing of the current file, and move onto the next one.

    • ignore: ignore the error and continue with further processing.

Our Controlfile

This is the controlfile we use to specify the file definition:

    grid_point     N varchar(8)
    trading_date   N date(%d/%m/%Y)
    trading_period N integer;range(1,50)
    market_time    N qr/^(20|21|22|23|[01]\d|\d)(([:][0-5]\d){1,2})$/
    price          N float;range(0,)
    island         N varchar(2)
    area           N varchar(2)
    market_flag    N varchar(1)
    runtime        N date(%d/%m/%Y %H:%M:%S)

The first field defines the field name. The second field specifies whether the field is nullable or not. The third field specifies the validation rules. There can be multiple rules per field, with each rule separated by a semi-colon.

  • grid_point. Can contain up to 8 characters.

  • trading_date. A date, using the specified POSIX date format.

  • trading_period. (1) An integer. (2) The minimum value is 1 and maximum value is 50.

  • market_time. A valid time using a 24 hour clock. Note if none of the predefined validations suit our needs, we can defined custom regular expressions to validate the data.

  • price. (1) A floating point number. (2) The minimum value is 0 and there is no maximum value.

  • island. A two character field.

  • area. A two character field.

  • market_flag. A single character field.

  • run_time. A date, using the specified POSIX date format.

The field names should match the corresponding column names in the table. The table will require one extra integer field called file_id. This allows an individual file to be tracked throughout it's rocessing history.

Validation Errors

If any records fail validation, the errors are logged in the audit tables and are also emailed:

    ERROR - Error processing /home/etl/data/incoming/5_minute_prices_WWD1103_20100609.csv: 5_minute_prices_WWD1103_20100609.csv failed validation:
    Line number: 13
           field name:island
           field value:NNI
           error:Length must be less than or equal to 2 characters
    Line number: 30
           field name:trading_date
           field value:09/13/2010
           error:Invalid date for pattern: %d/%m/%Y

The Processing Configuration

All processing belongs to one of three phases, where each phase can consist of one or more items, which are run sequentially. Not all phases are required.

Pre Process Phase

  • name: decompress

    • type: gunzip

    • description: Decompress the current file if it is gzipped.

  • name: validate price file

    • type: validate

    • description: Validate the file against the file rules (see above).

Process Phase

  • name: load price file

    • type: csv_loader

    • description: Loads the data into the table.

  • name: price_proc

    • type: plsql

    • description: An Oracle stored procedure

Post process Phase

  • name: compress

    • type: gzip

    • description: The processed file is compressed using the gzip algorithm.

Notes

  • The validation and csv_loader items both have the setting "skip = 1". This is because the data files have a header record and we skip over one line.

  • If csv_loader completes successfully, the file is moved to the archive_dir.

  • If any of the processing steps raise an error, the file is moved to the fail_dir,

  • The PL/SQL call is the only invocation of custom code. Everything else is handled automatically by the framework.

  • The framework removes the burden of capturing the results and audit information from the developer. It records when processes are run, how long they took and what the results were. Any errors are captured and recorded and can be viewed in the audit browser. Any errors or warnings are automatically mailed to the email address defined in the environment configuration file.

INVOKING THE ETL PROCESS

Invoking the ETL is simple. Call the following from either a scheduler or the command line, making sure the OS environment (ORACLE HOME, etc) is configured:

    etlp <config_file> <section>

e.g.

    etlp wits five_min_prices

CHAINING PIPELINES

The next parameter in the top section of the configuration specifies the next pipeline to be called on the completion of the current one. i.e.

    next = wits price_datamart

will invoke the call

    etlp wits price_datamart