NAME

Data::Freq - Collects data, counts frequency, and generates a multi-level counting report

VERSION

Version 0.04

SYNOPSIS

    use Data::Freq;
    
    my $data = Data::Freq->new('date');
    
    while (my $line = <STDIN>) {
        $data->add($line);
    }
    
    $data->output();

DESCRIPTION

Data::Freq is an object-oriented module that collects data from log files or any other kind of data source, counts the frequency of particular patterns, and generates a counting report.

See also the command-line tool data-freq.

The simplest usage is to count the lines of a log file by a particular category such as date, username, remote address, and so on.

For more advanced usage, Data::Freq can aggregate counts at multiple levels. For example, lines of a log file can be grouped into months first, and then further grouped into individual days under each month, so that the count for each month equals the sum of the counts for its days.

Analyzing an Apache access log

The example below is copied from the "SYNOPSIS" section.

    my $data = Data::Freq->new('date');
    
    while (my $line = <STDIN>) {
        $data->add($line);
    }
    
    $data->output();

It will generate a report that looks something like this:

    123: 2012-01-01
    456: 2012-01-02
    789: 2012-01-03
    ...

where the left column shows the number of occurrences of each date.

The date/time value is extracted automatically from each log line: the first field enclosed in a pair of brackets [...] is parsed as date/time text by the Date::Parse::str2time() function. (See Date::Parse.)

See also "logsplit" in Data::Freq::Record.
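The extraction step can be pictured roughly as follows. This is only a sketch, not the module's code: it pulls out the first bracketed field with a regular expression and, to stay within core Perl, parses it with Time::Piece rather than Date::Parse::str2time(). The sample log line, the helper name first_bracketed_date, and the strptime format are assumptions made for illustration.

```perl
use strict;
use warnings;
use Time::Piece;

# Hypothetical helper: return the date (YYYY-MM-DD) found in the first
# [...] field of an Apache-style log line, or nothing if there is none.
sub first_bracketed_date {
    my ($line) = @_;

    # Capture the text inside the first pair of brackets.
    my ($stamp) = $line =~ /\[([^\]]+)\]/ or return;

    # Drop the trailing time zone offset so the format below matches exactly.
    $stamp =~ s/\s+[+-]\d{4}\z//;

    my $t = Time::Piece->strptime($stamp, '%d/%b/%Y:%H:%M:%S');
    return $t->ymd;
}

my $line = '127.0.0.1 - user1 [01/Jan/2012:01:02:03 +0900] "GET / HTTP/1.1" 200 512';
my $date = first_bracketed_date($line);
print "$date\n";   # 2012-01-01
```

Because the time zone offset is discarded, the date is reported as written in the log line rather than converted to another zone.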

Multi-level counting

The initialization parameters for the new() method can be customized for a multi-level analysis.

If the field specifications are given, e.g.

    Data::Freq->new(
        {type => 'date'},           # field spec for level 1
        {type => 'text', pos => 2}, # field spec for level 2
    );
    # assuming position 2 (the third field, 0-based)
    # is the remote username.

then the output will look like this:

    123: 2012-01-01
        100: user1
         20: user2
          3: user3
    456: 2012-01-02
        400: user1
         50: user2
          6: user3
    ...

Below is another example along this line:

    Data::Freq->new('month', 'day');
        # Level 1: 'month'
        # Level 2: 'day'

with the output:

    12300: 2012-01
          123: 2012-01-01
          456: 2012-01-02
          789: 2012-01-03
          ...
    45600: 2012-02
          456: 2012-02-01
          789: 2012-02-02
          ...

See "field specification" for more details about the initialization parameters.
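How the month and day levels stay consistent can be seen in a small hand-rolled sketch. It does not use Data::Freq itself; the sample dates and the variable names are assumptions for illustration only.

```perl
use strict;
use warnings;

# Sample records already reduced to YYYY-MM-DD (made-up data).
my @dates = qw(2012-01-01 2012-01-01 2012-01-02 2012-02-01 2012-02-01 2012-02-02);

my (%by_month, %by_day);
for my $day (@dates) {
    my $month = substr $day, 0, 7;   # "2012-01-01" -> "2012-01"
    $by_month{$month}++;             # level 1 count
    $by_day{$month}{$day}++;         # level 2 count under that month
}

# Print the two levels in time-line order.
for my $month (sort keys %by_month) {
    print "$by_month{$month}: $month\n";
    for my $day (sort keys %{ $by_day{$month} }) {
        print "    $by_day{$month}{$day}: $day\n";
    }
}
```

Every record increments exactly one counter per level, so each month's count is the sum of the counts of its days.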

Custom input

The data source is not restricted to log files. For example, a CSV file can be analyzed as below:

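    my $data = Data::Freq->new({pos => 0}, {pos => 1});
    # or more simply, Data::Freq->new(0, 1);
    
    open(my $csv, '<', 'source.csv') or die "Cannot open source.csv: $!";
    
    while (my $line = <$csv>) {
        chomp $line; # keep the newline out of the last column
        $data->add([split /,/, $line]);
    }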

Note: the add() method accepts an array ref, so the input does not have to be split by the default logsplit() function (see "logsplit" in Data::Freq::Record).

For more generic input data, a hash ref can also be given to the add() method.

E.g.

    my $data = Data::Freq->new({key => 'x'}, {key => 'y'});
    # Note: keys *cannot* be abbreviated like Data::Freq->new('x', 'y')
    
    $data->add({x => 'foo', y => 'abc'});
    $data->add({x => 'bar', y => 'def'});
    $data->add({x => 'foo', y => 'ghi'});
    $data->add({x => 'bar', y => 'jkl'});
    ...

In the field specifications, the value of pos or key can also be an array ref, in which case the elements selected by pos or key are joined with a space (or the value of $").

This is useful when a log format contains a date that is not enclosed by a pair of brackets [...].

E.g.

    my $data = Data::Freq->new({type => 'date', pos => [0..3]});
    
    # Log4x with %d{dd MMM yyyy HH:mm:ss,SSS}
    $data->add("01 Jan 2012 01:02:03,456 INFO - test log\n");
    
    # pos 0: "01"
    # pos 1: "Jan"
    # pos 2: "2012"
    # pos 3: "01:02:03,456"

As a result, "01 Jan 2012 01:02:03,456" will be parsed as a date string.
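The joining step can be reproduced in plain Perl. In this sketch a simple whitespace split stands in for the module's own splitting logic, which is an assumption; only the slice-and-join behaviour described above is being illustrated.

```perl
use strict;
use warnings;

my $line = "01 Jan 2012 01:02:03,456 INFO - test log\n";

# Assumption: a plain whitespace split stands in for the module's logsplit.
my @fields = split ' ', $line;

# Interpolating an array slice joins its elements with $"
# (a single space by default).
my $date_text = "@fields[0 .. 3]";

print "$date_text\n";   # 01 Jan 2012 01:02:03,456
```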

Custom output

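The output() method accepts different types of parameters: an open file handle or IO::* object, a callback subroutine ref, and a hash ref of formatting options (see "output" under "METHODS"). The formatting options are described below.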

The output does not include the grand total by default. If the with_root option is set to a true value, the total count will be printed as the first line (level 0), and all the subsequent levels will be shifted to the right.

The transpose option flips the order of the count and the value in each line. E.g.

    2012-01: 12300
        2012-01-01: 123
        2012-01-02: 456
        2012-01-03: 789
        ...
    2012-02: 45600
        2012-02-01: 456
        2012-02-02: 789
        ...

The indent unit (repeated according to the depth of each line) and the separator (between the count and the value) can be customized with the respective options, indent and separator.

The default output format can look ambiguous, because the indent and the padding for alignment both appear as leading spaces.

For example, consider the output below:

    1200000: Level 1
         900000: Level 2
             900000: Level 3
              5: Level 2
    ...

where the second "Level 2" appears to have a deeper indent than the "Level 3."

Although the colons (:) are aligned consistently within each level, the result may look slightly inconsistent.

The indent depth will be clearer if a prefix is added:

    $data->output({prefix => '* '});
    
    * 1200000: Level 1
        *  900000: Level 2
            *  900000: Level 3
        *       5: Level 2
    ...

Alternatively, the no_padding option can be set to a true value to disable the left padding.

    $data->output({no_padding => 1});
    
    1200000: Level 1
        900000: Level 2
            900000: Level 3
        5: Level 2
    ...
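The interplay of indent, prefix, and padding can be reproduced with a few lines of plain Perl. This is a sketch of the layout only, not the module's implementation: the row format [depth, count, value], the four-space indent, and the helper format_row are assumptions chosen to match the sample output above.

```perl
use strict;
use warnings;

# Hypothetical rows: [depth, count, value]; depth 1 is the first field.
my @rows = (
    [1, 1200000, 'Level 1'],
    [2,  900000, 'Level 2'],
    [3,  900000, 'Level 3'],
    [2,       5, 'Level 2'],
);

# Width of the widest count, used to right-align all counts.
my ($width) = sort { $b <=> $a } map { length $_->[1] } @rows;

sub format_row {
    my ($row, %opt) = @_;
    my ($depth, $count, $value) = @$row;
    my $indent = '    ' x ($depth - 1);
    my $prefix = $opt{prefix} // '';
    my $num    = $opt{no_padding} ? $count : sprintf('%*d', $width, $count);
    return "$indent$prefix$num: $value";
}

print format_row($_), "\n" for @rows;                    # padded (default look)
print format_row($_, prefix => '* '), "\n" for @rows;    # with a prefix
print format_row($_, no_padding => 1), "\n" for @rows;   # without padding
```

In the padded form, the last row gets four spaces of indent plus six spaces of padding, which is why it appears deeper than "Level 3" even though it sits one level above it.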

Field specification

Each argument passed to the new() method is passed on to Data::Freq::Field->new(). (See "new" in Data::Freq::Field.)

For example,

    Data::Freq->new(
        'month',
        'day',
    );

is equivalent to

    Data::Freq->new(
        Data::Freq::Field->new('month'),
        Data::Freq::Field->new('day'),
    );

and because of the way the argument is interpreted by the Data::Freq::Field class, it is also equivalent to

    Data::Freq->new(
        Data::Freq::Field->new({type => 'month'}),
        Data::Freq::Field->new({type => 'day'}),
    );

If the type parameter is either text or number, the results are sorted by count in descending order by default (i.e. the most frequent value first).

For the date type, the sort parameter defaults to value, and the order parameter defaults to asc (i.e. the time-line order).
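The two default orderings correspond to the following plain Perl sorts. The hashes and their contents are made-up sample data, and the sketch only mirrors the ordering rules stated above, not how the module stores its counts.

```perl
use strict;
use warnings;

# Made-up counts keyed by value.
my %user_count = (alice => 3, bob => 100, carol => 20);
my %date_count = ('2012-01-02' => 456, '2012-01-01' => 123, '2012-01-03' => 789);

# text/number default: by count, descending (most frequent first).
my @users = sort { $user_count{$b} <=> $user_count{$a} } keys %user_count;

# date default: by value, ascending (time-line order).
my @dates = sort { $a cmp $b } keys %date_count;

print "$user_count{$_}: $_\n" for @users;   # bob, carol, alice
print "$date_count{$_}: $_\n" for @dates;   # 2012-01-01, 2012-01-02, 2012-01-03
```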

Frequency tree

Once all the data have been collected with the add() method, a frequency tree will have been constructed internally.

Suppose the Data::Freq instance is initialized with the two fields as below:

    my $field1 = Data::Freq::Field->new({type => 'month'});
    my $field2 = Data::Freq::Field->new({type => 'text', pos => 2});
    my $data = Data::Freq->new($field1, $field2);
    ...

Then a result tree like the one below is constructed as each data record is added:

     Depth 0            Depth 1             Depth 2
                        $field1             $field2

    {432: root}--+--{123: "2012-01"}--+--{10: "user1"}
                 |                    +--{ 8: "user2"}
                 |                    +--{ 7: "user3"}
                 |                    ...
                 +--{135: "2012-02"}--+--{11: "user3"}
                 |                    +--{ 9: "user2"}
                 |                    ...
                 ...

In the diagram, a node is represented by a pair of braces {...}, and each integer value is the total number of occurrences of the node value, under its parent category.

The root node maintains the grand total of records that have been added.
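The way such a tree grows can be sketched with nested hashes. The node layout (count, value, children keyed by value) and the helpers new_node and add_record are hypothetical; the real nodes are Data::Freq::Node objects and may be organized differently.

```perl
use strict;
use warnings;

# Hypothetical node layout: a count, a value, and child nodes keyed by value.
sub new_node {
    my ($value) = @_;
    return { count => 0, value => $value, children => {} };
}

my $root = new_node('root');

# Add one record: bump the root, then one node per level along the path.
sub add_record {
    my @path = @_;              # e.g. ('2012-01', 'user1')
    my $node = $root;
    $node->{count}++;
    for my $value (@path) {
        $node = $node->{children}{$value} //= new_node($value);
        $node->{count}++;
    }
}

add_record(@$_) for (
    ['2012-01', 'user1'], ['2012-01', 'user1'], ['2012-01', 'user2'],
    ['2012-02', 'user3'],
);

print "$root->{count}\n";                                        # 4
print "$root->{children}{'2012-01'}{count}\n";                   # 3
print "$root->{children}{'2012-01'}{children}{user1}{count}\n";  # 2
```

Each added record increments the root and exactly one node at every depth, so the root count is the grand total and every node's count is the number of records under its parent that share its value.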

The tree structure can be recursively visited by the traverse() method.

Below is an example that generates an HTML list:

    print qq(<ul>\n);
    
    $data->traverse(sub {
        my ($node, $children, $recurse) = @_;
        
        my ($count, $value) = ($node->count, $node->value);
            # HTML-escape $value if necessary
        
        print qq(<li>$count: $value);
        
        if (@$children > 0) {
            print qq(\n<ul>\n);
            
            for my $child (@$children) {
                $recurse->($child); # invoke recursion
            }
            
            print qq(</ul>\n);
        }
        
        print qq(</li>\n);
    });
    
    print qq(</ul>\n);
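For the HTML-escaping step mentioned in the comment, a minimal helper might look like the sketch below. The name escape_html and the set of escaped characters are assumptions; a dedicated module such as HTML::Entities could be used instead.

```perl
use strict;
use warnings;

# Characters that need escaping in HTML text and attribute values.
my %entity = (
    '&' => '&amp;',
    '<' => '&lt;',
    '>' => '&gt;',
    '"' => '&quot;',
    "'" => '&#39;',
);

# Hypothetical helper: return a copy of $text that is safe to embed in HTML.
sub escape_html {
    my ($text) = @_;
    $text =~ s/([&<>"'])/$entity{$1}/g;
    return $text;
}

print escape_html('<b>user & "guest"</b>'), "\n";
# &lt;b&gt;user &amp; &quot;guest&quot;&lt;/b&gt;
```

Inside the callback, $value would be passed through such a helper before being printed.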

METHODS

new

Usage:

    Data::Freq->new($field1, $field2, ...);

Constructs a Data::Freq object.

The arguments $field1, $field2, etc. are instances of Data::Freq::Field, or any valid arguments that can be passed to "new" in Data::Freq::Field.

The actual data to be analyzed must be added one record at a time with the add() method.

The Data::Freq object maintains the counting results based on the specified fields. The first field ($field1) groups the added data into major categories. The next field ($field2) defines the sub-categories under each major group. Any further fields are interpreted recursively as sub-sub-categories, and so on.

If no fields are given to the new() method, one field of the text type will be assumed.

add

Usage:

    $data->add("A record");
    
    $data->add("A log line text\n");
    
    $data->add(['Already', 'split', 'data']);
    
    $data->add({key1 => 'data1', key2 => 'data2', ...});

Adds a record that increments the counting by 1.

The interpretation of the input depends on the type of fields specified in the new() method. See "evaluate_record" in Data::Freq::Field.

output

Usage:

    # I/O
    $data->output();      # print results (default format)
    $data->output(\*OUT); # print results to open handle
    $data->output($io);   # print results to IO::* object
    
    # Callback
    $data->output(sub {
        my $node = shift;
        # $node is a Data::Freq::Node instance
    });
    
    # Options
    $data->output({
        with_root  => 0   , # if true, prints total at root
        transpose  => 0   , # if true, prints values before counts
        indent     => '  ', # repeats (depth - 1) times
        separator  => ': ', # separates the count and the value
        prefix     => ''  , # prepended before the count
        no_padding => 0   , # if true, disables padding for the count
    });
    
    # Combination
    $data->output(\*STDERR, {opt => ...});
    $data->output($open_fh, {opt => ...});

Generates a report of the counting results.

If no arguments are given, the results are printed to STDOUT in the default format. Any open handle or an instance of IO::* can be passed as the output destination.

If the argument is a subroutine ref, it is regarded as a callback that will be called for each node of the frequency tree in depth-first order. (See "frequency tree" for details.)

The callback receives the current node ($node, a Data::Freq::Node instance) as its argument.

traverse

Usage:

    $data->traverse(sub {
        my ($node, $children, $recurse) = @_;
        
        # Do something with $node before its child nodes
        
        # $children is a sorted list of child nodes,
        # based on the field specification
        for my $child (@$children) {
            $recurse->($child); # invoke recursion
        }
        
        # Do something with $node after its child nodes
    });

Provides a way to traverse the result tree with more control than the output() method.

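A callback must be passed as an argument. It will be called with three arguments: $node (the current node), $children (an array ref of its child nodes, sorted according to the field specification), and $recurse (a subroutine ref that continues the traversal into a given child node).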

When the traverse() method is called, the root node is passed as the $node parameter first. No recursion takes place automatically: the traversal descends only when the callback explicitly invokes the $recurse subroutine for the child nodes.
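The control this gives the callback can be illustrated with a stand-alone sketch. The hand-built tree of plain hashes and the traverse function below are assumptions that imitate the calling convention described above; they are not the module's implementation.

```perl
use strict;
use warnings;

# Hand-built tree; each node is { count, value, children => [...] }
# (a hypothetical layout standing in for Data::Freq::Node objects).
my $tree = {
    count => 4, value => 'root', children => [
        { count => 3, value => '2012-01', children => [
            { count => 2, value => 'user1', children => [] },
            { count => 1, value => 'user2', children => [] },
        ] },
        { count => 1, value => '2012-02', children => [
            { count => 1, value => 'user3', children => [] },
        ] },
    ],
};

# Call $callback for $node; descent happens only if the callback asks for it.
sub traverse {
    my ($node, $callback) = @_;
    my $recurse;
    $recurse = sub {
        my ($n) = @_;
        $callback->($n, $n->{children}, $recurse);
    };
    $recurse->($node);
}

my @seen;
traverse($tree, sub {
    my ($node, $children, $recurse) = @_;
    push @seen, $node->{value};
    return if $node->{value} eq '2012-02';   # prune: do not descend here
    $recurse->($_) for @$children;
});

print "@seen\n";   # root 2012-01 user1 user2 2012-02
```

Because the callback returns early for "2012-02", the node "user3" is never visited, which is the kind of pruning that a fixed depth-first walk cannot express.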

root

Returns the root node of the frequency tree. (See "frequency tree" for details.)

The root node is created during the new() method call, and maintains the total number of added records and a reference to its direct child nodes for the first field.

fields

Returns an array ref of the fields (Data::Freq::Field instances).

The returned array should not be modified.

AUTHOR

Mahiro Ando, <mahiro at cpan.org>

BUGS

Please report any bugs or feature requests to bug-data-freq at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Freq. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Data::Freq

You can also look for information at:

ACKNOWLEDGEMENTS

LICENSE AND COPYRIGHT

Copyright 2012 Mahiro Ando.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
