ApacheLog::Compressor - convert Apache / CLF log files into a binary format for transfer
version 0.005
    use ApacheLog::Compressor;
    use IO::Compress::Bzip2;
    use Sys::Hostname qw(hostname);

    # Write all data to a bzip2-compressed output file
    open my $out_fh, '>', 'compressed.log.bz2' or die "Failed to create output file: $!";
    binmode $out_fh;
    my $zip = IO::Compress::Bzip2->new($out_fh, BlockSize100K => 9);

    # Provide a callback to send data through to the file
    my $alc = ApacheLog::Compressor->new(
        on_write => sub {
            my ($self, $pkt) = @_;
            $zip->write($pkt);
        }
    );

    # Input file - normally use whichever one's just been closed + rotated
    open my $fh, '<', '/var/log/apache2/access.log.1' or die "Failed to open log: $!";

    # Initial packet to identify which server this came from
    $alc->send_packet('server',
        hostname => hostname(),
    );

    # Read and compress all the lines in the file
    while(my $line = <$fh>) {
        $alc->compress($line);
    }
    close $fh or die $!;
    $zip->close;

    # Dump the stats in case anyone finds them useful
    $alc->stats;
Converts data from standard Apache log format into a binary stream which is typically 20% to 60% of the size of the original file. Intended for cases where log data needs transferring from multiple high-volume servers for analysis (potentially in realtime via tail -f).
The compressed format uses simple dictionary replacement: each field that cannot be represented in a fixed-width datatype is replaced with an index. This keeps the basic log line packet at a fixed size, with additional packets carrying the first instance of each variable-width data item.
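The dictionary replacement idea can be sketched in a few lines of Perl. This is a minimal illustration, not the module's implementation: index_for_value, %index_for and @emitted are hypothetical names, and the "packets" are plain arrayrefs rather than the binary packets the module writes.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the ApacheLog::Compressor API:
# map each distinct string to a small integer index per type.
my %index_for;   # type => { value => index }
my @emitted;     # dictionary entries that would be sent, in order

sub index_for_value {
    my ($type, $value) = @_;
    my $seen = $index_for{$type} //= {};
    unless (exists $seen->{$value}) {
        # First sighting: allocate the next index and record a dictionary entry
        my $idx = keys %$seen;
        $seen->{$value} = $idx;
        push @emitted, [ $type, $idx, $value ];
    }
    return $seen->{$value};
}

# The repeated URL reuses its index and produces no second dictionary entry
my @ids = map { index_for_value(url => $_) }
    qw(/api/status.json /index.html /api/status.json);
print "indices: @ids, dictionary entries: ", scalar(@emitted), "\n";
```

Only the first occurrence of each value costs its full length; every later occurrence costs one fixed-width index.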
Example:
api.example.com 105327 123.15.16.108 - apiuser@example.com [19/Dec/2009:03:12:07 +0000] "POST /api/status.json HTTP/1.1" 200 80516 "-" "-" "-"
The duration, IP, timestamp, method, HTTP version, response and size can all be stored as 32-bit quantities (or smaller), without losing any information. The vhost, user and URL are extracted to separate packets, since we expect to see them at least twice on a typical server.
This would be converted to:
vhost packet - api.example.com assigned index 0
user packet - apiuser@example.com assigned index 0
url packet - /api/status.json assigned index 0
timestamp packet - since a busy server is likely to have several requests a second, there's a tiny saving to be had by sending this only when the value changes, so we push this into a separate packet as well.
log packet - actual data, binary encoded.
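As an illustration of why fields such as the IP address fit into 32 bits without loss, here is a small round trip using Perl's core pack and unpack. The helper names ip_to_u32 and u32_to_ip are hypothetical and are not taken from the module.

```perl
use strict;
use warnings;

# Hypothetical helpers: dotted-quad IPv4 text to a 32-bit integer and back
sub ip_to_u32 { return unpack('N', pack('C4', split(/\./, $_[0]))) }
sub u32_to_ip { return join('.', unpack('C4', pack('N', $_[0]))) }

my $n    = ip_to_u32('123.15.16.108');
my $text = u32_to_ip($n);

# The packed form is 4 bytes, against up to 15 bytes of text
printf "%s -> %u -> %s (%d bytes packed)\n",
    '123.15.16.108', $n, $text, length(pack('N', $n));
```

The conversion is reversible, so the original text can be rebuilt exactly when the stream is expanded.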
The following packet types are available:
00 - Log entry
01 - Change server
02 - timestamp
03 - vhost
04 - user
05 - useragent
06 - referer
07 - url
80 - reset
The log entry itself normally consists of the following fields:
    N vhost
    N time
    N IP
    N user
    N useragent
    N timestamp
    C method
    C version
    n response
    N bytes
    N url
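To make the fixed-size claim concrete, the field list above can be expressed as a single pack template. This is a sketch under assumptions: the method and version codes are hypothetical numbering, the timestamp is an illustrative epoch value, and any packet header the module adds is not included.

```perl
use strict;
use warnings;

# Template derived from the field list: six N fields, two C, one n, two N
my $template = 'N6 C2 n N2';

my @fields = (
    0,                                           # vhost index
    105327,                                      # time (duration in the example line)
    unpack('N', pack('C4', 123, 15, 16, 108)),   # IP as a 32-bit integer
    0,                                           # user index
    0,                                           # useragent index
    1261192327,                                  # timestamp (illustrative epoch seconds)
    1,                                           # method code (hypothetical numbering)
    1,                                           # version code (hypothetical numbering)
    200,                                         # response
    80516,                                       # bytes
    0,                                           # url index
);

my $packed     = pack($template, @fields);
my @round_trip = unpack($template, $packed);
printf "%d bytes for %d fields\n", length($packed), scalar(@round_trip);
```

With this template every log entry body occupies 36 bytes regardless of how long the vhost, user or URL strings are.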
The format of the log file can be customised; see the next section for details.
A custom format can be provided as the format parameter when instantiating a new ApacheLog::Compressor object via ->new. This format consists of an arrayref of key/value pairs, each value holding the following information:
format
id - the ID to use when sending packets
type - pack format specifier used when storing and retrieving the data, such as N1 or n1. Without this, there will be no entry for the item in the compressed log stream.
regex - the regular expression used for matching this part of the log file. The final regex will be the concatenation of all regex entries for the format, joined using \s+ as the delimiter.
process_in - coderef for converting incoming values from a plain text log source into compressed values. Receives $self (the current ApacheLog::Compressor instance) and $data (the current hashref containing the raw data).
process_out - coderef for converting values from a compressed source back to plain text. Receives $self (the current ApacheLog::Compressor instance) and $data (the current hashref containing the raw data).
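The following sketch shows what a format definition of this shape might look like and how the per-field regexes combine when joined with \s+. The three field definitions are hypothetical and are not the module's default format. The sketch builds and applies the regex itself rather than calling the module, and it assumes that process_in updates $data in place, which the description above does not confirm.

```perl
use strict;
use warnings;

# Hypothetical format: an arrayref of key => value pairs, kept in order
my $format = [
    vhost    => { id => 0x03, type => 'N1', regex => qr/(\S+)/ },   # id value assumed
    duration => { type => 'N1', regex => qr/(\d+)/ },
    ip       => {
        type       => 'N1',
        regex      => qr/(\d+\.\d+\.\d+\.\d+)/,
        # Assumption: the callback rewrites the value in the hashref
        process_in => sub {
            my ($self, $data) = @_;
            $data->{ip} = unpack('N', pack('C4', split(/\./, $data->{ip})));
        },
    },
];

# Collect keys and regexes in order, then join the regexes with \s+
my (@keys, @regexes);
for (my $i = 0; $i < @$format; $i += 2) {
    push @keys,    $format->[$i];
    push @regexes, $format->[$i + 1]{regex};
}
my $line_re = join('\s+', @regexes);

my $sample   = 'api.example.com 105327 123.15.16.108';
my @captures = $sample =~ /^$line_re/;
die "line did not match" unless @captures;

my %data;
@data{@keys} = @captures;

# Apply any process_in callbacks; no module instance exists in this sketch
my %spec = @$format;
for my $key (@keys) {
    my $code = $spec{$key}{process_in} or next;
    $code->(undef, \%data);
}
print join(', ', map { "$_=$data{$_}" } @keys), "\n";
```

Keeping the format as an ordered list rather than a plain hash preserves field order, which matters because the order determines both the combined regex and the packed layout.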
Instantiate the class.
Takes the following named parameters:
on_write - coderef to call with packet data for each outgoing packet
Returns the default format used for parsing log lines.
This is an arrayref containing key => value pairs, see "FORMAT SPECIFICATION" for more details.
Refresh the mapping from format keys and internal definitions.
Returns the index for the given type and value, generating a packet if no previous value was found.
Read a value from the cache, for expanding compressed log format entries.
Set a cache index key to a value when expanding a packet stream.
General compression function. Given a line of data, sends packets as required to transmit that information.
Generate and send a packet for the given type.
Generate a reset packet and clear internal caches in the process.
Generate a server packet.
Generate the timestamp packet.
Write a packet to the output handler.
Expand incoming data.
Handle an incoming reset packet.
Handle an incoming log packet.
Convert logline data to a hashref.
Internal method for converting the current log entry to a text string in something approaching the 'standard' Apache log format (almost, but not quite, CLF).
Internal method for processing a server record (used to indicate the server name subsequent records apply to).
Internal method for processing a timestamp entry.
Internal method for invoking an event.
Print current stats - not all that useful since we clear cached values regularly.
Tom Molesworth <cpan@entitymodel.com>
Copyright Tom Molesworth 2009-2011. Licensed under the same terms as Perl itself.