logstatsd - generate summary statistics from log files
    logstatsd [OPTIONS]

    logstatsd -f status:0 -f duration:5 -l /path/to/logfile --xml

    logstatsd -f status:0 -f duration:5 -l /path/to/logfile --xml /path/to/report.xml

    # help message describing options
    logstatsd --help

    # full man page on logstatsd
    logstatsd --help -v

    # for more examples and explanations, see the EXAMPLES section below.
Monitoring an application frequently involves monitoring its log file(s). Log files may contain hundreds or thousands of events per minute. Parsing the entire log file can be a very CPU-intensive task, making near-real-time reporting or monitoring difficult or impossible.
logstatsd was designed to help with these problems while being extremely simple to use and configure. logstatsd can run as a daemon, monitoring entries as they enter the log, and store summary data in memory. logstatsd can then be signaled to export current summary data for populating an RRD or feeding data to a monitoring application.
By default, the log format is assumed to be comma separated values, but an alternate regexp may be specified. Summary statistics (e.g. count) are collected about fields that you find interesting, e.g. transaction name, status, duration, date/time, end user locations, back end server names, etc. For example, if a transaction field is specified, the number of hits for each unique transaction will be counted. If a duration field is defined, then average response times are also reported, broken down by each field specified.
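As a rough illustration of the per-field summaries described above, the following Python sketch counts hits for each unique value of a field and averages a duration column. It is not logstatsd's own code (logstatsd is written in Perl); the `summarize` helper, the sample entries, and the column layout are assumptions made for the example.

```python
import csv
from collections import defaultdict

def summarize(lines, field_index, duration_index):
    """Count hits and average the duration for each unique field value."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for row in csv.reader(lines):
        value = row[field_index]
        counts[value] += 1
        totals[value] += float(row[duration_index])
    return {v: {"count": counts[v], "average": totals[v] / counts[v]}
            for v in counts}

# Hypothetical CSV entries: status in column 0, duration in column 5.
sample = [
    "GOOD,web,sys1,login,x,2.0",
    "GOOD,web,sys1,search,x,4.0",
    "BAD,web,sys2,login,x,9.0",
]
summary = summarize(sample, field_index=0, duration_index=5)
print(summary)
```

Only running totals are kept per value, so memory use depends on the number of distinct field values rather than on the number of log entries.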
Summary data may also be cross tabulated on two or more fields, henceforth referred to as field groupings. For example, if you collect summary statistics about transaction name grouped with the status, information will be collected about the numbers of success and failures of each transaction. If you collect summary statistics about status grouped with time, you could then collect statistics about the successful and unsuccessful transactions per minute.
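A field grouping can be pictured as a counter keyed on a pair of field values. The Python sketch below is only an illustration under assumed data; `cross_tabulate` and the sample entries are hypothetical and do not reflect how Log::Statistics stores groups internally.

```python
import csv
from collections import Counter

def cross_tabulate(lines, first_index, second_index):
    """Count entries for each (first, second) pair of field values."""
    pairs = Counter()
    for row in csv.reader(lines):
        pairs[(row[first_index], row[second_index])] += 1
    return pairs

# Hypothetical CSV entries: status in column 0, transaction in column 3.
sample = [
    "GOOD,web,sys1,login",
    "BAD,web,sys1,login",
    "GOOD,web,sys2,login",
    "GOOD,web,sys2,search",
]
# Group transaction with status: successes and failures per transaction.
groups = cross_tabulate(sample, first_index=3, second_index=0)
print(groups)
```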
Also, thresholds may be defined to categorize response times (see THRESHOLDS section below).
logstatsd is designed to run as a daemon on the server where the log file resides. When run in daemon mode, it will tail the log file and process new entries as they arrive in the log. Summary data may be extracted by sending a "kill -USR1" to the logstatsd process id.
Data can be exported to an xml report, or to a script that can be used to populate an RRD.
logstatsd is designed to parse formatted data in log files. Unlike other log processing tools which run a series of regexps on each log entry and count each match, logstatsd splits each entry into a series of fields using a single regexp. This makes it useful for files like an apache access log or CSV files, but less useful for files with less predictable contents like an apache error log.
The following options are supported by this command:
Specify log file to be summarized.
Specify a field from the log that should be summarized. Multiple field options may be specified. The index for the first column should be 0.
For example, if your file is a csv, and the first column is "status", the field definition would be -field status:0.
If a duration field was specified, thresholds can be associated with the durations (see THRESHOLDS below).
Field names should not contain dashes.
Define two fields which should be grouped for summary statistics. Multiple groups options may be specified.
For example, you might want to keep statistics about each transaction based on status. In this case, you can simply use the options "-groups transaction:status".
Note that order is important for display purposes. transaction:status would display each transaction, and then each status for the transaction. status:transaction will display each status, and then list each transaction with the associated status.
For display purposes, the output will look better when the field with the fewest possible values is listed first.
Log::Statistics will handle groups with any number of members, but at this point logstatsd will only handle groups with two or three fields.
Specify the regexp used to parse the time field, if specified. The regexp should include a single capture expression which, when run on the date field, will return the date and time.
Ideally you should attempt to capture the year, month, day, hour, and minute. Do not capture seconds unless you really want summary data broken down per second.
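To show why the capture should stop at the minute, this Python sketch applies the time regexp used in the sample config file in this document to a few time values and counts entries per captured string. The sample time values and the `minute_of` helper are assumptions for illustration; logstatsd itself applies the regexp in Perl.

```python
import re
from collections import Counter

# Same pattern as the time_regexp in the sample config: it captures
# year/month/day hour:minute and stops before the seconds.
TIME_REGEXP = re.compile(r"(\d\d\d\d\/\d\d\/\d\d\s\d\d\:\d\d)\:")

def minute_of(time_field):
    """Return the captured date and time, or None if the field does not match."""
    match = TIME_REGEXP.search(time_field)
    return match.group(1) if match else None

# Hypothetical contents of the time column.
times = ["2006/01/15 13:45:27", "2006/01/15 13:45:58", "2006/01/15 13:46:03"]
per_minute = Counter(minute_of(t) for t in times)
print(per_minute)
```

Because the seconds fall outside the capture group, the first two entries share one key. Capturing the seconds as well would create a separate bucket for every second.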
Specify regexp used to parse the entire log entry. The regexp should capture each field in the log, which can then be referenced using the usual column number. For a simple silly example,
This would capture the first three comma-separated fields from the log entry, and make them available as column number 0, 1, and 2.
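A pattern of the kind described might look like the one in this Python sketch. The regexp is an assumption written for illustration and is not quoted from logstatsd; `split_entry` is likewise a hypothetical helper.

```python
import re

# Hypothetical line regexp: three capture groups, one per comma-separated field.
LINE_REGEXP = re.compile(r"^([^,]*),([^,]*),([^,]*)")

def split_entry(entry):
    """Return the captured fields as a list indexed from 0, or None if no match."""
    match = LINE_REGEXP.match(entry)
    return list(match.groups()) if match else None

columns = split_entry("GOOD,web,sys1,login,extra")
print(columns)
```

Each capture group becomes one column, so the first group is column 0, the second is column 1, and so on. Anything the regexp does not capture is not available as a field.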
Display version information.
Specify location of config file. A config file is a convenient way to store default information about a type of logfile. For example, create a section called "mylog" that contains your field definitions and time regexp:
    [mylog]
    time_regexp = (\d\d\d\d\/\d\d\/\d\d\s\d\d\:\d\d)\:
    field_list =<<EOF
    status:0
    type:1
    system:2
    transaction:3
    duration:5
    time:7
    EOF
Then, from the command line, simply specify the config file and the section "mylog", and you can reference fields by name without having to specify the column number:
logstatsd -l /path/to/logfile --xml - -f transaction --group transaction:status
Specify section to be read from config file.
Generate an xml file containing all currently captured summary data. If "-" is specified, the xml will be printed to stdout.
Experimental. Generate a shell script of "rrdtool update" commands to update a set of rrd files. RRD files are not generated directly at this time, since the script is much more efficient for transport to a centralized monitoring server and updating rrd files there. If "-" is specified, the rrd commands will be printed to stdout.
Once in daemon mode, RRD commands will be generated using the current time stamp and currently available summary data.
To specify which counters should be used to build rrd files, use the -rrdupdate option.
If the -rrd option is used with --report, and if a "time" field and a time-regexp were both defined, then times will be parsed from the logs, and "rrdtool update" commands will be generated for each minute. Currently this behaviour is only available for the total summary data and not for any defined rrdupdate fields.
Note that currently all RRDs assume that you have defined 4 thresholds. If you define fewer thresholds, your RRDs will be a little larger than necessary. If you define more, your RRDs will only track the first 4.
The following options may be used for offline reports
Read the entire log and generate a single report. Useful for generating offline reports.
May only be used in combination with --report. Specify a series of servers on which the log file resides. For each server, files will be read using ssh and cat (or gzcat if file ends in .gz). The log path must be identical on each server, although globs are allowed in the log path.
May only be used with --servers. Specify a command that should be run on each target server to extract the appropriate records from the log. The string '$logfile' will be replaced with the actual logfile name. Example usage:
--ssh-prefilter='gzgrep -h myTransactionName $logfile'
This can dramatically reduce cpu and bandwidth utilization.
Despite the similar name, this is not to be combined with the --ssh option.
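The '$logfile' substitution can be pictured with the short Python sketch below. How logstatsd actually builds and runs the remote command is not described here, so the `ssh <server> <command>` argument lists, the server names, and both helper functions are assumptions.

```python
def expand_prefilter(prefilter, logfile):
    """Replace the literal string '$logfile' with the actual log path."""
    return prefilter.replace("$logfile", logfile)

def remote_commands(servers, prefilter, logfile):
    """Build one hypothetical ssh argument list per target server."""
    command = expand_prefilter(prefilter, logfile)
    return [["ssh", server, command] for server in servers]

# Hypothetical server names and log path.
commands = remote_commands(
    ["server1", "server2"],
    "gzgrep -h myTransactionName $logfile",
    "/path/to/logfile.gz",
)
for argv in commands:
    print(argv)
```

Filtering on the remote side means only matching records cross the network, which is where the cpu and bandwidth savings come from.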
The following options may be used for near-real-time reporting.
Enable daemon mode. In daemon mode, the log file will be opened in tail mode (using File::Tail). Each new line that arrives in the log file will be processed. Data may be obtained from the running daemon by sending a USR1 signal (kill -USR1 <pid>).
When combined with -r, the entire log file will be read before opening in tail mode. As this is still a bit of a prototype, the log file is actually opened and read, then closed, and then opened again using File::Tail. This leaves a short window where some log entries may not be processed.
Experimental. May only be used with --daemon. Specify the remote server on which the log file lives.
When using this option, you should install Craig H. Rowland's program 'logtail' on the target server, and specify the location using the logtail config param. Using the ssh option without the logtail option may be unstable and is not recommended.
Experimental. Can only be used with -ssh. Specify the path to the logtail program written in C by Craig H. Rowland. From the logtail documentation:
This program will read in a standard text file and create an offset marker when it reads the end. The offset marker is read the next time logtail is run and the text file pointer is moved to the offset location. This allows logtail to read in the next lines of data following the marker. This is good for marking log files for automatic log file checkers to monitor system events.
Note that on the first processing of a new file using logtail, all log entries will be read in and processed. On subsequent restarts, logtail will only process lines not previously seen.
It is recommended that you also define the config param logtail_offset in your config file to specify the location of the offset file created by logtail. If this option is not defined, logtail will create a number of offset files.
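The offset-marker behaviour can be sketched in a few lines of Python. This is a simplified illustration of the idea only, not the logtail program itself: it ignores log rotation, and `read_new_lines` and the file names are hypothetical.

```python
import os
import tempfile

def read_new_lines(log_path, offset_path):
    """Return lines added since the last call, storing the position in offset_path."""
    offset = 0
    if os.path.exists(offset_path):
        with open(offset_path) as marker:
            offset = int(marker.read().strip() or 0)
    with open(log_path, "rb") as log:
        log.seek(offset)          # skip everything already processed
        data = log.read()
        new_offset = log.tell()   # remember where this run stopped
    with open(offset_path, "w") as marker:
        marker.write(str(new_offset))
    return data.decode().splitlines()

workdir = tempfile.mkdtemp()
log_path = os.path.join(workdir, "test.log")
offset_path = os.path.join(workdir, "test.log.offset")

with open(log_path, "w") as log:
    log.write("GOOD,login\n")
first = read_new_lines(log_path, offset_path)   # first run reads the whole file

with open(log_path, "a") as log:
    log.write("BAD,search\n")
second = read_new_lines(log_path, offset_path)  # later runs read only new lines
print(first, second)
```

Keeping the offset file in one known location, as the logtail_offset param does, lets a restarted process find the marker left by the previous run.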
Specify the rrd databases that should be updated when running in daemon mode. Any number of rrdupdate options may be specified.
The fields in this option specify keys used to look up the option in the internal group data. To look up a *field* directly, use the definition "fields|fieldname|fieldvalue". For example, if you specified a field called "status", you can build an RRD from all entries with status "SUCCESS" by using this rrdupdate definition:
In order to track *group* fields (i.e. those specified with -group), use the definition "groups|name1-name2|value1|value2". For example, if you are grouping status by transaction, to build RRDs for all transactions with status FAIL and name mytrans.do, use this:
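One way to read these definitions is as a path of keys into nested summary data. The Python sketch below follows the two patterns given above ("fields|fieldname|fieldvalue" and "groups|name1-name2|value1|value2"); the definition strings are built from those patterns rather than quoted from logstatsd, and the nested dictionary is a hypothetical stand-in for the internal data of Log::Statistics.

```python
def lookup(summary, definition):
    """Walk nested summary data using the '|'-separated keys of a definition."""
    node = summary
    for key in definition.split("|"):
        node = node[key]
    return node

# Hypothetical in-memory summary data; the real structure may differ.
summary = {
    "fields": {
        "status": {"SUCCESS": {"count": 40}, "FAIL": {"count": 2}},
    },
    "groups": {
        "status-transaction": {"FAIL": {"mytrans.do": {"count": 1}}},
    },
}

field_hits = lookup(summary, "fields|status|SUCCESS")
group_hits = lookup(summary, "groups|status-transaction|FAIL|mytrans.do")
print(field_hits, group_hits)
```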
Thresholds allow you to create categories of long response times and report data on those categories. For example, a given transaction might be expected to be complete within 5 seconds. In addition to measuring the average response time of the transaction, you may also wish to measure how many transactions are not completed within 5 seconds. You may define any number of categories, so you could measure those that you consider to be fast (under 3 seconds), good (under 5 seconds), slow (over 10 seconds), and very slow (over 20 seconds).
NOTE: If a duration field was not defined, then response time threshold statistics cannot be calculated.
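The threshold categories amount to counting durations per bucket. The Python sketch below uses thresholds of 5, 10, and 20 seconds. It is an illustration only: `bucket_counts`, the sample durations, and the choice to place a duration exactly equal to a threshold in the higher bucket are all assumptions, since the boundary handling in logstatsd is not specified here.

```python
from bisect import bisect_right

def bucket_counts(durations, thresholds):
    """Count durations per bucket; thresholds must be sorted ascending.

    Bucket 0 holds durations below the first threshold, and the last
    bucket holds durations at or above the final threshold.
    """
    counts = [0] * (len(thresholds) + 1)
    for duration in durations:
        counts[bisect_right(thresholds, duration)] += 1
    return counts

thresholds = [5, 10, 20]
# Hypothetical response times in seconds.
durations = [1.2, 4.9, 5.0, 7.5, 12.0, 25.0]
counts = bucket_counts(durations, thresholds)
print(counts)
```

With three thresholds there are four buckets: under 5 seconds, 5 to 10, 10 to 20, and 20 or more.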
The config file is a simple .ini style config file. Here is an example config file:
    [test]
    time_regexp = (\d\d\d\d\/\d\d\/\d\d\s\d\d\:\d\d)\:
    xml = /Users/wu/tmp/test.xml
    logfile = /Users/wu/projects/logs/test.log.mini
    field_list =<<EOF
    status:0
    type:1
    system:2
    transaction:3
    duration:5
    time:7
    EOF
    rrdupdate =<<EOF
    fields|status|GOOD
    fields|status|BAD
    groups|status-transaction|BAD|mytrans1
    groups|status-transaction|GOOD|mytrans2
    EOF
    rrd_step = 60
    rrd_create =<<EOF
    DS:duration:COUNTER:1200:0:5000
    DS:hits:COUNTER:1200:0:5000
    DS:over1:COUNTER:1200:0:5000
    DS:over2:COUNTER:1200:0:5000
    DS:over3:COUNTER:1200:0:5000
    DS:over4:COUNTER:1200:0:5000
    RRA:AVERAGE:0.5:1:1440
    RRA:AVERAGE:0.5:5:1440
    RRA:AVERAGE:0.5:30:1440
    RRA:AVERAGE:0.5:120:144
    EOF
Most params can be defined in the config file or on the command line. Params on the command line override those in the config file.
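The precedence rule can be summarized with a small Python sketch. The `effective_params` helper and the sample values are hypothetical; the sketch only shows the override order, not how logstatsd reads its options.

```python
def effective_params(config_section, command_line):
    """Start from the config file values, then apply command line values on top."""
    merged = dict(config_section)
    # A value of None stands for an option that was not given on the command line.
    merged.update({k: v for k, v in command_line.items() if v is not None})
    return merged

# Hypothetical values for illustration.
from_config = {"logfile": "/path/to/logfile", "xml": "/path/to/report.xml"}
from_command_line = {"logfile": None, "xml": "-"}

params = effective_params(from_config, from_command_line)
print(params)
```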
    # generate an xml report from a CSV file, column 1 contains status,
    # and column 6 contains duration.  generate an xml report of number
    # of responses and average response time data for each status.
    logstatsd -r -f status:0 -f duration:5 -l /path/to/logfile --xml -

    # generate an xml report from a CSV file, column 1 contains status,
    # and column 6 contains duration.  generate an xml report of number
    # of responses and average response time data for each status,
    # including the number of responses that were under 5 seconds, those
    # that were between 5-10 seconds, 10-20 seconds, and over 20
    # seconds.  the field definition is quoted so that the shell does
    # not treat the "|" characters as pipes.
    logstatsd -r -f 'status:0:5|10|20' -f duration:5 -l /path/to/logfile --xml -

    # generate an xml report from a CSV file.  Column 1 contains status,
    # column 3 contains transaction name, and column 6 contains
    # duration.  generate an xml report of responses for each status,
    # for each transaction, and also break down response data for each
    # transaction based on status.
    logstatsd -r -f transaction:3 -f status:0 -f duration:5 --group status:transaction -l /path/to/logfile --xml -

    # monitor CSV file for new incoming hits.  generate an xml report on
    # "kill -USR1 <logstatsd pid>"
    logstatsd -d -f status:0 -f duration:5 -l /path/to/logfile --xml /path/to/report.xml

    # monitor CSV file for new incoming hits.  generate a script to
    # update an RRD database on receipt of "kill -USR1 <logstatsd pid>"
    logstatsd -d -f status:0 -f duration:5 -l /path/to/logfile --rrd /path/to/rrd_script.sh

    # parse entire CSV file, and then begin monitoring for incoming
    # hits.  generate xml report on completion of full parsing, and then
    # update on each receipt of "kill -USR1 <logstatsd pid>"
    logstatsd -r -d -f status:0 -f duration:5 -l /path/to/logfile --xml /path/to/report.xml
Benchmark - generating stats about long parsing times
File::Tail - for monitoring incoming data in a log
Config::IniFiles - for parsing the logstatsd.conf config file
Log::Log4perl - logging
Log::Statistics - logstatsd comes bundled with Log::Statistics, available from CPAN
Getopt::Long - command line options processing
Pod::Usage - for command line help
There are no known bugs in this script. Please report problems to VVu@geekfarm.org
Patches are welcome.
Copyright (c) 2006, VVu@geekfarm.org All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
- Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
- Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
- Neither the name of geekfarm.org nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.