<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
  <head>
    <title>BuzzSaw - Design</title>
  </head>

  <body>
    <h1>BuzzSaw - Design</h1>

    <p>The following sections give a high-level overview of the design
    of the BuzzSaw log processing framework. The implementation is
    based on the <a href="intro.html#philosophy">design philosophy</a>
    described in the introductory section of the documentation.</p>

    <p>The entire BuzzSaw system reduces to two specific tasks:
    importing data and generating reports. The whole system revolves
    around a central database in which all the necessary data is
    stored.</p>

    <h2>The Database</h2>

    <p>All events of interest are stored in the database. The decision
    was made to use the PostgreSQL server because of its excellent
    feature set, reliability and scalability. It was clear from the
    outset that there would be the potential to eventually store a
    very large number of log messages (and associated derived data) so
    scalability and speed are of particular concern.</p>

    <p>A full description of the database schema is given
    elsewhere. The high-level view is that each log message of
    interest is recorded as an <em>event</em>. Associated with
    each <em>event</em> is a set of zero or more <em>tags</em> and
    zero or more pieces of <em>extra_info</em>. An <em>event</em> is
    split into fields representing the date/time, hostname, user,
    program, process ID of the program and the full message. Tags are
    simple labels applied to an event (e.g. <code>auth_failure</code>)
    whereas extra information entries have both an arbitrary name and
    a value (e.g. <code>source_address</code>). For speed, many of these
    fields and combinations of fields are indexed to improve query
    times.</p>

    <p>The BuzzSaw interface to the database (see
    the <code>BuzzSaw::DB</code> module for full details) is built
    using the Perl <code>DBIx::Class</code> object-relational
    mapper, which makes it straightforward to express complex
    queries. In a few performance-critical parts of the code base raw
    SQL statements are used via the standard DBI module, but only
    where absolutely essential.</p>
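    <p>As an illustration, a search for recent events carrying a
    particular tag might look something like the following sketch. The
    schema class name, result source and column names here are
    assumptions based on the description above rather than the real
    BuzzSaw code.</p>

<pre>
# A minimal sketch of querying for events via DBIx::Class. The
# schema class, result source and column names are assumptions,
# not the actual BuzzSaw::DB definitions.

use BuzzSaw::DB::Schema;    # hypothetical schema class name

my $schema = BuzzSaw::DB::Schema->connect('dbi:Pg:dbname=buzzsaw');

# Find all auth_failure events from the last day, newest first.
my $events = $schema->resultset('Event')->search(
    {
        'tags.name'  => 'auth_failure',
        'me.logtime' => { '>=' => \"now() - interval '1 day'" },
    },
    {
        join     => 'tags',
        order_by => { -desc => 'me.logtime' },
    },
);

while ( my $event = $events->next ) {
    printf "%s %s %s\n", $event->logtime, $event->hostname,
                         $event->message;
}
</pre>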

    <p>The implementation of various internal processes relies on
    PostgreSQL functions and triggers which means that BuzzSaw is
    currently only going to work with PostgreSQL. Having said that,
    porting those features to the procedural language of another
    database engine would probably not take much work should the need
    arise.</p>

    <h2>Importing</h2>

    <p>The import process is driven by
    the <code>BuzzSaw::Importer</code> Perl module. The import process
    reads through the log messages from each data source. If an event
    has not previously been stored in the database then it will be
    parsed and the event data will be put through the stack of
    filters. If any filter declares an interest in an event then it
    will be stored at the end of the process. Additionally, any filter
    can attach tags and associated extra information even if it does
    not declare an interest in the event being stored.</p>
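
    <p>In outline, the import process looks something like the
    following sketch. The method names used here
    (<code>next_entry</code>, <code>seen_event</code> and so on) are
    illustrative assumptions, not the actual BuzzSaw API.</p>

<pre>
# Illustrative outline of the import loop; the method names are
# assumptions used to show the flow of control, not the real
# BuzzSaw::Importer interface.

for my $source (@sources) {
    while ( defined( my $line = $source->next_entry ) ) {

        # Skip anything already recorded in the database.
        next if $db->seen_event($line);

        my %event = $parser->parse($line);

        my ( $store, @tags );
        for my $filter (@filters) {
            # Each filter votes on the event and may attach tags
            # and extra information.
            my ( $vote, @new_tags ) = $filter->check( \%event );
            $store ||= $vote;
            push @tags, @new_tags;
        }

        $db->store_event( \%event, \@tags ) if $store;
    }
}
</pre>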

    <h3>Data Sources</h3>

    <p>The importer process can have any number of data sources. A
    data source is any implementation of
    the <code>BuzzSaw::DataSource</code> Moose role. The data source
    is required to deliver log messages one at a time to the importer
    process.</p>
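
    <p>A new data source would therefore be a Moose class which
    consumes the role, along the lines of this sketch. The
    <code>next_entry</code> method name is an assumption for
    illustration; consult the <code>BuzzSaw::DataSource</code> role
    for the methods it actually requires.</p>

<pre>
package BuzzSaw::DataSource::Example;    # hypothetical module
use Moose;

with 'BuzzSaw::DataSource';

has 'messages' => (
    is      => 'ro',
    isa     => 'ArrayRef[Str]',
    default => sub { [] },
);

# Deliver log messages one at a time to the importer, returning
# undef when the source is exhausted. The method name is an
# assumption, not necessarily what the role requires.
sub next_entry {
    my ($self) = @_;
    return shift @{ $self->messages };
}

__PACKAGE__->meta->make_immutable;
1;
</pre>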

    <p>Currently there is only
    the <code>BuzzSaw::DataSource::Files</code> Perl module. This
    module can search through a hierarchy of directories and find
    files which match a POSIX or Perl regular expression. As well as
    standard text files, it supports opening files which are
    compressed with gzip or bzip2. When a file is opened a lock is
    recorded in the database to avoid multiple processes working on
    the same data concurrently. When the reading of a file has
    completed the name is recorded in the database along with the
    SHA-256 checksum of the file contents. This helps avoid
    reprocessing files which have been seen previously.</p>
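
    <p>Configuring the Files data source might look roughly like
    this; the attribute names (<code>parser</code>,
    <code>directories</code>, <code>match</code>) are assumptions
    based on the description above.</p>

<pre>
# Hypothetical configuration of the Files data source; the
# attribute names are assumptions, see the module itself for the
# real interface.

use BuzzSaw::DataSource::Files;
use BuzzSaw::Parser::RFC3339;

my $source = BuzzSaw::DataSource::Files->new(
    parser      => BuzzSaw::Parser::RFC3339->new,
    directories => ['/var/log/remote'],
    match       => qr/^syslog(\.\d+)?(\.(gz|bz2))?$/,
);
</pre>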

    <h3>Parsing</h3>

    <p>Each data source requires a parser module which implements
    the <code>BuzzSaw::Parser</code> Moose role. The parser module is
    used to split a log entry into separate parts, e.g. date, program,
    pid, message. Mostly this is a matter of handling the particular
    date/time format used in the log entry. The
    parser module is called on every log message so it is expected to
    be fast.</p>

    <p>
    Currently there is only the <code>BuzzSaw::Parser::RFC3339</code>
    Perl module. This handles date/time stamps which are formatted
    according to the guidelines in RFC3339 (e.g. looks
    like <code>2013-03-28T11:57:30.025350+00:00</code>).
    </p>
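
    <p>For example, a timestamp in that style can be broken into its
    parts with a regular expression along these lines. This is just a
    sketch of the idea, not the actual implementation.</p>

<pre>
# Sketch of decomposing an RFC3339 timestamp; not the real
# BuzzSaw::Parser::RFC3339 code.

my $stamp = '2013-03-28T11:57:30.025350+00:00';

if ( $stamp =~ m{ ^ (\d{4}) - (\d{2}) - (\d{2})   # date
                  T (\d{2}) : (\d{2}) : (\d{2})   # time
                  (?: \. (\d+) )?                 # optional fraction
                  ( Z | [+-] \d{2} : \d{2} ) $ }x ) {
    my ( $year, $month, $day, $hour, $min, $sec, $frac, $zone )
        = ( $1, $2, $3, $4, $5, $6, $7, $8 );
    print "date: $year-$month-$day time: $hour:$min:$sec zone: $zone\n";
}
</pre>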

    <h3>Filtering</h3>

    <p>After a log message has been parsed into various fields as
    an <em>event</em> it is passed through a stack of filters. All
    events will go through the filter stack in the same sequence.  It
    is possible to make decisions in one filter based on the results
    of previous filters. If one or more filters declare an interest in
    an event it will be stored. It is not possible for a filter to
    overturn a positive vote from any previous filter.</p>

    <p>A filter is an implementation of
    the <code>BuzzSaw::Filter</code> Moose role. Currently there are
    the following filters: Cosign, Kernel, Sleep, SSH and
    UserClassifier. Most of them are straightforward filters that
    examine events and, where appropriate, declare an interest along
    with some tags or other information. The UserClassifier module is
    slightly different in that it never declares an interest; it just
    adds extra details when the userid field has
    been set by any previous filter in the stack (e.g. Cosign or
    SSH). Typically this module is added last in the stack so that it
    can process the userid value from any previous filter.</p>
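
    <p>A minimal filter might look something like the following
    sketch. The <code>check</code> method name and its return values
    are assumptions used to illustrate the voting idea, not a
    definitive statement of the <code>BuzzSaw::Filter</code>
    interface.</p>

<pre>
package BuzzSaw::Filter::Example;    # hypothetical filter
use Moose;

with 'BuzzSaw::Filter';

# Vote on each event; here we declare an interest in any kernel
# out-of-memory message and attach a tag. The method name and
# return convention are illustrative assumptions.
sub check {
    my ( $self, $event ) = @_;

    if ( $event->{program} eq 'kernel'
         and $event->{message} =~ /Out of memory/i ) {
        return ( 1, 'oom_killer' );    # interested, with a tag
    }

    return (0);                        # not interested
}

__PACKAGE__->meta->make_immutable;
1;
</pre>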

    <h2>Reporting</h2>

    <p>The reporting process is driven by
      the <code>BuzzSaw::Reporter</code> Perl module. This module has
      a record of reports which should be generated on an hourly,
      daily, weekly or monthly basis. It can be run in two modes:
      either it is limited to running a specific set of reports
      (e.g. only hourly) or it runs all jobs of all types which have
      not been run recently enough. In the latter case, if a weekly
      job has not been run
      for 8 days it would be run immediately. A record is kept of when
      each report was last run.</p>
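
    <p>In practice the two modes might be invoked along these lines;
    the method names are assumptions based on the description
    above.</p>

<pre>
# Hypothetical sketch of the two reporting modes; the method
# names are assumptions, not the real BuzzSaw::Reporter API.

use BuzzSaw::Reporter;

my $reporter = BuzzSaw::Reporter->new;

# Mode 1: limit the run to a specific set of reports.
$reporter->generate_hourly_reports;

# Mode 2: run every job of every type which has not been run
# recently enough, e.g. a weekly job last run 8 days ago.
$reporter->generate_all_reports;
</pre>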

    <p>A report will select all events which have certain tags and
    which occurred within a specified time period. The ordering of
    the event records retrieved can be controlled.</p>

    <p>A report can be generated using the
    generic <code>BuzzSaw::Report</code> module or, more typically, by
    implementing a specific sub-class which is used to specify the
    names of the relevant tags, the time period of interest, the name
    of the template to be used, etc. For convenience, when using a
    sub-class most of these attributes will have sensible defaults
    based on the name of the Perl module.</p>
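
    <p>A simple sub-class might look like the following sketch; the
    attribute names and their defaults are assumptions chosen to
    illustrate the convention.</p>

<pre>
package BuzzSaw::Report::SSH;    # hypothetical sub-class
use Moose;

extends 'BuzzSaw::Report';

# The attribute names (tags, period, template) are assumptions;
# by convention most would default sensibly from the module name,
# e.g. a tag of 'ssh' and a template named 'ssh.tt'.
has '+tags'     => ( default => sub { ['ssh'] } );
has '+period'   => ( default => 'daily' );
has '+template' => ( default => 'ssh.tt' );

__PACKAGE__->meta->make_immutable;
1;
</pre>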

    <p>A sub-class of the <code>BuzzSaw::Report</code> module can
    override specific parts of the process to do additional complex
    processing beyond the straightforward selection of events and
    subsequent printing of the raw data. For example, the Kernel
    report carries out extra analysis of the kernel logs to collate
    events which are associated with particular types of problem
    (e.g. an out-of-memory error or a kernel panic).</p>

    <p>A report is generated by passing the events and any results
    from additional processing to a template which is handled using
    the Perl Template Toolkit. A report can be simply printed to
    stdout or sent via email to multiple recipients.</p>
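
    <p>The final rendering stage is, in essence, a standard Template
    Toolkit call, something like this sketch. The template name, the
    include path and the variables passed in are assumptions for
    illustration.</p>

<pre>
# Sketch of the final rendering step using the Perl Template
# Toolkit; the template name and the variables passed in are
# assumptions.

use Template;

my @events = ();    # event records selected from the database
my %extra  = ();    # results of any additional processing

my $tt = Template->new(
    { INCLUDE_PATH => '/usr/share/buzzsaw/templates' } );

my $output;
$tt->process( 'kernel.tt',
              { events => \@events, extra => \%extra },
              \$output )
    or die $tt->error();

print $output;    # or hand the text to an email sender instead
</pre>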

  </body>
</html>