The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Regexp::Assemble - Assemble multiple Regular Expressions into one RE

VERSION

This document describes version 0.02 of Regexp::Assemble, released 2004-11-19.

SYNOPSIS

  use Regexp::Assemble;
  
  my $ra = Regexp::Assemble->new;
  $ra->add( 'ab+c' );
  $ra->add( 'ab+\\d*\\s+c' );
  $ra->add( 'a\\w+\\d+' );
  $ra->add( 'a\\d+' );
  print $ra->re; # prints (?:a(?:b+(?:\d*\s+)?c|(?:\w+)?\d+))

DESCRIPTION

Regexp::Assemble allows you to take a number of regular expressions and assemble them into a single regular expression (or RE) that will match everything that any of the individual REs match, only what they match and nothing else.

The assembled RE is more sophisticated than a brute force join( '|', @list) concatenation. Common subexpressions are shared; alternations are introduced only when patterns diverge. As a result, backtracking is kept to a minimum. If a given path fails... there are no other paths to try and so the expression fails quickly. If no wildcards like .* appear, no backtracking will be performed. In very large expressions, this can provide a large speed boost.

As a result, instead of having a large list of expressions to loop over, the string only needs to be tested against one expression. This is especially interesting when on average the expression fails to match most of the time.

This is useful when you have a large number of patterns that you want to apply against a string. If you only that one of the patterns matched then this module is for you. Nonetheless, large numbers of alternations are dealt with in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie.

Some more examples of usage appear in the accompanying README.

METHODS

new

Creates a new Regexp::Assemble object. A set of key/value parameters can be supplied to control the finer details of the object's behaviour.

chomp, controls whether the pattern should be chomped before being lexed. Handy if you are reading lines from a file. By default, no chomping is performed.

filter, allows you to add a callback to enable sanity checks on the pattern being loaded.

mutable, controls whether new patterns can be added to the object after the RE is generated.

reduce, controls whether tail reduction occurs or not. If set, patterns like a(?:bc+d|ec+d) will be reduced to a[be]c+d. That is, the the end of the pattern in each part of the b... and d... alternations is identical, and hence is hoisted out of the alternation and placed after it.

lex, specifies the pattern used to lex the input lines into tokens. You could replace the default pattern by a more sophisticated version that matches arbitrarily nested parentheses, for example.

debug, controls whether copious amounts of output is produced during the loading stage or the reducing stage of assembly.

  my $ra = Regexp::Assemble->new;
  my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );

A more detailed explanation of these attributes follows.

clone

Clones the contents of a Regexp::Assemble object and creates a new object (in other words it performs a deep copy).

  my $copy = $ra->clone();

If the Storable module is installed, its dclone method will be used, otherwise the cloning will be performed using a pure perl approach.

add(LIST)

Takes a string, breaks it apart into a set of token (respecting meta characters) and inserts the resulting list into the R::A object. It uses a naive regular expression to lex the string that may be fooled complex expressions (specifically, it will fail to lex nested parenthetical expressions such as ab(?cd(?:ef)?gh) correctly). If this is the case, the end of the string will not be tokenised correctly and returned as one long string.

On the one hand, this may indicate that the patterns you are trying to feed the R::A object are too complex. Simpler patterns might allow the algorithm to work more effectively and spot reductions in the resulting pattern.

On the other hand, you can supply your own pattern to perform the lexing if you need, and if that is not enough, you can provide a code block for the ultimate in lexing flexibility.

A list of strings may be supplied, thus you can pass it a file handle of a file opened for reading:

    $re->add( '\d+-\d+-\d+-\d+\.example\.com' );
    $re->add( <IN> );

You propably need to set the chomp attribute on the object to get files to work correctly:

    my $re = Regexp::Assemble->new( chomp=>1 )->add( <IN> );
    # or
    my $re = Regexp::Assemble->new;
    $re->chomp(1)->add( <IN> );

This method is chainable.

insert(LIST)

Takes a list of tokens representing a regular expression and stores them in the object. Note: you should not pass it a bare regular expression, such as ab+c?d*e. You must pass it as a list of tokens, e.g. ('a', 'b+', 'c?', 'd*', 'e').

This method is chainable, e.g.:

  my $ra = Regexp::Assemble->new
    ->insert( qw[ a b+ c? d* e ] )
    ->insert( qw[ a c+ d+ e* f ] );

The add method calls insert internally.

as_string

Assemble the expression and return it as a string. You may want to do this if you are writing the pattern to a file.

If you have not set the mutable attribute on the object, calling this method will drain the internal data structure. Large numbers of patterns can eat a significant amount of memory, and this lets perl recover the memory used. Mutable means that you want to keep the internal data structure lying around in order to add additional patterns to the object after an assembled pattern has been produced.

re

Assembles the pattern and return it as a compiled RE, using the qr// operator. Note that there is no way of specifying flags to the qr operator. If you need to do something fancy, then you should retrieve the string using the as_string operator and apply the required qr//imx-type flags yourself.

As with as_string, calling this method will reset the internal data structures to free the memory used in assembling the RE. This behaviour can be controlled with the mutable attribute.

With method chaining, it is possible to produce a RE without having a temporary Regexp::Assemble object lying around, e.g.:

  my $re = Regexp::Assemble->new
    ->add( q[ab+cd+e] )
    ->add( q[ac\\d+e] )
    ->add( q[c\\d+e] )
    ->re;

The $re variable now contains a Regexp object that can be used directly:

  while( <> ) {
    /$re/ and print "Something in [$_] matched\n";
  )

Note that the re method is overloaded in string context, so you don't even need to save the RE in a separate variable. The following will work as expected:

  my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
  while( <IN> ) {
    if( /($re)/ ) {
      print "Here be giants: $1\n";
    }
  }
debug(NUMBER)

Turns debugging on or off. Statements are printed to the currently selected file handle (STDOUT by default). If you are already using this handle, you will have to arrange to select an output handle to a file of your own choosing, before call the add, as_string or re) functions, otherwise it will scribble all over your carefully formatted output.

0 turns debugging off, 1 emits debug traces dealing with adding new patterns to the object, 2 emits debug traces dealing with the reduction of the assembled pattern.

Values can be added (or or'ed together) to trace everything

   $r->debug(1)->add( '\\d+abc' );
chomp(0|1)

Turns chomping on or off. When loading an object from a file the lines will contain their line separator token. This may produce undesired results. In this case, call chomp with a value of 1 to enable autochomping. Chomping is off by default.

  $re->chomp( 1 );
  $re->add( <DATA> );
filter(CODE)

Allows you to install a callback to check that the pattern being loaded contains valid input. It receives a list on input. It is expected to return 0 to indicate that the pattern should not be added, anything else indicates that the contents are fine.

If you know that all patterns you expect to assemble contain a restricted set of of tokens (e.g. no spaces), you could do the following:

  $ra->filter(sub { not grep { / / } @_ });

or

  sub only_spaces_and_digits {
    not grep { ![\d ]
  }
  $ra->filter( \&only_spaces_and_digits );

These two examples will silently ignore faulty patterns, If you want the user to be made aware of the problem you should raise an error (via warn or die), log an error message, whatever is best. If you want to remove a filter, pass undef as a parameter.

  $ra->filter(undef);

This method is chainable.

lex(SCALAR)

Change the pattern used to break a string apart into tokens.

reduce(0|1)

Turns pattern reduction on or off. A reduced pattern may be considerably shorter than an unreduced pattern. Consider /sl(?:ip|op|ap)/ versus /sl[aio]p/. An unreduced pattern will be very similar to those produced by Regexp::Optimizer. Reduction is on by default.

mutable(0|1)

When the re or as_string methods are called the reduction algoritm kicks in and takes the current data structure and fold the common portions of the patterns that have been stored in the object. Once this occurs, it is no longer possible to add any more patterns.

In fact, the internal structures are release to free up memory. If you have a programa that adds additional patterns to an object over a long period, you can set the mutable attribute. This will stop the internal structure from being drained and you can continue to add patterns.

The main consequence is that it the assembled pattern will not undergo any reduction (as the internal data structure undergoes such a transformation as that it becomes very difficult to cope with the change that adding a new pattern would bring about. If this is a problem, the solution is to create a non-mutable object, continue adding to it as long as needed, and each time a new assembled regular expression is required, clone the object, turn the mutable attribute off and proceed as usual.

By default the mutable attribute defaults to zero. The method can be chained.

  $r->add( 'abcdef\\d+' )->mutable(0);
reset

Resets the internal state of the object, as if no add or insert methods had been called. Does not modify the state of controller attributes such as debug, lex, reduce and so forth.

Default_Lexer

Warning: the Default_Lexer function is a class method, not an object method. Do not call it on an object, bad things will happen.

  Regexp::Assemble::Default_Lexer( '.|\\d+' );

The Default_Lexer method lets you replace the default pattern used for all subsequently created Regexp::Assemble objects. It will not have any effect on existing objects. (It is also possible to override the lexer pattern used on a per-object basis).

The parameter should be an ordinary scalar, not a compiled pattern. The pattern should be capable of matching all parts of the string, i.e., the following constraint should hold true:

  $_ = 'this will be matched by a pattern';
  my $pattern = qr/./;
  my @token = m/($pattern)/g;
  $_ eq join( '' => @token ) or die;

If no parameter is supplied, the current default pattern in use will be returned.

DIAGNOSTICS

"ARRAY passed to _insert"

You have passed a complex data structure (a list of lists of lists) to add or insert. Solution: Don't do that.

"don't pass a refname to Default_Lexer"

You tried to replace the default lexer pattern with an object instead of a scalar. Solution: You probably tried to call $obj->Default_Lexer. Call qualified class method instead Regexp::Assemble::Default_Lexer.

"got a scalar failure (that's not supposed to happen)"

The module got incredibly confused. Solution: See if there is a more recent version of the module available. Or try and use the smallest number of patterns that continues to produce the error, send me the details and wait for a patch.

"lexed a zero-length token, an infinite loop would ensue"

The add method eats the input string one token at a time. If you supply your own lexer pattern, it may match the null character, in which case the loop that processes the string will loop an infinite number of times. Solution: Fix your pattern.

"was not passed an ARRAY ref"

A coding error. Can only occur when debugging is active. Solution: turn off debugging (it's off by default).

"was not passed a HASH ref"

A coding error. Can only occur when debugging is active. Solution: turn off debugging (it's off by default).

NOTES

The expressions produced by this module can be used with the PCRE library.

Where possible, feed R::A the simplest tokens possible. Don't add :a(?-\d+){2})b when a-\d+-\d+b. he reason is that if you also add a\d+c the resulting REs change dramatically: a(?:(?:-\d+){2}b|-\d+c) versus a-\d+(?:-\d+b|c). Since R::A doesn't analyse tokens, it doesn't know how to "unroll" the {2} quantifier, and will fail to notice the divergence after the first -d\d+.

What is more, when the string 'a-123000z' is matched against the first pattern, the regexp engine will have to backtrack over each alternation before determining that there is no match. In the second pattern, no such backtracking occurs. The engine scans up to the 'z' and then fails immediately, since neither of the alternations start with 'z'.

R::A does, however, understand character classes. Given a-b, axb and a\db, it will assemble these into a[-\dx]b. When - appears as a candidate for a character class it will be the first character in the class. When ^ appears as a candidate for a character class it will be the last character in the class. R::A will also replace all the digits 0..9 appearing in a character class by \d. I'd do it for letters as well, but thinking about accented characters and other glyphs hurts my head.

SEE ALSO

Regex::PreSuf

Regex::PreSuf takes a string and chops it itself into tokens of length 1. Since it can't deal with tokens of more than one character, it can't deal with meta-characters and thus no regular expressions. Which is the main reason why I wrote this module.

Regexp::Optimizer

Regexp::Optimizer produces regular expressions that are similar to those produced by R::A with reductions switched off.

Text::Trie

Text::Trie is well worth investigating. Tries can outperform very bushy (read: many alternations) patterns.

LIMITATIONS

The current version does not let you know which initial regular expression, that forms part of the assembled RE, was the one that caused the match. This may be addressed in a future version. (The solution probably lies in using the (?{code}) contruct to assist in tagging the original expressions).

Regexp::Assemble does not attempt to find common substrings. For instance, it will not collapse /aabababc/ down to /a(?:ab}{3}c/. If there's a module out there that performs this sort of string analysis I'd like to know about it. But keep in mind that the algorithms that do this are very expensive: quadratic or worse.

Regexp::Assemble does not emit lookahead or lookbehind assertions in the expressions it produces. Nor is the wonderful and frightening (?...)> construct used. These and other optimisations may be introduced in further releases if there is a demand for them, if I figure out how to use them in this module, or if someone sends patches.

You can't remove a pattern that has been added to an object. You'll just have to start over again. Adding a pattern is difficult enough, I'd need a solid argument to convince me to add a remove method. If you need to do this you should read the documentation on the mutable and clone methods.

BUGS

The module does not produce POSIX-style regular expressions.

The algorithm used to assemble the regular expressions makes extensive use of mutually-recursive functions (i.e.: A calls B, B calls A, ...) For deeply similar expressions, it may be possible to provoke "Deep recursion" warnings.

The code is far too messy. I need to better understand which code paths are exercised when, and factor out redundant blocks.

Regexp::Assemble does not analyse the tokens it is given. This means is that if, for instance, the following two pattern lists are given: a\d and a\d+, it will not determine that \d can be matched by \d+. Instead, it will produce a(?:\d|\d+).

When a string is tested against an assembled RE, it may match when it shouldn't, or else not match when it should. The module has been tested extensively, and has an extensive test suite, but you never know... In either case, it is a bug, and I want to know about it.

A temporary work-around is to disable reductions:

  my $pattern = $assembler->reduce(0)->re;

A discussion about implementation details and where I think bugs might lie appears in the accompanying README file. If this file is not available locally, you should be able to find a copy on the Web at your nearest CPAN mirror.

perl -MRegexp::Assemble -le 'print Regexp::Assemble::VERSION'
perl -V

If you are feeling brave, extensive debugging traces are available.

Please report all bugs at http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble|rt.cpan.org

ACKNOWLEDGEMENTS

This module grew out of work I did building access maps for Postfix, a modern SMTP mail transfer agent. See http://www.postfix.org/ for more information. I used Perl to build large regular expressions for blocking dynamic/residential IP addresses to cut down on spam and viruses. Once I had the code running for this, it was easy to start adding stuff to block really blatant spam subject lines, bogus HELO strings, spammer mailer-ids and more...

I presented the work at the French Perl Workshop in 2004, and the thing most people asked was whether the underlying mechanism for assembling the REs was available as a module. At that time it was nothing more that a twisty maze of scripts, all different. The interest shown indicated that a module was called for. I'd like to thank the people who showed interest. Hey, it's going to make my messy scripts smaller, in any case.

Thanks go to Jean Forget and Philippe Blayo who looked over an early version, H. Merijn Brandt, who stopped over in Paris one evening, and discussed things over a few beers. Nicholas Clark pointed out that while what this module does (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he encouraged to me continue with the development of this module. In any event, this module allows one to gauge the difficulty of undertaking the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.

A tip 'o the hat to Paul Johnson who settled the question as to whether this module should live in the Regex:: namespace, or Regexp:: namespace. If you're not convinced, try running the following one-liner:

      C<perl -le 'print ref qr//'>

Thanks also to broquaint on Perlmonks, who answered a question pertaining to traversing an early version of the underlying data structure used by the module. (Sadly, that code is no more).

AUTHOR

David Landgren, david@landgren.net

Copyright (C) 2004

http://www.landgren.net/perl/

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 1479:

You forgot a '=back' before '=head1'