The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Regexp::Assemble - Assemble multiple Regular Expressions into a single RE

VERSION

This document describes version 0.15 of Regexp::Assemble, released 2005-04-27.

SYNOPSIS

  use Regexp::Assemble;
  
  my $ra = Regexp::Assemble->new;
  $ra->add( 'ab+c' );
  $ra->add( 'ab+-' );
  $ra->add( 'a\w\d+' );
  $ra->add( 'a\d+' );
  print $ra->re; # prints a(?:\w?\d+|b+[-c])

DESCRIPTION

Regexp::Assemble takes an arbitrary number of regular expressions and assembles them into a single regular expression (or RE) that will match all that each of the individual REs match.

As a result, instead of having a large list of expressions to loop over, the string only needs to be tested against one expression. This is interesting when you have several thousand patterns to deal with. Serious effort is made to produce the smallest pattern possible.

It is also possible to track the orginal patterns, so that you can determine which, among the source patterns that form the assembled pattern, was the one that caused the match to occur.

You should realise that large numbers of alternations are processed in perl's regular expression engine in O(n) time, not O(1). If you are still having performance problems, you should look at using a trie. Note that Perl's own regular expression engine will implement trie optimisations in perl 5.10 (they are already available in perl 5.9.3 if you want to try them out). Regexp::Assemble will do the right thing when it knows it's running on a a trie'd perl. (At least in some version after this one).

Some more examples of usage appear in the accompanying README. If that file isn't easy to access locally, you can find it on a web repository such as http://search.cpan.org/ or http://kobesearch.cpan.org/.

METHODS

new

Creates a new Regexp::Assemble object. A set of key/value parameters can be supplied to control the finer details of the object's behaviour.

flags, sets the flags imsx that should be applied to the resulting pattern. Warning: no error checking is done, you should ensure that the flags you pass are understood by the version of Perl in use.

chomp, controls whether the pattern should be chomped before being lexed. Handy if you are reading lines from a file. By default, no chomping is performed. See the $/ variable in perlvar.

lookahead, controls whether the pattern should contain zero-width lookahead assertions (i.e. (?=[abc])(?:bob|alice|charles). This is not activated by default, because in many circumstances the cost of processing the assertion itself outweighs the benefit of its faculty for short-circuiting a match that will fail. This is sensitive to the probability of a match succeding, so if you're worried about performance you'll have to benchmark a sample population of targets to see which way the benefits lie.

track, controls whether you want know which of the initial patterns was the one that matched. See the matched method for more details. Note that in this mode of operation THERE ARE SECURITY IMPLICATIONS OF WHICH YOU YOU SHOULD BE AWARE.

indent, the number of spaces used to indent nested grouping of a pattern. Use this to produce a pretty-printed pattern. See the as_string method for a more detailed explanation.

pre_filter, allows you to add a callback to enable sanity checks on the pattern being loaded. This callback is triggered before the pattern is split apart by the lexer. In other words, it operates on the entire pattern. If you are loading patterns from a file, this would be an appropriate place to remove comments.

filter, allows you to add a callback to enable sanity checks on the pattern being loaded. This callback is triggered after the pattern has been split apart by the lexer.

mutable, controls whether new patterns can be added to the object after the RE is generated.

reduce, controls whether tail reduction occurs or not. If set, patterns like a(?:bc+d|ec+d) will be reduced to a[be]c+d. That is, the end of the pattern in each part of the b... and d... alternations is identical, and hence is hoisted out of the alternation and placed after it.

lex, specifies the pattern used to lex the input lines into tokens. You could replace the default pattern by a more sophisticated version that matches arbitrarily nested parentheses, for example.

debug, controls whether copious amounts of output is produced during the loading stage or the reducing stage of assembly.

  my $ra = Regexp::Assemble->new;
  my $rb = Regexp::Assemble->new( chomp => 1, debug => 3 );

A more detailed explanation of these attributes follows.

clone

Clones the contents of a Regexp::Assemble object and creates a new object (in other words it performs a deep copy).

  my $copy = $ra->clone();

If the Storable module is installed, its dclone method will be used, otherwise the cloning will be performed using a pure perl approach.

add(LIST)

Takes a string, breaks it apart into a set of tokens (respecting meta characters) and inserts the resulting list into the R::A object. It uses a naive regular expression to lex the string that may be fooled complex expressions (specifically, it will fail to lex nested parenthetical expressions such as ab(cd(ef)?gh)ij correctly). If this is the case, the end of the string will not be tokenised correctly and returned as one long string.

On the one hand, this may indicate that the patterns you are trying to feed the R::A object are too complex. Simpler patterns might allow the algorithm to work more effectively and spot reductions in the resulting pattern.

On the other hand, you can supply your own pattern to perform the lexing if you need. The test suite contains an example of a lexer pattern that will match one level of nested parentheses.

Note that there is an internal optimisation that will bypass a much of the lexing process. If a string contains no \ (backslash), [ (open square bracket), ( (open paren), ? (question mark), + (plus), * (star) or { (open curly), a character split will be performed directly.

A list of strings may be supplied, thus you can pass it a file handle of a file opened for reading:

    $re->add( '\d+-\d+-\d+-\d+\.example\.com' );
    $re->add( <IN> );

You probably need to set the chomp attribute on the object to get files to work correctly:

    my $re = Regexp::Assemble->new( chomp=>1 )->add( <IN> );
    # or
    my $re = Regexp::Assemble->new;
    $re->chomp(1)->add( <IN> );

If the file is very large, it may be more efficient to use a while loop, to read the file line-by-line:

    $re->->add($_) while <IN>;

The pre_filter method provides allows you to filter input on a line-by-line basis.

This method is chainable.

insert(LIST)

Takes a list of tokens representing a regular expression and stores them in the object. Note: you should not pass it a bare regular expression, such as ab+c?d*e. You must pass it as a list of tokens, e.g. ('a', 'b+', 'c?', 'd*', 'e').

This method is chainable, e.g.:

  my $ra = Regexp::Assemble->new
    ->insert( qw[ a b+ c? d* e ] )
    ->insert( qw[ a c+ d+ e* f ] );

The add method calls insert internally.

The filter method allows you to accept or reject the addition of the entire pattern on a token-by-token basis.

as_string

Assemble the expression and return it as a string. You may want to do this if you are writing the pattern to a file. The following arguments can be passed to control the aspect of the resulting pattern:

indent, the number of spaces used to indent nested grouping of a pattern. Use this to produce a pretty-printed pattern (for some definition of "pretty"). The resulting output is rather verbose. The reason is to ensure that the metacharacters (?: and ) always occur on otherwise empty lines. This allows you grep the result for an even more synthetic view of the pattern:

  egrep -v '^ *[()]' <regexp.file>

The result of the above is quite readable. Remember to backslash the spaces appearing in your own patterns if you wish to use an indented pattern in an m/.../x construct. Indenting is ignored if tracking is enabled.

The indent argument takes precedence over the indent method/attribute of the object.

If you have not set the mutable attribute on the object, calling this method will drain the internal data structure. Large numbers of patterns can eat a significant amount of memory, and this lets perl recover the memory used. Mutable means that you want to keep the internal data structure lying around in order to add additional patterns to the object after an assembled pattern has been produced.

If you want to reduce the pattern and continue to add new patterns, clone the object and reduce the clone, leaving the original object alone.

re

Assembles the pattern and return it as a compiled RE, using the qr// operator.

As with as_string, calling this method will reset the internal data structures to free the memory used in assembling the RE. This behaviour can be controlled with the mutable attribute.

The indent attribute, documented in the as_string method, can be used here (it will be ignored if tracking is enabled).

With method chaining, it is possible to produce a RE without having a temporary Regexp::Assemble object lying around, e.g.:

  my $re = Regexp::Assemble->new
    ->add( q[ab+cd+e] )
    ->add( q[ac\\d+e] )
    ->add( q[c\\d+e] )
    ->re;

The $re variable now contains a Regexp object that can be used directly:

  while( <> ) {
    /$re/ and print "Something in [$_] matched\n";
  )

The re method is called when the object is used in string context (hence, withing an m// operator), so by and large you do not even need to save the RE in a separate variable. The following will work as expected:

  my $re = Regexp::Assemble->new->add( qw[ fee fie foe fum ] );
  while( <IN> ) {
    if( /($re)/ ) {
      print "Here be giants: $1\n";
    }
  }

This approach does not work with tracked patterns. The match and matched methods must be used instead, see below.

match(SCALAR)

If pattern tracking is in use, you must use re 'eval' in order to make things work correctly. At a minimum, this will make your code look like this:

    my $did_match = do { use re 'eval'; $target =~ /$ra/ }
    if( $did_match ) {
        print "matched ", $ra->matched, "\n";
    }

(The main reason is that the $^R variable is currently broken and an ugly workaround is required. See Perl bug #32840 for more information if you are curious. The README also contains more information).

The important thing to note is that with use re 'eval', THERE ARE SECURITY IMPLICATIONS WHICH YOU IGNORE AT YOUR PERIL. The problem is this: if you do not have strict control over the patterns being fed to Regexp::Assemble when tracking is enabled, and someone slips you a pattern such as /^(?{system 'rm -rf /'})/ and you attempt to match a string against the resulting pattern, you will know Fear and Loathing.

What is more, the $^R workaround means that that tracking does not work if you perform a bare /$re/ pattern match as shown above. You have to instead call the match method, in order to supply the necessary context to take care of the tracking housekeeping details.

   if( defined( my $match = $ra->match($_)) ) {
       print "  $_ matched by $match\n";
   }

In the case of a successul match, the original matched pattern is returned directly. The matched pattern will also be available through the matched method.

(Except that the above is not true for 5.6.0: the match method returns true or undef, and the matched method always returns undef).

If you are capturing parts of the pattern e.g. foo(bar)rat you will want to get at the captures. See the mbegin, mend and mvar methods. If you are not using captures then you may safely ignore this section.

mbegin

This method returns a copy of @- at the moment of the last match. You should ordinarily not need to bother with this, mvar should be able to supply all your needs.

mend

This method returns a copy of @+ at the moment of the last match.

mvar(NUMBER)

The mvar method returns the captures of the last match. mvar(1) corresponds to $1, mvar(2) to $2, and so on. mvar(0) happens to return the target string matched, as a byproduct of walking down the @- and @+ arrays after the match.

If called without a parameter, mvar will return a reference to an array containing all captures.

matched

If pattern tracking has been set, via the track attribute, or through the track method, this method will return the original pattern of the last successful match. Returns undef match has yet been performed, or tracking has not been enabled.

See below in the NOTES section for additional subtleties of which you should be aware of when tracking patterns.

Note that this method is not available in 5.6.0, due to limitations in the implementation of (?{...}) at the time.

debug(NUMBER)

Turns debugging on or off. Statements are printed to the currently selected file handle (STDOUT by default). If you are already using this handle, you will have to arrange to select an output handle to a file of your own choosing, before call the add, as_string or re) functions, otherwise it will scribble all over your carefully formatted output.

0 turns debugging off, 1 emits debug traces dealing with adding new patterns to the object, 2 emits debug traces dealing with the reduction of the assembled pattern. 4 emits debug traces dealing with the lexing of the input pattern. Calling with no arguments also turns debugging off.

Values can be added (or or'ed together) to trace everything

   $r->debug(1)->add( '\\d+abc' );
dump

Produces a synthetic view of the internal data structure. How to interpret the results is left as an exercise to the reader.

  print $r->dump;
chomp(0|1)

Turns chomping on or off. When loading an object from a file the lines will contain their line separator token. This may produce undesired results. In this case, call chomp with a value of 1 to enable autochomping. Chomping is off by default.

  $re->chomp( 1 );
  $re->add( <DATA> );
indent(NUMBER)

Sets the level of indent for pretty-printing nested groups within a pattern. See the as_string method for more details. When called without a parameter, no indenting is performed.

  $re->indent( 4 );
  print $re->as_string;
lookahead(0|1)

Turns on zero-width lookahead assertions. This is usually beneficial when you expect that the pattern will usually fail. If you expect that the pattern will usually match you will probably be worse off.

flags(STRING)

Sets the flags that govern how the pattern behaves (for versions of Perl up to 5.9 or so, these are imsx). By default no flags are enabled.

track(0|1)

Turns tracking on or off. When this attribute is enabled, additional housekeeping information is inserted into the assembled expression using ({...} embedded code constructs. This provides the necessary information to determine which, of the original patterns added, was the one that caused the match.

  $re->track( 1 );
  if( $target =~ /$re/ ) {
    print "$target matched by ", $re->matched, "\n";
  }

Note that when this functionality is enabled, no reduction is performed and no character classes are generated. In other words, brag|tag is not reduced down to (?:br|t)ag and dig|dim is not reduced to di[gm].

pre_filter(CODE)

Allows you to install a callback to check that the pattern being loaded contains valid input. It receives the pattern as a whole to be added, before it been tokenised by the lexer. It may to return 0 or undef to indicate that the pattern should not be added, any true value indicates that the contents are fine.

TODO: how to remove comments

If you want to remove the filter, pass undef as a parameter.

  $ra->pre_filter(undef);

This method is chainable.

filter(CODE)

Allows you to install a callback to check that the pattern being loaded contains valid input. It receives a list on input, after it has been tokenised by the lexer. It may to return 0 or undef to indicate that the pattern should not be added, any true value indicates that the contents are fine.

If you know that all patterns you expect to assemble contain a restricted set of of tokens (e.g. no spaces), you could do the following:

  $ra->filter(sub { not grep { / / } @_ });

or

  sub only_spaces_and_digits {
    not grep { ![\d ] } @_
  }
  $ra->filter( \&only_spaces_and_digits );

These two examples will silently ignore faulty patterns, If you want the user to be made aware of the problem you should raise an error (via warn or die), log an error message, whatever is best. If you want to remove a filter, pass undef as a parameter.

  $ra->filter(undef);

This method is chainable.

lex(SCALAR)

Change the pattern used to break a string apart into tokens. You can examine the eg/naive script as a starting point.

reduce(0|1)

Turns pattern reduction on or off. A reduced pattern may be considerably shorter than an unreduced pattern. Consider /sl(?:ip|op|ap)/ versus /sl[aio]p/. An unreduced pattern will be very similar to those produced by Regexp::Optimizer. Reduction is on by default. Turning it off makes assembly much faster.

mutable(0|1)

When the re or as_string methods are called the reduction algoritm kicks in and takes the current data structure and fold the common portions of the patterns that have been stored in the object. Once this occurs, it is no longer possible to add any more patterns.

In fact, the internal structures are release to free up memory. If you have a programa that adds additional patterns to an object over a long period, you can set the mutable attribute. This will stop the internal structure from being drained and you can continue to add patterns.

The main consequence is that it the assembled pattern will not undergo any reduction (as the internal data structure undergoes such a transformation as that it becomes very difficult to cope with the change that adding a new pattern would bring about. If this is a problem, the solution is to create a non-mutable object, continue adding to it as long as needed, and each time a new assembled regular expression is required, clone the object, turn the mutable attribute off and proceed as usual.

By default the mutable attribute defaults to zero. The method can be chained.

  $r->add( 'abcdef\\d+' )->mutable(0);
reset

Resets the internal state of the object, as if no add or insert methods had been called. Does not modify the state of controller attributes such as debug, lex, reduce and the like.

Default_Lexer

Warning: the Default_Lexer function is a class method, not an object method. It is a fatal error to call it as an object method.

The Default_Lexer method lets you replace the default pattern used for all subsequently created Regexp::Assemble objects. It will not have any effect on existing objects. (It is also possible to override the lexer pattern used on a per-object basis).

The parameter should be an ordinary scalar, not a compiled pattern. If the pattern fails to match all parts of the string, the missing parts will be returned as single chunks. Therefore the following pattern is legal (albeit rather cork-brained):

    Regexp::Assemble::Default_Lexer( '\\d' );

The above pattern will split up input strings digit by digit, and all non-digit characters as single chucks.

DIAGNOSTICS

"don't pass a refname to Default_Lexer"

You tried to replace the default lexer pattern with an object instead of a scalar. Solution: You probably tried to call $obj->Default_Lexer. Call qualified class method instead Regexp::Assemble::Default_Lexer.

"filter method not passed a coderef"

A reference to a subroutine (anonymous or otherwise) was expected. Solution: read the documentation for the filter method.

NOTES

This module has been tested successfully with a range of versions of perl, from 5.005_03 to 5.9.2. Use of 5.6.0 is not recommended.

The expressions produced by this module can be used with the PCRE library.

Where possible, supply the simplest tokens possible. Don't add X(?-\d+){2})Y when X-\d+-\d+Y will do. The reason is that if you also add X\d+Z the resulting assembly changes dramatically: X(?:(?:-\d+){2}Y|-\d+Z) versus X-\d+(?:-\d+Y|Z). Since R::A doesn't analyse tokens, it doesn't "unroll" the {2} quantifier, and will fail to notice the divergence after the first -d\d+.

What is more, when the string 'X-123000P' is matched against the first assembly, the regexp engine will have to backtrack over each alternation (the one that ends in Y and the one that ends in Z) before determining that there is no match. No such backtracking occurs in the second pattern: as soon as the engine encounters the 'P' in the target string, neither of the alternations at that point (-\d+Y or Z) succeed and the match fails.

Regexp::Assemble does, however, know how to build character classes. Given a-b, axb and a\db, it will assemble these into a[-\dx]b. When - (dash) appears as a candidate for a character class it will be the first character in the class. When ^ (circumflex) appears as a candidate for a character class it will be the last character in the class.

It also knows about meta-characters than can "absorb" regular characters. For instance, given X\d and X5, it knows that 5 can be represented by \d and so the assembly is just X\d. The "absorbent" meta-characters it deals with are ., \d, \s and \W and their complements. It also knows that \d and \D (and similar pairs) can be replaced by ..

Regexp::Assemble deals correctly with quotemeta's propensity to backslash many characters that have no need to be. Backslashes on non-metacharacters will be removed. Similarly, in character classes, a number of characters lose their magic and so no longer need to be backslashed within a character class. Two common examples are . (dot) and $. Such characters will lose their backslash.

At the same time, it will also process \Q...\E sequences. When such a sequence is encountered, the inner section is extracted and quotemeta is applied to the section. The resulting quoted text is then used in place of the original unquoted text, and the \Q and \E metacharacters are thrown awa. Similar processing occurs with the \U...\E and \L...\E sequences.

Regexp::Assemble will also replace all the digits 0..9 appearing in a character class by \d. I'd do it for letters as well, but thinking about accented characters and other glyphs hurts my head.

In an alternation, the longest paths are chosen first (e.g. horse|bird|dog). When two paths have the same length, the path with the most subpaths will appear first. This aims to put the "busiest" paths to the front of the alternation. E.g., the list bad, bit, few, fig and fun will produce the pattern (?:f(?:ew|ig|un)|b(?:ad|it)).

When tracking is in use, no reduction is performed. Furthermore, no character classes are formed. The reason is that it becomes just too difficult to determine the original pattern. Consider the the two patterns pale and palm. These would be reduced to (?:pal[em]. The final character matches one of two possibilities. To resolve whether it matched an 'e' or 'm' would require a whole lot more housekeeping. Without character classes it becomes much easier.

Similarly, dogfood and seafood would form (?:dog|sea)food. When the pattern is being assembled, the tracking decision needs to be made at the end of the grouping, but the tail of the pattern has not yet been visited. Deferring things to make this work correctly is a vast hassle. Tracked patterns will therefore be bulkier than simple patterns.

There is an open bug on this issue:

http://rt.perl.org/rt3/Ticket/Display.html?id=32840

If this bug is ever resolved, tracking would become much easier to deal with (none of the match hassle would be required - you could just match like a regular RE and it would Just Work).

SEE ALSO

perlre

General information about Perl's regular expressions.

re

Specific information about use re 'eval'.

Regex::PreSuf

Regex::PreSuf takes a string and chops it itself into tokens of length 1. Since it can't deal with tokens of more than one character, it can't deal with meta-characters and thus no regular expressions. Which is the main reason why I wrote this module.

Regexp::Optimizer

Regexp::Optimizer produces regular expressions that are similar to those produced by R::A with reductions switched off. It's biggest drawback is that it is exponentially slower than Regexp::Assemble on very large sets of patterns.

Text::Trie

Text::Trie is well worth investigating. Tries can outperform very bushy (read: many alternations) patterns.

LIMITATIONS

Regexp::Assemble does not attempt to find common substrings. For instance, it will not collapse /cabababc/ down to /c(?:ab){3}c/. If there's a module out there that performs this sort of string analysis I'd like to know about it. But keep in mind that the algorithms that do this are very expensive: quadratic or worse.

Regexp::Assemble does not interpret meta-character modifiers. For instance, if the following two patterns are given: X\d and X\d+, it will not determine that \d can be matched by \d+. Instead, it will produce X(?:\d|\d+). Along a similar line of reasoning, it will not determine that Z and Z\d+ is equivalent to Z\d* (It will produce Z(?:\d+)? instead).

You can't remove a pattern that has been added to an object. You'll just have to start over again. Adding a pattern is difficult enough, I'd need a solid argument to convince me to add a remove method. If you need to do this you should read the documentation on the mutable and clone methods.

Tracking doesn't really work at all with 5.6.0. It works better in subsequent 5.6 releases. For maximum reliability, the use of a 5.8 release is strongly recommended.

Regexp::Assemble does not (yet)? employ the (?>...) construct.

The module does not produce POSIX-style regular expressions. This would be quite easy to add, if there was a demand for it.

BUGS

The algorithm used to assemble the regular expressions makes extensive use of mutually-recursive functions (i.e.: A calls B, B calls A, ...) For deeply similar expressions, it may be possible to provoke "Deep recursion" warnings.

The module has been tested extensively, and has an extensive test suite (that achieves 100% statement coverage), but you never know... A bug may manifest itself in two ways: creating a pattern that cannot be compiled, such as a\(bc), or a pattern that compiles correctly but that either matches things it shouldn't, or doesn't match things it should. It is assumed that Such problems will occur when the reduction algorithm encounters some sort of edge case. A temporary work-around is to disable reductions:

  my $pattern = $assembler->reduce(0)->re;

A discussion about implementation details and where bugs might lurk appears in the README file. If this file is not available locally, you should be able to find a copy on the Web at your nearest CPAN mirror.

Seriously, though, a number of people have been using this module to create expressions anywhere from 140Kb to 600Kb in size, and it seems to be working according to spec. Thus, I don't think there are any serious bugs remaining.

If you are feeling brave, extensive debugging traces are available.

Please report all bugs at http://rt.cpan.org/NoAuth/Bugs.html?Dist=Regexp-Assemble|rt.cpan.org

Make sure you include the output from the following two commands:

  perl -MRegexp::Assemble -le 'print Regexp::Assemble::VERSION'
  perl -V

There is a mailing list for the discussion of Regexp::Assemble. Subscription details are available at http://www.mongueurs.net/mailman/Regexp::Assemble.

ACKNOWLEDGEMENTS

This module grew out of work I did building access maps for Postfix, a modern SMTP mail transfer agent. See http://www.postfix.org/ for more information. I used Perl to build large regular expressions for blocking dynamic/residential IP addresses to cut down on spam and viruses. Once I had the code running for this, it was easy to start adding stuff to block really blatant spam subject lines, bogus HELO strings, spammer mailer-ids and more...

I presented the work at the French Perl Workshop in 2004, and the thing most people asked was whether the underlying mechanism for assembling the REs was available as a module. At that time it was nothing more that a twisty maze of scripts, all different. The interest shown indicated that a module was called for. I'd like to thank the people who showed interest. Hey, it's going to make my messy scripts smaller, in any case.

Thomas Drugeon has been a valuable sounding board for trying out new ideas. Jean Forget and Philippe Blayo looked over an early version. H. Merijn Brandt stopped over in Paris one evening, and discussed things over a few beers.

Nicholas Clark pointed out that while what this module does (?:c|sh)ould be done in perl's core, as per the 2004 TODO, he encouraged to me continue with the development of this module. In any event, this module allows one to gauge the difficulty of undertaking the endeavour in C. I'd rather gouge my eyes out with a blunt pencil.

Paul Johnson settled the question as to whether this module should live in the Regex:: namespace, or Regexp:: namespace. If you're not convinced, try running the following one-liner:

  perl -le 'print ref qr//'

Thanks also to broquaint on Perlmonks, who answered a question pertaining to traversing an early version of the underlying data structure used by the module (Sadly, that code is no more). bart and ysth also provided a couple of tips pertaining to the Correct Use of overloading.

AUTHOR

David Landgren, david@landgren.net

Copyright (C) 2004-2005. All rights reserved.

http://www.landgren.net/perl/

If you use this module, I'd love to hear about what you're using it for. If you want to be informed of updates, send me a note.

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.