NAME

Regexp::English - Perl module to create regular expressions more verbosely

SYNOPSIS

        use Regexp::English;

        my $re = Regexp::English
                -> start_of_line
                -> literal('Flippers')
                -> literal(':')
                -> optional
                        -> whitespace_char
                -> end
                -> remember
                        -> multiple
                                -> digit;
        
        while (<INPUT>) {
                if (my $match = $re->match($_)) {
                        print "$match\n";
                }
        }

DESCRIPTION

Regexp::English provides an alternate regular expression syntax, one that is slightly more verbose than the standard mechanisms. In addition, it adds a few convenient features, like incremental expression building and bound captures.

Nearly every regular expression feature that Regexp::English supports is available through a method, though some are also (or only) available as functions. These methods fall roughly into four categories: characters, quantifiers, groupings, and miscellaneous. The division wouldn't be so rough if the last had a better name.

All methods return the Regexp::English object, so method calls can be chained, as in the example above. Though there is a new() method, any character method can be used to create an object, as can remember().

Matches are performed with the match() method. Alternatively, if a Regexp::English object is used as if it were a compiled regular expression, it will be automagically compiled behind the scenes.
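As a rough illustration of how chaining and compile-on-use can fit together, here is a minimal sketch built on a made-up Toy::Builder class. Everything in it (the class name, its methods, and the use of the overload pragma) is an assumption for demonstration purposes, not the actual Regexp::English implementation.

```perl
use strict;
use warnings;

# Hypothetical toy class, NOT Regexp::English itself. It only shows the two
# ideas described above: every method returns the object, and the pattern is
# compiled when the object is used as a string or a regular expression.
package Toy::Builder;

# Stringifying the object compiles the pattern collected so far.
use overload '""' => sub { '' . $_[0]->compile }, fallback => 1;

sub new { my $class = shift; return bless { parts => [] }, $class }

# Each builder method appends a pattern fragment and returns $self.
sub literal {
    my ($self, $text) = @_;
    push @{ $self->{parts} }, quotemeta $text;
    return $self;
}

sub digits {
    my $self = shift;
    push @{ $self->{parts} }, '\d+';
    return $self;
}

sub compile {
    my $self   = shift;
    my $source = join '', @{ $self->{parts} };
    return qr/$source/;
}

package main;

# Chained calls build the pattern; interpolating $re compiles it.
my $re      = Toy::Builder->new->literal('id=')->digits;
my $matched = 'user id=42' =~ /$re/ ? 1 : 0;

print "pattern: $re\n";
print "matched: $matched\n";
```

Returning the object from every method is what makes the chained style possible; the overload hook is one plausible way to get the "compiled behind the scenes" behaviour.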

Characters

Character methods correspond to standard regular expression characters and metacharacters, for the most part. As a little bit of syntactic sugar, most of these methods have plurals, negations, and negated plurals; the list below makes this clearer. Though these are designed to be called on a new Regexp::English object while building up larger regular expressions, they may also be used as class methods to obtain regular expression atoms, which can then be used in larger regular expressions. This isn't entirely pretty, but it ought to work just about everywhere.

  • literal()

    Matches the provided literal string. Note that anything provided will be passed through quotemeta() automatically. If you're getting strange results, it's probably because of this.

  • class()

    Creates and matches a character class of the provided characters. Note that there is currently no validation of the character class, so you can create an uncompilable regular expression if you're not careful.

  • word_char()

    Matches any word character, respecting the current locale. By default, this matches alphanumerics and the underscore, corresponding to the \w token. The related word_chars() matches at least one word character, while non_word_char() and non_word_chars() match one, or at least one, character that is not an alphanumeric or underscore, respectively.

  • whitespace_char()

    Matches any whitespace character, corresponding to the \s token. The corresponding plural is whitespace_chars(), with negations of non_whitespace_char() and non_whitespace_chars().

  • digit()

    Matches any numeric digit, corresponding to the \d token. The plural is digits(), with negations of non_digit() and non_digits().

  • tab()

    Matches a tab character (\t). The plural is tabs(), and there is no negation.

  • newline()

    Matches a newline character (\n). The plural is newlines(), and there is no negation. This should imply the /s modifier, but it does not yet do so.

  • carriage_return()

    Matches a carriage return character (\r). The plural is carriage_returns(), and there is no negation.

  • form_feed()

    Matches a form feed character (\f). The plural is form_feeds(), and there is no negation.

  • alarm()

    Matches an alarm character (\a). The plural is alarms(), and there is no negation.

  • escape()

    Matches an escape character (\e). The plural is escapes(), and there is no negation.

  • start_of_line()

    Matches the start of a line, just like the ^ anchor.

  • beginning_of_string()

    Matches the beginning of a string, much like the ^ anchor.

  • end_of_line()

    Matches the end of a line, just like the $ anchor.

  • end_of_string()

    Matches the end of a string, much like the $ anchor, treating newlines appropriately depending on the /s or /m modifier.

  • very_end_of_string()

    Matches the very end of a string, just like the \z token. This does not ignore a trailing newline (if it exists).

  • end_of_previous_match()

    Matches the point at which a previous match ended, in a regular expression matched globally with the /g modifier. This corresponds to the \G token and is related to pos().

  • word_boundary()

    Matches the zero-width boundary between a word character and a non-word character, corresponding to the \b token. There is no plural, but the negation is non_word_boundary().
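Several of the entries above are easier to see in action. The following sketch uses only core Perl regular expressions, not Regexp::English, to show the underlying tokens: the effect of quotemeta() on a literal, the difference between $, \Z and \z, and how \G continues from the previous match. The mapping from each token back to a method name follows the descriptions above; the exact patterns the module generates are not shown here.

```perl
use strict;
use warnings;

# literal(): quotemeta() escapes metacharacters, so '.' matches only a dot.
my $literal        = quotemeta 'a.c';
my $dot_is_literal = ( 'a.c' =~ /\A$literal\z/ && 'abc' !~ /\A$literal\z/ ) ? 1 : 0;

# End anchors, tried against a string with a trailing newline.
my $line    = "flippers\n";
my $dollar  = $line =~ /flippers$/  ? 1 : 0;   # $ allows one trailing newline
my $big_z   = $line =~ /flippers\Z/ ? 1 : 0;   # so does \Z
my $small_z = $line =~ /flippers\z/ ? 1 : 0;   # \z requires the true end

# \G anchors each attempt at the position where the last /g match ended.
my $text = '12ab';
my @digits;
push @digits, $1 while $text =~ /\G(\d)/g;     # stops at the first non-digit

print "dot is literal: $dot_is_literal\n";
print "\$: $dollar  \\Z: $big_z  \\z: $small_z\n";
print "digits: @digits\n";
```

If a literal() call behaves strangely, checking what quotemeta() does to the string, as in the first lines, is a reasonable first step.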

Quantifiers

Quantifiers provide a mechanism to specify how many items to expect, in general or specific terms. These may be exported into the calling package's namespace with the :standard argument to the use() call, but the preferred interface is to use them as method calls. This is slightly more complicated, but cleaner conceptually. The interface may change slightly in the future, if someone comes up with something even better.

By default, quantifiers operate on the next arguments, not the previous ones. (It is much easier to program this way.) For example, to match multiple digits, one might write:

        my $re = Regexp::English->new
                ->multiple
                        ->digits;

The indentation should make this clearer.

Quantifiers persist until a match is attempted or the corresponding end() method is called. As match() calls end() internally, all active quantifiers will be closed when a match is attempted. There is currently no way to re-open a quantifier even if you add to a Regexp::English object. This is a non-trivial problem (as the author understands it), and there's no good solution for it in normal regular expressions anyway.

If you have imported the quantifiers, you can pass the quantifiables as arguments:

        use Regexp::English qw( :standard );

        my $re = Regexp::English->new
                ->multiple('a');

The open quantifier will automatically be closed for you. Though this syntax is slightly more visually appealing, it does involve exporting quite a few methods into your namespace, and is thus not the default. Besides that, if you get in this habit, you'll eventually have to use the :all tag. Better to get used to the method calls, or to push Vahe to write Regexp::Easy. :)

  • zero_or_more()

    Matches as many items as possible. Note that "possible" includes matching zero items. Note also that "item" means "whatever it's told to match". By default, this is greedy.

  • multiple()

    Matches at least one item, but as many as possible. By default, this is greedy.

  • optional()

    Marks an item as optional -- that is, the pattern will match with or without the item.

  • minimal()

    This quantifier modifies either zero_or_more() or multiple(), and disables greediness, asking for as few matches as possible.
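For readers who think in standard regular expression syntax, the quantifiers above behave like the familiar *, +, ? and +? operators. The sketch below uses plain Perl patterns only; pairing each one with a Regexp::English method name is based on the descriptions above and is an assumption about the generated pattern, not a dump of it.

```perl
use strict;
use warnings;

my $text = 'aaa';

# multiple(): at least one item, as many as possible (greedy).
my ($greedy) = $text =~ /(a+)/;

# minimal() applied to multiple(): at least one item, as few as possible.
my ($minimal) = $text =~ /(a+?)/;

# zero_or_more(): succeeds even when there is nothing to match.
my ($star) = 'bbb' =~ /(a*)/;

# optional(): the pattern matches with or without the item.
my $optional = ( 'Flippers:7' =~ /:\s?\d/ && 'Flippers: 7' =~ /:\s?\d/ ) ? 1 : 0;

print "greedy:   '$greedy'\n";
print "minimal:  '$minimal'\n";
print "star:     '$star'\n";
print "optional: $optional\n";
```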

Groupings

Groupings function much the same as quantifiers, though they have semantic differences. The most important similarity is that they can be used with the function or the method interface. Obviously, the method interface is preferable, but see the documentation for end() for more information.

Groupings generally correspond to advanced Perl regular expression features like lookaheads and lookbehinds. If you find yourself using them on a regular basis, you're probably ready to graduate to hand-rolled regular expressions (or to contribute code to improve Regexp::English :).

  • comment()

    Marks the item as a comment, which has no bearing on the match and really doesn't give you anything here either. Don't let that stop you, though.

  • group()

    Groups items together (often to use a single quantifier on them) without actually capturing them. This isn't very useful either, because the Quantifiers handle this for you.

  • followed_by()

    Marks the item as a zero-width positive look-ahead assertion. This means that the pattern must match the item after the previous bits, but the item is not considered part of the matched string.

  • not_followed_by()

    Marks the item as a zero-width negative look-ahead assertion. This means that the pattern must not match the item after the previous bits, but the item is still not considered part of the matched string.

  • after()

    Marks the item as a zero-width positive look-behind assertion. This means the pattern must match the item before the following bits. Super funky, and may have subtle bugs -- look-behinds tend to need fixed width items, and Regexp::English currently doesn't enforce this.

  • not_after()

    Marks the item as a zero-width negative look-behind assertion. This means the pattern must not match the item before the following bits. This is also susceptible to the fixed-width rule.
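The four assertions above map onto Perl's (?=...), (?!...), (?<=...) and (?<!...) constructs. This sketch shows them in plain Perl, without Regexp::English, so the sample strings and variable names are illustrative assumptions only. Note that both look-behinds use a fixed-width item, in keeping with the caveat above.

```perl
use strict;
use warnings;

my $words  = 'foobar foobaz';
my $prices = 'cost: $15, 20 items, $7';

# followed_by(): 'foo' must be followed by 'baz', which is not consumed.
my $followed = $words =~ /foo(?=baz)/ ? $-[0] : -1;

# not_followed_by(): 'foo' must not be followed by 'bar'.
my $not_followed = $words =~ /foo(?!bar)/ ? $-[0] : -1;

# after(): digits that come directly after a dollar sign.
my @after = $prices =~ /(?<=\$)(\d+)/g;

# not_after(): whole numbers that do not come after a dollar sign.
my @not_after = $prices =~ /(?<!\$)\b(\d+)/g;

print "followed_by matched at offset $followed\n";
print "not_followed_by matched at offset $not_followed\n";
print "after: @after\n";
print "not_after: @not_after\n";
```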

Miscellaneous

These subroutines don't really fit anywhere else. They're useful, and mostly cool.

  • new()

    Creates a new Regexp::English object. Though some methods do this for you automagically if you need one, this is the best way to start a regular expression.

  • match()

    Compiles the Regexp::English object and attempts to match it against a passed-in string. If there are any captured variables, they'll be returned. Otherwise, a true or false value will be returned.

  • remember()

    Causes Regexp::English to remember an item which will then be returned or otherwise made available when calling match(). Normally, these items are returned from match() in order of their declaration within the regular expression. They may also be bound to variables. Pass in a reference to a scalar as the first argument and the scalar will automagically be populated with the matched value on each subsequent match. That means you can write:

            my ($first, $second);
    
            my $re = Regexp::English->new
                    ->remember(\$first)
                            ->multiple('a')
                            ->remember(\$second)
                                    ->word_char;
    
            foreach my $match (qw( aab aaac ad )) {
                    if ($re->match($match)) {
                            print "$second\t$first\n";
                    }
            }

    This will print:

            b       aab
            c       aaac
            d       ad

    Pretty cool, no?

  • end()

    Ends an open Quantifier or Grouping. If you pass no arguments, it will end only the most recently opened item. If you pass a numeric argument, it will end that many recently opened items. It does not currently check to see if you pass in a number, so only pass in numbers, or be prepared for odd results.

  • compile()

    Compiles and returns the pattern-in-progress, ending any and all open Quantifiers or Groupings. This uses qr//. Note that this method is called if you attempt anything that could stringify the object. This appears to include treating a Regexp::English object as a regular expression. Nifty.

  • or()

    Provides alternation capabilities. This has been improved in version 0.21 to the point where it is actually useful. The preferred interface is very similar to Grouping calls:

            my $re = Regexp::English->new
                    ->group
                            ->digit
                            ->or
                            ->word_char;

    Wrapping the entire alternation in group() or some other Grouping method is highly recommended, as you might want to use a Quantifier or something more complex:

            my $re = Regexp::English->new
                    ->remember
                                    ->literal('root beer')
                            ->or
                                    ->literal('milkshake')
                    ->end;

    If you find this onerous, you can also pass arguments to or(), which will be grouped together in non-capturing parentheses. Note that you will have to import the appropriate functions or fully qualify them. Calling these functions as class methods is not currently guaranteed to work reliably. It may never be guaranteed to work reliably. Properly indented, the method interface looks nicer anyway, but you have two options:

            my $functionre = Regexp::English->new
                    ->or( Regexp::English::digit, Regexp::English::word_char );
            
            my $classmethodre = Regexp::English->new
                    ->or( Regexp::English->digit, Regexp::English->word_char );

  • debug()

    Returns the regular expression so far. This can be handy if you know what you're doing.
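The advice under or() to wrap an alternation in a Grouping is easy to motivate with ordinary regular expressions, because alternation binds more loosely than anchors or neighbouring atoms. The following plain-Perl sketch (no Regexp::English involved; the patterns and input are illustrative assumptions) contrasts an ungrouped and a grouped alternation.

```perl
use strict;
use warnings;

# Without grouping, the anchors attach to one alternative each:
# this means "starts with 'root beer'" OR "ends with 'milkshake'".
my $ungrouped = qr/^root beer|milkshake$/;

# With a non-capturing group, both anchors apply to the whole alternation.
my $grouped = qr/^(?:root beer|milkshake)$/;

my $input = 'root beer float';

my $loose  = $input =~ $ungrouped ? 1 : 0;   # matches: the string starts with 'root beer'
my $strict = $input =~ $grouped   ? 1 : 0;   # fails: the string is neither drink exactly

print "ungrouped: $loose\n";
print "grouped:   $strict\n";
```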

EXPORTS

By default, nothing is exported. This is an object-oriented module, and this is how it should be. You can import the Quantifier and Grouping subroutines by providing the :standard argument to the use() line, and the Character methods with the :chars tag.

        use Regexp::English qw( :standard :chars );

You could also use the :all tag:

        use Regexp::English qw( :all );

This interface may change slightly in the future. If you find yourself exporting things, you should look into Vahe Sarkissian's upcoming Regexp::Easy module. This is probably news to him, too. :)

TODO

  • Add not()

  • More error checking

  • Add a few tests here and there

  • Add POSIX character classes ?

  • Delegate to Regexp::Common ?

  • Allow other language backends (probably just add documentation for this)

  • Improve documentation

AUTHOR

chromatic, <chromatic@wgz.org>, with many suggestions from Vahe Sarkissian <vsarkiss@pobox.com> and Damian Conway <damian@cs.monash.edu.au>

COPYRIGHT

Copyright 2001-2002 by chromatic.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html

SEE ALSO

perlre