NAME

re::engine::Plugin - API to write custom regex engines

VERSION

Version 0.09

DESCRIPTION

As of perl 5.9.5 it's possible to lexically replace perl's built-in regular expression engine with your own (see perlreapi and perlpragma). This module provides a glue interface to the relevant parts of the perl C API enabling you to write an engine in Perl instead of the C/XS interface provided by the core.

The gory details

Each regex in perl is compiled into an internal REGEXP structure (see perlreapi). This can happen at compile time for patterns of the form /pattern/, at run time for qr// patterns, or somewhere in between, depending on variable interpolation and the like.

When this module is loaded into a scope it inserts a hook into $^H{regcomp} (as described in perlreapi and perlpragma) so that each regexp constructed in that lexical scope is handled by this engine. It differs from other engines in that it also inserts other hooks into %^H in the same scope. These hooks point to user-defined subroutines to use during compilation, execution, etc., and are described in "CALLBACKS" below.

The callbacks (e.g. "comp") then get called with a re::engine::Plugin object as their first argument. This object provides access to perl's internal REGEXP struct in addition to its own state (e.g. a stash). The methods on this object allow for altering the REGEXP struct's internal state, adding new callbacks, etc.

CALLBACKS

Callbacks are specified in the re::engine::Plugin import list as key-value pairs of names and subroutine references:

    use re::engine::Plugin (
        comp => sub {},
        exec => sub {},
    );

To write a custom engine which imports your functions into the caller's scope, use the following snippet:

    package re::engine::Example;
    use re::engine::Plugin ();

    sub import
    {
        # Sets the caller's $^H{regcomp} and fills its %^H with our callbacks
        re::engine::Plugin->import(
            comp => \&comp,
            exec => \&exec,
        );
    }

    *unimport = \&re::engine::Plugin::unimport;

    # Implementation of the engine
    sub comp { ... }
    sub exec { ... }

    1;

comp

    comp => sub {
        my ($rx) = @_;

        # return value discarded
    }

Called when a regex is compiled by perl. This is always the first callback to be called, and it may be called multiple times or not at all, depending on what perl sees fit at the time.

The first argument will be a freshly constructed re::engine::Plugin object (think of it as $self) which you can interact with using the methods below. This object will be passed around the other callbacks and methods for the lifetime of the regex.

Calling die or anything that uses it (such as croak) here will not be trapped by an eval block that the pattern is in, i.e.

    use Carp 'croak';
    use re::engine::Plugin(
        comp => sub {
            my $rx = shift;
            croak "Your pattern is invalid"
                unless $rx->pattern =~ /pony/;
        }
    );

    # Ignores the eval block
    eval { /you die in C<eval>, you die for real/ };

This happens because the real subroutine call happens indirectly at compile time and not in the scope of the eval block. This is how perl's own engine would behave in the same situation if given an invalid pattern such as /(/.

exec

    exec => sub {
        my ($rx, $str) = @_;

        # We always like ponies!
        return 1 if $str =~ /pony/;

        # Failed to match
        return;
    }

Called when a regex is being executed, i.e. when it's being matched against something. The scalar being matched against the pattern is available as the second argument ($str) and through the str method. The routine should return a true value if the match was successful, and a false one if it wasn't.

This callback can also be specified on an individual basis with the "callbacks" method.

METHODS

str

    "str" =~ /pattern/;
    # in comp/exec/methods:
    my $str = $rx->str;

The last scalar to be matched against the pattern, or undef if there hasn't been a match yet.

perl's own engine always stringifies the scalar being matched against a given pattern, but a custom engine need not have such a restriction. One could write an engine that matched a file handle or any other complex data structure against a pattern.

pattern

The pattern that the engine was asked to compile. This can be either a classic Perl pattern with modifiers, like /pat/ix or qr/pat/ix, or an arbitrary scalar. The latter allows for passing anything that doesn't fit in a string and five modifier characters, such as hashrefs, objects, etc.

mod

    my %mod = $rx->mod;
    say "has /ix" if $mod{i} and $mod{x};

A key-value pair list of the modifiers the pattern was compiled with. The keys will be zero or more of imsxp and the values will be true values (so that you don't have to write exists).

You don't get to know if the eogc modifiers were attached to the pattern since these are internal to perl and shouldn't matter to regexp engines.

stash

    comp => sub { shift->stash( [ 1 .. 5 ] ) },
    exec => sub { shift->stash }, # Get [ 1 .. 5 ]

Returns or sets a user-defined stash that's passed around as part of the $rx object. It is useful for passing all sorts of data between the callback routines and methods.

minlen

    $rx->minlen($num);
    my $minlen = $rx->minlen // "not set";

The minimum length a string must be to match the pattern. perl will use this internally during matching to check whether the stringified form of the string (or other object) being matched is at least this long. If it is not, the regexp engine in effect (that means you!) will not be called at all.

The length specified will be used as a byte length (using SvPV), not a character length.
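The difference between byte length and character length can be illustrated with a small standalone sketch. It does not use this module at all. It assumes that the byte length perl compares against minlen corresponds to the size of the scalar's internal buffer, which bytes::length reports; the sample string is made up for illustration.

```perl
use strict;
use warnings;
use bytes ();    # makes bytes::length available without enabling byte semantics

# One character above U+00FF forces perl to store the string internally
# as UTF-8, where U+263A takes three bytes.
my $str = "pony \x{263A}";

my $chars = length $str;            # character length
my $bytes = bytes::length($str);    # length of the internal byte buffer

printf "chars=%d bytes=%d\n", $chars, $bytes;
```

Under that assumption, an engine working on characters would want to pick a minlen that is not larger than the shortest byte representation of a string it can match.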

nparens

gofs

callbacks

    # A dumb regexp engine that just tests string equality
    use re::engine::Plugin comp => sub {
        my ($re) = @_;

        my $pat = $re->pattern;

        $re->callbacks(
            exec => sub {
                my ($re, $str) = @_;
                return $pat eq $str;
            },
        );
    };

Takes a list of key-value pairs of names and subroutines, and replaces the callback currently attached to the regular expression for the type given as the key with the code reference passed as the corresponding value.

The only valid key is currently exec. See "exec" for more details about this callback.

num_captures

    $re->num_captures(
        FETCH => sub {
            my ($re, $paren) = @_;

            return "value";
        },
        STORE => sub {
            my ($re, $paren, $rhs) = @_;

            # return value discarded
        },
        LENGTH => sub {
            my ($re, $paren) = @_;

            return 123;
        },
    );

Takes a list of key-value pairs of names and subroutines that implement numbered capture variables. FETCH will be called on value retrieval (say $1), STORE on assignment ($1 = "ook") and LENGTH on length $1.

The second parameter of each routine is the paren number being requested or stored. The following mapping applies for those numbers:

    -2 => $` or ${^PREMATCH}
    -1 => $' or ${^POSTMATCH}
     0 => $& or ${^MATCH}
     1 => $1
     # ...

Assignment to capture variables makes it possible to implement something like Perl 6 :rw semantics, and since it's possible to make the capture variables return any scalar instead of just a string it becomes possible to implement Perl 6 match object semantics (to name an example).
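The paren number mapping above can be sketched as a plain Perl closure that does not depend on this module. The helper make_fetch, its argument order and the sample offsets are hypothetical and only serve the illustration. They are not part of re::engine::Plugin.

```perl
use strict;
use warnings;

# Hypothetical helper: given the subject string, the start and end offsets
# of the overall match and the list of captured groups, build a callback
# with the same calling convention as FETCH.
sub make_fetch {
    my ($str, $from, $to, @groups) = @_;

    return sub {
        my ($re, $paren) = @_;

        return substr($str, 0, $from)           if $paren == -2;  # prematch
        return substr($str, $to)                if $paren == -1;  # postmatch
        return substr($str, $from, $to - $from) if $paren == 0;   # whole match
        return $groups[$paren - 1];                               # $1, $2, ...
    };
}

# Pretend an engine matched "pon" at offsets 7..10 of the subject
# and captured "p" as group 1.
my $fetch = make_fetch("I like ponies", 7, 10, "p");

# The first argument would normally be the regex object; undef stands in here.
print join("|", map { $fetch->(undef, $_) } -2, -1, 0, 1), "\n";
```

In an exec callback, such a closure could be installed with $re->num_captures(FETCH => make_fetch(...)), once the engine knows where it matched.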

named_captures

TODO: implement

perl internals still need to be changed to support this, but when it's done it'll allow the binding of %+ and %- and support the Tie::Hash methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY and SCALAR.

CONSTANTS

REP_THREADSAFE

True iff the module could have been built with thread-safety features enabled.

REP_FORKSAFE

True iff this module could have been built with fork-safety features enabled. This will always be true except on Windows where it's false for perl 5.10.0 and below.

TAINTING

The only ways to untaint an existing variable in Perl are to use it as a hash key or to reference subpatterns from a regular expression match (see perlsec). The latter only works in perl's regex engine because it explicitly untaints capture variables, which a custom engine will also need to do if it wants its capture variables to be untainted.

There are basically two ways to go about this. The first and obvious one is to make use of Perl's lexical scoping, which enables the use of its built-in regex engine in the scope of the overriding engine's callbacks:

    use re::engine::Plugin (
        exec => sub {
            my ($re, $str) = @_; # $str is tainted

            $re->num_captures(
                FETCH => sub {
                    my ($re, $paren) = @_;

                    # This is perl's engine doing the match
                    $str =~ /(.*)/;

                    # $1 has been untainted
                    return $1;
                },
            );
        },
    );

The second is to use something like Taint::Util, which flips the taint flag on the scalar without invoking perl's regex engine:

    use Taint::Util;
    use re::engine::Plugin (
        exec => sub {
            my ($re, $str) = @_; # $str is tainted

            $re->num_captures(
                FETCH => sub {
                    my ($re, $paren) = @_;

                    # Copy $str and untaint the copy
                    untaint(my $ret = $str);

                    # Return the untainted value
                    return $ret;
                },
            );
        },
    );

In either case, a regex engine using perl's regex API or this module is responsible for how and if it untaints its variables.

SEE ALSO

perlreapi, Taint::Util

TODO & CAVEATS

here be dragons

DEPENDENCIES

perl 5.10.

A C compiler. This module may happen to build with a C++ compiler as well, but don't rely on it, as no guarantee is made in this regard.

XSLoader (standard since perl 5.006).

BUGS

Please report any bugs that aren't already listed at http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin to http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin

AUTHORS

Ævar Arnfjörð Bjarmason <avar at cpan.org>

Vincent Pit <perl at profvince.com>

LICENSE

Copyright 2007,2008 Ævar Arnfjörð Bjarmason.

Copyright 2009,2010,2011 Vincent Pit.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
