The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
NAME
    re::engine::Plugin - API to write custom regex engines

VERSION
    Version 0.12

DESCRIPTION
    As of perl 5.9.5 it's possible to lexically replace perl's built-in
    regular expression engine with your own (see perlreapi and perlpragma).
    This module provides a glue interface to the relevant parts of the perl
    C API enabling you to write an engine in Perl instead of the C/XS
    interface provided by the core.

  The gory details
    Each regex in perl is compiled into an internal "REGEXP" structure (see
    perlreapi), this can happen either during compile time in the case of
    patterns in the format "/pattern/" or runtime for "qr//" patterns, or
    something inbetween depending on variable interpolation etc.

    When this module is loaded into a scope it inserts a hook into
    $^H{regcomp} (as described in perlreapi and perlpragma) to have each
    regexp constructed in its lexical scope handled by this engine, but it
    differs from other engines in that it also inserts other hooks into
    "%^H" in the same scope that point to user-defined subroutines to use
    during compilation, execution etc, these are described in "CALLBACKS"
    below.

    The callbacks (e.g. "comp") then get called with a re::engine::Plugin
    object as their first argument. This object provies access to perl's
    internal REGEXP struct in addition to its own state (e.g. a stash). The
    methods on this object allow for altering the "REGEXP" struct's internal
    state, adding new callbacks, etc.

CALLBACKS
    Callbacks are specified in the "re::engine::Plugin" import list as
    key-value pairs of names and subroutine references:

        use re::engine::Plugin (
            comp => sub {},
            exec => sub {},
            free => sub {},
        );

    To write a custom engine which imports your functions into the caller's
    scope use use the following snippet:

        package re::engine::Example;
        use re::engine::Plugin ();

        sub import
        {
            # Sets the caller's $^H{regcomp} his %^H with our callbacks
            re::engine::Plugin->import(
                comp => \&comp,
                exec => \&exec,
                free => \&free,
            );
        }

       *unimport = \&re::engine::Plugin::unimport;

        # Implementation of the engine
        sub comp { ... }
        sub exec { ... }
        sub free { ... }

        1;

  comp
        comp => sub {
            my ($rx) = @_;

            # return value discarded
        }

    Called when a regex is compiled by perl, this is always the first
    callback to be called and may be called multiple times or not at all
    depending on what perl sees fit at the time.

    The first argument will be a freshly constructed "re::engine::Plugin"
    object (think of it as $self) which you can interact with using the
    methods below, this object will be passed around the other callbacks and
    methods for the lifetime of the regex.

    Calling "die" or anything that uses it (such as "carp") here will not be
    trapped by an "eval" block that the pattern is in, i.e.

       use Carp 'croak';
       use re::engine::Plugin(
           comp => sub {
               my $rx = shift;
               croak "Your pattern is invalid"
                   unless $rx->pattern =~ /pony/;
           }
       );

       # Ignores the eval block
       eval { /you die in C<eval>, you die for real/ };

    This happens because the real subroutine call happens indirectly at
    compile time and not in the scope of the "eval" block. This is how
    perl's own engine would behave in the same situation if given an invalid
    pattern such as "/(/".

  exec
        my $ponies;
        use re::engine::Plugin(
            exec => sub {
                my ($rx, $str) = @_;

                # We always like ponies!
                if ($str =~ /pony/) {
                    $ponies++;
                    return 1;
                }

                # Failed to match
                return;
            }
        );

    Called when a regex is being executed, i.e. when it's being matched
    against something. The scalar being matched against the pattern is
    available as the second argument ($str) and through the str method. The
    routine should return a true value if the match was successful, and a
    false one if it wasn't.

    This callback can also be specified on an individual basis with the
    "callbacks" method.

  free
        use re::engine::Plugin(
            free => sub {
                my ($rx) = @_;

                say 'matched ' ($ponies // 'no')
                    . ' pon' . ($ponies > 1 ? 'ies' : 'y');

                return;
            }
        );

    Called when the regexp structure is freed by the perl interpreter. Note
    that this happens pretty late in the destruction process, but still
    before global destruction kicks in. The only argument this callback
    receives is the "re::engine::Plugin" object associated with the regexp,
    and its return value is ignored.

    This callback can also be specified on an individual basis with the
    "callbacks" method.

METHODS
  str
        "str" =~ /pattern/;
        # in comp/exec/methods:
        my $str = $rx->str;

    The last scalar to be matched against the pattern or "undef" if there
    hasn't been a match yet.

    perl's own engine always stringifies the scalar being matched against a
    given pattern, however a custom engine need not have such restrictions.
    One could write a engine that matched a file handle against a pattern or
    any other complex data structure.

  pattern
    The pattern that the engine was asked to compile, this can be either a
    classic Perl pattern with modifiers like "/pat/ix" or "qr/pat/ix" or an
    arbitary scalar. The latter allows for passing anything that doesn't fit
    in a string and five modifier characters, such as hashrefs, objects,
    etc.

  mod
        my %mod = $rx->mod;
        say "has /ix" if %mod =~ 'i' and %mod =~ 'x';

    A key-value pair list of the modifiers the pattern was compiled with.
    The keys will zero or more of "imsxp" and the values will be true values
    (so that you don't have to write "exists").

    You don't get to know if the "eogc" modifiers were attached to the
    pattern since these are internal to perl and shouldn't matter to regexp
    engines.

  stash
        comp => sub { shift->stash( [ 1 .. 5 ) },
        exec => sub { shift->stash }, # Get [ 1 .. 5 ]

    Returns or sets a user defined stash that's passed around as part of the
    $rx object, useful for passing around all sorts of data between the
    callback routines and methods.

  minlen
        $rx->minlen($num);
        my $minlen = $rx->minlen // "not set";

    The minimum "length" a string must be to match the pattern, perl will
    use this internally during matching to check whether the stringified
    form of the string (or other object) being matched is at least this
    long, if not the regexp engine in effect (that means you!) will not be
    called at all.

    The length specified will be used as a a byte length (using SvPV), not a
    character length.

  nparens
  gofs
  callbacks
        # A dumb regexp engine that just tests string equality
        use re::engine::Plugin comp => sub {
            my ($re) = @_;

            my $pat = $re->pattern;

            $re->callbacks(
                exec => sub {
                    my ($re, $str) = @_;
                    return $pat eq $str;
                },
            );
        };

    Takes a list of key-value pairs of names and subroutines, and replace
    the callback currently attached to the regular expression for the type
    given as the key by the code reference passed as the corresponding
    value.

    The only valid keys are currently "exec" and "free". See "exec" and
    "free" for more details about these callbacks.

  num_captures
        $re->num_captures(
            FETCH => sub {
                my ($re, $paren) = @_;

                return "value";
            },
            STORE => sub {
                my ($re, $paren, $rhs) = @_;

                # return value discarded
            },
            LENGTH => sub {
                my ($re, $paren) = @_;

                return 123;
            },
        );

    Takes a list of key-value pairs of names and subroutines that implement
    numbered capture variables. "FETCH" will be called on value retrieval
    ("say $1"), "STORE" on assignment ("$1 = "ook"") and "LENGTH" on "length
    $1".

    The second paramater of each routine is the paren number being
    requested/stored, the following mapping applies for those numbers:

        -2 => $` or ${^PREMATCH}
        -1 => $' or ${^POSTMATCH}
         0 => $& or ${^MATCH}
         1 => $1
         # ...

    Assignment to capture variables makes it possible to implement something
    like Perl 6 ":rw" semantics, and since it's possible to make the capture
    variables return any scalar instead of just a string it becomes possible
    to implement Perl 6 match object semantics (to name an example).

  named_captures
    TODO: implement

    perl internals still needs to be changed to support this but when it's
    done it'll allow the binding of "%+" and "%-" and support the Tie::Hash
    methods FETCH, STORE, DELETE, CLEAR, EXISTS, FIRSTKEY, NEXTKEY and
    SCALAR.

CONSTANTS
  "REP_THREADSAFE"
    True iff the module could have been built with thread-safety features
    enabled.

  "REP_FORKSAFE"
    True iff this module could have been built with fork-safety features
    enabled. This will always be true except on Windows where it's false for
    perl 5.10.0 and below.

TAINTING
    The only way to untaint an existing variable in Perl is to use it as a
    hash key or referencing subpatterns from a regular expression match (see
    perlsec), the latter only works in perl's regex engine because it
    explicitly untaints capture variables which a custom engine will also
    need to do if it wants its capture variables to be untanted.

    There are basically two ways to go about this, the first and obvious one
    is to make use of Perl'l lexical scoping which enables the use of its
    built-in regex engine in the scope of the overriding engine's callbacks:

        use re::engine::Plugin (
            exec => sub {
                my ($re, $str) = @_; # $str is tainted

                $re->num_captures(
                    FETCH => sub {
                        my ($re, $paren) = @_;

                        # This is perl's engine doing the match
                        $str =~ /(.*)/;

                        # $1 has been untainted
                        return $1;
                    },
                );
            },
        );

    The second is to use something like Taint::Util which flips the taint
    flag on the scalar without invoking the perl's regex engine:

        use Taint::Util;
        use re::engine::Plugin (
            exec => sub {
                my ($re, $str) = @_; # $str is tainted

                $re->num_captures(
                    FETCH => sub {
                        my ($re, $paren) = @_;

                        # Copy $str and untaint the copy
                        untaint(my $ret = $str);

                        # Return the untainted value
                        return $ret;
                    },
                );
            },
        );

    In either case a regex engine using perl's regex api or this module is
    responsible for how and if it untaints its variables.

SEE ALSO
    perlreapi, Taint::Util

TODO & CAVEATS
    *here be dragons*

    *   Engines implemented with this module don't support "s///" and "split
        //", the appropriate parts of the "REGEXP" struct need to be wrapped
        and documented.

    *   Still not a complete wrapper for perlreapi in other ways, needs
        methods for some "REGEXP" struct members, some callbacks aren't
        implemented etc.

    *   Support overloading operations on the "qr//" object, this allow
        control over the of "qr//" objects in a manner that isn't limited by
        "wrapped"/"wraplen".

            $re->overload(
                '""'  => sub { ... },
                '@{}' => sub { ... },
                ...
            );

    *   Support the dispatch of arbitary methods from the re::engine::Plugin
        qr// object to user defined subroutines via AUTOLOAD;

            package re::engine::Plugin;
            sub AUTOLOAD
            {
                our $AUTOLOAD;
                my ($name) = $AUTOLOAD =~ /.*::(.*?)/;
                my $cv = getmeth($name); # or something like that
                goto &$cv;
            }

            package re::engine::SomeEngine;

            sub comp
            {
                my $re = shift;

                $re->add_method( # or something like that
                    foshizzle => sub {
                        my ($re, @arg) = @_; # re::engine::Plugin, 1..5
                    },
                );
            }

            package main;
            use re::engine::SomeEngine;
            later:

            my $re = qr//;
            $re->foshizzle(1..5);

    *   Implement the dupe callback, test this on a threaded perl (and learn
        how to use threads and how they break the current model).

    *   Allow the user to specify ->offs either as an array or a packed
        string. Can pack() even pack I32? Only IV? int?

    *   Add tests that check for different behavior when curpm is and is not
        set.

    *   Add tests that check the refcount of the stash and other things I'm
        mucking with, run valgrind and make sure everything is destroyed
        when it should.

    *   Run the debugger on the testsuite and find cases when the intuit and
        checkstr callbacks are called. Write wrappers around them and add
        tests.

DEPENDENCIES
    perl 5.10.

    A C compiler. This module may happen to build with a C++ compiler as
    well, but don't rely on it, as no guarantee is made in this regard.

    XSLoader (standard since perl 5.6.0).

BUGS
    Please report any bugs that aren't already listed at
    <http://rt.cpan.org/Dist/Display.html?Queue=re-engine-Plugin> to
    <http://rt.cpan.org/Public/Bug/Report.html?Queue=re-engine-Plugin>

AUTHORS
    Ævar Arnfjörð Bjarmason "<avar at cpan.org>"

    Vincent Pit "<perl at profvince.com>"

LICENSE
    Copyright 2007,2008 Ævar Arnfjörð Bjarmason.

    Copyright 2009,2010,2011,2013,2014,2015 Vincent Pit.

    This program is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.