NAME

CharClass::Matcher -- Generate C macros that match character classes efficiently

SYNOPSIS

    perl Porting/regcharclass.pl

DESCRIPTION

Dynamically generates macros for detecting special charclasses in latin-1, utf8, and codepoint forms. Macros can be set to return the length (in bytes) of the matched codepoint, or the codepoint itself.

To regenerate regcharclass.h, run this script from the root of the perl source tree. No arguments are necessary.

Using WHATEVER as an example, the following macros will be produced:

is_WHATEVER(s,is_utf8)
is_WHATEVER_safe(s,e,is_utf8)

Do a lookup as appropriate based on the is_utf8 flag. When possible, comparisons involving octets < 128 are done before checking the is_utf8 flag, hopefully saving time.
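As a rough illustration of the dispatch described above, here is a hand-written sketch for a hypothetical class DEMO containing TAB, SPACE and U+00A0. The class, the macro body and the convention of returning 0 for "no match" are assumptions made for this example; this is not actual output of regcharclass.pl.

```c
typedef unsigned char U8; /* stand-in for perl's U8 type */

/* Hypothetical class DEMO = { TAB (0x09), SPACE (0x20), U+00A0 }.
 * Yields the number of bytes matched, or 0 if there is no match.
 * The two ASCII members are tested first, because they are encoded
 * identically in latin-1 and UTF-8; only then is is_utf8 consulted. */
#define is_DEMO(s, is_utf8)                                               \
    ( ((const U8 *)(s))[0] == 0x09 || ((const U8 *)(s))[0] == 0x20        \
      ? 1                                                                 \
      : (is_utf8)                                                         \
        ? ( ((const U8 *)(s))[0] == 0xC2 && ((const U8 *)(s))[1] == 0xA0  \
            ? 2 : 0 )                                                     \
        : ( ((const U8 *)(s))[0] == 0xA0 ? 1 : 0 ) )
```

With this sketch, is_DEMO(" ", 1) and is_DEMO(" ", 0) both take the early ASCII branch, while U+00A0 is matched as the two bytes C2 A0 when is_utf8 is true and as the single byte A0 otherwise.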

is_WHATEVER_utf8(s)
is_WHATEVER_utf8_safe(s,e)

Do a lookup assuming the string is encoded in (normalized) UTF8.
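A minimal sketch of what the UTF-8 pair might look like for a hypothetical class DEMO (TAB, SPACE, U+00A0). It assumes that e points one past the last readable byte, as the (s,e) signature suggests, and that 0 means "no match"; the macro bodies are written by hand for illustration and are not generated output.

```c
typedef unsigned char U8; /* stand-in for perl's U8 type */

/* Unchecked form: the caller guarantees enough readable bytes at s.
 * U+00A0 is the two-byte sequence C2 A0 in UTF-8. */
#define is_DEMO_utf8(s)                                                   \
    ( ((const U8 *)(s))[0] == 0x09 || ((const U8 *)(s))[0] == 0x20 ? 1    \
    : ( ((const U8 *)(s))[0] == 0xC2 && ((const U8 *)(s))[1] == 0xA0 ) ? 2 \
    : 0 )

/* Checked form: never reads at or beyond e.
 * With two or more bytes available the unchecked form is safe to use;
 * with exactly one byte only the single-byte members can match. */
#define is_DEMO_utf8_safe(s, e)                                           \
    ( (const U8 *)(e) - (const U8 *)(s) > 1 ? is_DEMO_utf8(s)             \
    : (const U8 *)(e) - (const U8 *)(s) > 0                               \
      ? ( ((const U8 *)(s))[0] == 0x09 || ((const U8 *)(s))[0] == 0x20    \
          ? 1 : 0 )                                                       \
    : 0 )
```

The checked form refuses to match a truncated C2 A0 sequence when only the first byte lies before e.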

is_WHATEVER_latin1(s)
is_WHATEVER_latin1_safe(s,e)

Do a lookup assuming the string is encoded in latin-1 (aka plain octets).
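For comparison, a latin-1 sketch for the same hypothetical class DEMO (TAB, SPACE, U+00A0). Every member is a single octet here, so the checked form only needs to confirm that at least one byte is available. The meaning of e (one past the last readable byte) and the 0 return for "no match" are assumptions of this example.

```c
typedef unsigned char U8; /* stand-in for perl's U8 type */

/* In latin-1 each member of DEMO is one octet: 0x09, 0x20 or 0xA0. */
#define is_DEMO_latin1(s)                                                 \
    ( ((const U8 *)(s))[0] == 0x09 || ((const U8 *)(s))[0] == 0x20        \
      || ((const U8 *)(s))[0] == 0xA0 ? 1 : 0 )

/* Checked form: only dereference s if it lies before e. */
#define is_DEMO_latin1_safe(s, e)                                         \
    ( (const U8 *)(e) > (const U8 *)(s) ? is_DEMO_latin1(s) : 0 )
```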

is_WHATEVER_cp(cp)

Check whether a given codepoint (hypothetically a U32) matches. The condition is constructed so as to "break out" as early as possible if the codepoint is out of range of the condition.

In other words:

  (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))

Thus if the codepoint is below X, only two comparisons will be done. This makes matching lookups slower, but non-matching lookups faster.
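To make the nesting concrete, here is a sketch of the pattern for a hypothetical class DEMO with the three members 0x09, 0x20 and 0xA0, listed in ascending order. The class and the exact layout are assumptions for illustration, not generated output.

```c
/* Hypothetical class DEMO = { 0x09, 0x20, 0xA0 }, sorted ascending.
 * Each level tests for equality first, then only descends into the
 * next level if cp is greater than the member just tested, so a
 * codepoint below the smallest member fails after the first two
 * comparisons (cp == 0x09, then cp > 0x09).
 * Note that cp is evaluated several times, so it should be free of
 * side effects. */
#define is_DEMO_cp(cp)                                  \
    ( (cp) == 0x09 || ( (cp) > 0x09 &&                  \
    ( (cp) == 0x20 || ( (cp) > 0x20 &&                  \
      (cp) == 0xA0 ) ) ) )
```

The expression evaluates to 1 for a member of the class and 0 otherwise.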

Additionally, it is possible to generate what_ variants that return the codepoint read instead of the number of octets read; this is done by suffixing '-cp' to the type description.
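A sketch of how a what_ variant might differ from its is_ counterpart, using a hypothetical class DEMO (TAB, SPACE, U+00A0) in UTF-8. The macro name what_DEMO_utf8, its body, and the use of 0 as the "no match" value are assumptions for this example rather than generated output.

```c
typedef unsigned char U8; /* stand-in for perl's U8 type */

/* Same tests as a length-returning matcher would perform, but each
 * successful branch yields the codepoint that was read instead of
 * the number of octets consumed. */
#define what_DEMO_utf8(s)                                                 \
    ( ((const U8 *)(s))[0] == 0x09 ? 0x09                                 \
    : ((const U8 *)(s))[0] == 0x20 ? 0x20                                 \
    : ( ((const U8 *)(s))[0] == 0xC2 && ((const U8 *)(s))[1] == 0xA0 )    \
      ? 0xA0                                                              \
    : 0 )
```

Here the two bytes C2 A0 yield 0xA0, where a length-returning matcher would yield 2.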

CODE FORMAT

    perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"

AUTHOR

Author: Yves Orton (demerphq) 2007

BUGS

No tests directly here (although the regex engine will fail tests if this code is broken). Insufficient documentation and no Getopts handler for using the module as a script.

LICENSE

You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the README file.