
NAME

Text::RewriteRules - A system to rewrite text using regexp-based rules

VERSION

Version 0.10

SYNOPSIS

    use Text::RewriteRules;

    RULES email
    \.==> DOT 
    @==> AT 
    ENDRULES

    email("ambs@cpan.org") # returns ambs AT cpan DOT org

    RULES/m inc
    (\d+)=e=> $1+1 
    ENDRULES

    inc("I saw 11 cats and 23 docs") # returns I saw 12 cats and 24 docs

ABSTRACT

This module provides a simplified syntax for regexp-based rules that rewrite text. You define a set of rules, and the system applies them until no rule can be applied any more.

Two variants are provided:

  1. traditional rewrite (RULES function):

     while a substitution is possible
     | apply the first substitution rule that matches
  2. cursor-based rewrite (RULES/m function):

     add a cursor at the beginning of the string
     while the cursor has not reached the end of the string
     | apply a substitution just after the cursor and advance the cursor
     | or advance the cursor if no rule can be applied
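
The two strategies can be sketched in plain Perl, without the filter syntax. This is only an illustration of the control flow described above, not the code the module generates; the function names email_sketch and inc_sketch are hypothetical, and the cursor is modelled here with pos() and \G, which is an assumption of this sketch.

```perl
use strict;
use warnings;

# Traditional rewrite: after every successful substitution start again
# from the first rule; stop when no rule matches.
sub email_sketch {
    my ($text) = @_;
    while (1) {
        next if $text =~ s/\./ DOT /;   # rule 1
        next if $text =~ s/\@/ AT /;    # rule 2
        last;                           # no rule matched
    }
    return $text;
}

# Cursor-based rewrite: try the rule exactly at the cursor; otherwise
# copy one character and move the cursor forward.
sub inc_sketch {
    my ($text) = @_;
    my $out = '';
    pos($text) = 0;
    while (pos($text) < length $text) {
        if    ($text =~ /\G(\d+)/gc) { $out .= $1 + 1 }   # rule applies here
        elsif ($text =~ /\G(.)/gcs)  { $out .= $1 }       # just advance
    }
    return $out;
}

print email_sketch('ambs@cpan.org'), "\n";
print inc_sketch('I saw 11 cats and 23 docs'), "\n";
```

Note that the inc rule could not be run with the traditional strategy: its output always contains a number, so the rule would match again forever. Under the cursor strategy the replaced text is already behind the cursor and is never revisited.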

DESCRIPTION

A lot of computer science problems can be solved using rewriting rules.

Rewriting rules consist mainly of two parts: a regexp (LHS: Left Hand Side) that is matched against the text, and the string that replaces the content matched by the regexp (RHS: Right Hand Side).

Now, why not use a simple substitution? Because we want to define a set of rules and match them again and again, until no LHS regexp matches any more.

A point of discussion is the syntax used to define such a system. A brief discussion showed that some users would prefer a function that receives a hash with the rules, while others prefer some syntactic sugar.

We took the latter approach: we use Filter::Simple so that a specific non-Perl syntax can be added inside the Perl script. This improves the legibility of big rewriting rule systems.

This documentation is divided into two parts. First comes the module reference: what each construct does, with a brief explanation. A tutorial follows, which will grow over time and releases.

SYNTAX REFERENCE

Note: most of the examples are deliberately silly, because that is the easiest way to explain the basic syntax.

The basic syntax for the rewrite rules is a block that starts with the keyword RULES and ends with the keyword ENDRULES. Everything between them is handled by the module and interpreted as rules or comments.

The RULES keyword can handle a set of flags (we will see that later), and requires a name for the rule-set. This name will be used to define a function for that rewriting system.

   RULES functioname
    ...
   ENDRULES

The function is defined in the main namespace where the RULES block appears.

In this block, each line can be a comment (Perl style), an empty line or a rule.

Basic Rule

A basic rule is a simple substitution:

  RULES foobar
  foo==>bar
  ENDRULES

The arrow ==> is used as the delimiter. The regexp to match is on its left, and the substitution is on its right. So, the previous block defines a foobar function that replaces every foo with bar.

Although this can seem similar to a global substitution, it is not. A global substitution makes a single pass over the string, so it cannot loop forever. With this module an endless loop is very easy to write, because the rules are applied again to the text they have just produced.
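
To see the difference, here is a small plain-Perl comparison. It assumes a hypothetical one-rule system aa==>a and emulates the rewriting by hand with a loop, so it illustrates the behaviour rather than the module's own code.

```perl
use strict;
use warnings;

# Global substitution: one left-to-right pass over the string.
my $global = 'aaaa';
$global =~ s/aa/a/g;

# Rewriting: apply the rule again and again until it no longer matches.
my $rewritten = 'aaaa';
1 while $rewritten =~ s/aa/a/;

print "global:    $global\n";
print "rewritten: $rewritten\n";
```

The global pass stops with two characters left, while the rewriting loop continues until a single a remains. By the same reasoning, a rule such as a==>aa would never terminate.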

You can use Perl syntax on both the left and the right hand side of the rule, including capture variables such as $1.

Execution Rule

Perl substitutions support execution of the replacement as code, so this module supports it as well. Here is an example:

  RULES foo
  (\d+)b=e=>'b' x $1
  (\d+)a=eval=>'a' x ($1*2)
  ENDRULES

So, any number followed by a b is replaced by that number of b's, and any number followed by an a is replaced by twice that number of a's.

You mark evaluation by using an e or eval inside the arrow. Remember that you can mix all these kinds of rules in the same rewriting system.
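
As a rough plain-Perl equivalent of the rule set above, the following sketch uses the s///e modifier inside a hand-written rewriting loop. The name foo_sketch is hypothetical and the loop only approximates what the module does.

```perl
use strict;
use warnings;

sub foo_sketch {
    my ($text) = @_;
    while (1) {
        next if $text =~ s/(\d+)b/'b' x $1/e;         # (\d+)b=e=>'b' x $1
        next if $text =~ s/(\d+)a/'a' x ($1*2)/e;     # (\d+)a=eval=>'a' x ($1*2)
        last;                                         # nothing left to rewrite
    }
    return $text;
}

print foo_sketch('3b2a'), "\n";   # prints bbbaaaa
```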

Conditional Rule

In some cases we want to perform a substitution only if the pattern matches and a set of conditions about the match are true.

For that, we use a three-part rule: the common rule plus a condition part, separated from the rule by !!. Conditional rules can be used with both basic and execution rules.

  RULES translate
  ([[:alpha:]]+)=e=>$dic{$1}!! exists($dic{$1})
  ENDRULES

The previous example would translate all words that exist in the dictionary.
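
The effect of the guard can be approximated in plain Perl with a single-pass substitution whose replacement checks the condition. This is a simplification: it does not reproduce the module's rewriting loop, and the dictionary contents and the name translate_sketch are made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical dictionary, for illustration only.
my %dic = (cat => 'gato', house => 'casa');

sub translate_sketch {
    my ($text) = @_;
    # Replace a word only when the condition holds; otherwise keep it.
    $text =~ s/([[:alpha:]]+)/exists $dic{$1} ? $dic{$1} : $1/ge;
    return $text;
}

print translate_sketch('the cat in the house'), "\n";
```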

Begin Rule

Sometimes it is useful to change something in the string before starting to apply the rules. For that, there is a special rule named begin (or b for short) that has only a RHS. This RHS is Perl code. Any Perl code. If you want to modify the string, use $_.

  RULES foo
  =b=> $_.=" END"
  ENDRULES
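
In plain Perl terms, the begin rule behaves like a statement run once on $_ before any rewriting starts. A minimal sketch, with the hypothetical name foo_begin_sketch:

```perl
use strict;
use warnings;

sub foo_begin_sketch {
    local $_ = shift;
    $_ .= " END";     # the RHS of the begin rule, executed once
    # the other rules of the set (none in this example) would run here
    return $_;
}

print foo_begin_sketch('some text'), "\n";   # prints some text END
```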

Last Rule

Just as you use last in Perl to leave a loop and skip its remaining code, you can use a last (or l) rule to stop rewriting when a specific pattern matches.

While the begin rule has only a RHS, the last rule has only a LHS:

  RULES foo
  foobar=l=>
  ENDRULES

This way, the rules iterate until the string matches foobar.
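
A last rule is what makes an otherwise endless system stop. The sketch below emulates by hand a hypothetical rule set with the last rule xxxx=l=> followed by the basic rule x==>xx; the name grow_sketch and the rules themselves are invented for the example.

```perl
use strict;
use warnings;

sub grow_sketch {
    my ($text) = @_;
    while (1) {
        last if $text =~ /xxxx/;      # last rule: stop as soon as it matches
        next if $text =~ s/x/xx/;     # basic rule: would grow forever alone
        last;                         # no rule matched
    }
    return $text;
}

print grow_sketch('x'), "\n";   # prints xxxx
```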

Rules with /x mode

It is possible to use the regular expressions /x mode in the rewrite rules. In this case:

  1. there must be an empty line between rules

  2. you can insert space and line breaks into the regular expression:

     RULES/x f1
     (\d+) 
     (\d{3}) 
     (000) 
     ==>$1 milhao e $2 mil!! $1 == 1
    
     ENDRULES
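
For comparison, the same pattern written directly in Perl with the /x modifier looks like the sketch below. It applies the rule once instead of looping, the name f1_sketch is hypothetical, and the condition is checked after the match as in the rule above.

```perl
use strict;
use warnings;

sub f1_sketch {
    my ($text) = @_;
    # Whitespace and line breaks inside the pattern are ignored under /x.
    if ($text =~ /
            (\d+)       # millions
            (\d{3})     # thousands
            (000)       # last three digits are zero
        /x && $1 == 1) {
        # Replace only the matched span with the RHS of the rule.
        substr($text, $-[0], $+[0] - $-[0], "$1 milhao e $2 mil");
    }
    return $text;
}

print f1_sketch('1234000'), "\n";   # prints 1 milhao e 234 mil
```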

TUTORIAL

At the moment, just a set of commented examples.

Example1 -- from numbers to Portuguese words (using traditional rewriting)

Example2 -- Naive translator (using cursor-based rewriting)

Conversion between numbers and words

Yes, you can use Lingua::PT::Nums2Words and similar modules (for other languages). However, before it existed we needed to write such a conversion tool ourselves.

Here I present a subset of the rules (for numbers below 1000). The generated text is Portuguese, but I think you can get the idea. I'll try to create a version for English very soon.

You can check the full code on the samples directory (file num2words).

  use Text::RewriteRules;

  RULES num2words
  100==>cem 
  1(\d\d)==>cento e $1 
  0(\d\d)==>$1
  200==>duzentos 
  300==>trezentos 
  400==>quatrocentos 
  500==>quinhentos 
  600==>seiscentos 
  700==>setecentos 
  800==>oitocentos 
  900==>novecentos 
  (\d)(\d\d)==>${1}00 e $2

  10==>dez 
  11==>onze 
  12==>doze 
  13==>treze 
  14==>catorze 
  15==>quinze 
  16==>dezasseis 
  17==>dezassete 
  18==>dezoito 
  19==>dezanove 
  20==>vinte 
  30==>trinta 
  40==>quarenta 
  50==>cinquenta 
  60==>sessenta 
  70==>setenta 
  80==>oitenta 
  90==>noventa 
  0(\d)==>$1
  (\d)(\d)==>${1}0 e $2

  1==>um 
  2==>dois 
  3==>três 
  4==>quatro 
  5==>cinco 
  6==>seis 
  7==>sete 
  8==>oito 
  9==>nove 
  0$==>zero 
  0==> 
    ==> 
   ,==>,
  ENDRULES

  num2words(123); # returns "cento e vinte e três"

Naive translator (using cursor-based rewriting)

 use Text::RewriteRules;
 %dict=(driver=>"motorista",
        the=>"o",
        of=>"de",
        car=>"carro");

 $word='\b\w+\b';

 if( b(a("I see the Driver of the car")) eq "(I) (see) o Motorista do carro" )
      {print "ok\n"}
 else {print "ko\n"}

 RULES/m a
 ($word)==>$dict{$1}!!                  defined($dict{$1})
 ($word)=e=> ucfirst($dict{lc($1)}) !!  defined($dict{lc($1)})
 ($word)==>($1)
 ENDRULES

 RULES/m b
 \bde o\b==>do
 ENDRULES

AUTHOR

Alberto Simões, <ambs@cpan.org>

José João Almeida, <jjoao@cpan.org>

BUGS

We know documentation is missing and you all want to use this module. In fact we are using it a lot, which explains why we don't have the time to write documentation.

Please report any bugs or feature requests to bug-text-rewrite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

ACKNOWLEDGEMENTS

Damian Conway for Filter::Simple

COPYRIGHT & LICENSE

Copyright 2004-2005 Alberto Simões and José João Almeida, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
