Ingy döt Net >
Pegex-0.21 >
lib/Pegex/Tutorial/Calculator.pod

When you look around the web for stuff about parsing, you inevitably find a bunch of examples about how to write a precedence parser for mathematical expressions.

- http://en.wikipedia.org/wiki/Operator-precedence_parserZ
- http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/
- http://www.hokstad.com/operator-precedence-parser.html
- http://www.perlmonks.org/?node_id=554516

This tutorial is the Pegex version of that. Pegex actually comes with an examples directory that contains two arithmetic expression parser/evaluator caclulator programs:

- https://github.com/ingydotnet/pegex-pm/blob/master/examples/calculator1.pl
- https://github.com/ingydotnet/pegex-pm/blob/master/examples/calculator2.pl

They both do the same thing but using different parsing approaches. We'll cover both of them in detail. I hope you'll find that Pegex handles operator precedence parsing elegantly.

Precedence parsers are interesting because you need to deal with operators that have the same precedence, different precedences and both left and right associativity.

Consider the equation:

1 - 2 ^ 3 ^ 4 + 5 * 6 / 7

Normal precedence and associativity rules make this the same as:

(1 - (2 ^ (3 ^ 4)) + ((5 * 6) / 7))

Our calculator should produce the same result for both. Note that this means we will be parsing numbers, 5 operators, parentheses, and separating whitespace.

Here's an example of the calculator program running:

> perl examples/calculator1.pl Enter an equation: 1+2*3 1+2*3 = 7 Enter an equation: (1 + 2) * 3 (1 + 2) * 3 = 9 Enter an equation:

Most of the solutions that you'll read about on the web, involve (or assume) a lexing/tokenizing step before parsing. Pegex always parses an input stream directly, pulling out "tokens" that it needs using regex captures. So the parse happens as one operation, which has many advantages.

But how do we get the operator precedence rules into this? Well, we have 2 different ways:

Our first example calculator uses what is known as the Operator Precedence Climbing method. See: http://en.wikipedia.org/wiki/Operator-precedence_parser#Precedence_climbing_method.

This is basically a clever technique of specifying our grammar rules such that they imply precedence. Here's the pegex grammar from the code:

expr: add_sub add_sub: mul_div+ % /~([<PLUS><DASH>])~/ mul_div: exp+ % /~([<STAR><SLASH>])~/ exp: token+ % /~<CARET>~/ token: /~<LPAREN>~/ expr /~<RPAREN>~/ | number number: /~(<DASH>?<DIGIT>+)~/

It's a little bit wonky but it works. It says that any expression is an add/subtract and that an add/subtract is really a multiply/divide etc. Finally after the last operator comes the number token and the parens.

It feels a bit backwards. One of the biggest drawbacks of PCM is that it becomes less and less efficient with more and more operators. It needs to go through the whole tree, just to find each number.

But it works and the code is minimal. The receiver class gets the numbers in the correct order, immediately evaluates the answer and returns the answer for each level. Whatever the return value of the final operation is, becomes the result of the parse. Here's the receiver class:

{ package Calculator; use base 'Pegex::Tree'; sub got_add_sub { my ($self, $list) = @_; $self->flatten($list); while (@$list > 1) { my ($a, $op, $b) = splice(@$list, 0, 3); unshift @$list, ($op eq '+') ? ($a + $b) : ($a - $b); } @$list; } sub got_mul_div { my ($self, $list) = @_; $self->flatten($list); while (@$list > 1) { my ($a, $op, $b) = splice(@$list, 0, 3); unshift @$list, ($op eq '*') ? ($a * $b) : ($a / $b); } @$list; } sub got_exp { my ($self, $list) = @_; $self->flatten($list); while (@$list > 1) { my ($a, $b) = splice(@$list, -2, 2); push @$list, $a ** $b; } @$list; } }

As you can see, it has an action method for each level or precedence. It loops over the expression, evaluating it. Whether it loops from left to right or right to left depends on the associativity that we want to use.

Our runner code looks like this:

while (1) { print "\nEnter an equation: "; my $input = <>; chomp $input; last unless length $input; calc($input); } sub calc { my $expr = shift; my $calculator = pegex($grammar, receiver => 'Calculator'); my $result = eval { $calculator->parse($expr) }; print $@ || "$expr = $result\n"; }

And that's the whole thing. We have a working calculator as specced!

However the real point of this is to explore good parsing techniques, and the PCM leaves us wanting to try something more efficient. Let's try another approach...

An age old way of parsing expressions is to somehow get the numbers and operators into an RPN (Reverse Polish Notation) stack, which is each operand follow by its operator. Once in that form, precedence and associativity are accounted for.

For example:

1 / 2 - ( -3 * 4 )

becomes:

1, 2, /, -3, 4, *, -

To evaluate an RPN you pop off an operator and then attempt to pop off and operand. If the operand is another operator you recurse. When you have 2 operands you do the operation and put the result back on the stack. When there is only 1 element on the stack, you are done. That's your result.

Let's look at our new grammar in `calculator2.pl`

:

expr: operand (operator operand)* operator: /~([<PLUS><DASH><STAR><SLASH><CARET>])~/ operand: num | /~<LPAREN>~/ expr /~<RPAREN>~/ num: /~(<DASH>?<DIGIT>+)~/

This is much easier to understand. We are just parsing out the tokens. In a (very real) sense, we are using Pegex as a lexer.

Now let's look at the receiver class:

{ package Calculator; use base 'Pegex::Tree', 'Precedence'; my $operator_precedence_table = { '+' => {p => 1, a => 'l'}, '-' => {p => 1, a => 'l'}, '*' => {p => 2, a => 'l'}, '/' => {p => 2, a => 'l'}, '^' => {p => 3, a => 'r'}, }; sub got_expr { my ($self, $expr) = @_; $self->precedence_rpn($expr, $operator_precedence_table); } }

This is also much simpler. There's only one method. What's going on? Well the secret is that I put the code to turn the tokens into RPN in a separate base class called `examples/lib/Precedence.pm`

.

This is an implementation of Edsger Dijkstra's famous Shunting-yard Algorithm from 1961! It's only 20 lines of Perl. I won't include it inline here, but have a look at it for yourself. https://github.com/ingydotnet/pegex-pm/blob/master/examples/lib/Precedence.pm

The Shunting-yard algorithm simply takes a list of expression tokens and transforms them into an RPN stack. It uses information from a precedence/associativity table like the one above.

Unlike `calculator1.pl`

where we evaluated as we parsed, `calculator2.pl`

creates an RPN which is akin to an AST. In other words, it's more like something an actually language compiler would do.

But we are writing a calculator and we still need to evaluate this pupy. I changed the runner code to look like this:

sub calc { my $expr = shift; my $calculator = pegex($grammar, receiver => 'Calculator'); my $rpn = eval { $calculator->parse($expr) }; my $result = RPN::evaluate($rpn); print $@ || "$expr = $result\n"; }

As you can see I also moved the evaluator code into a separate/reusable module: `examples/lib/RPN.pm`

. This is another 30 lines of Perl. Please take a look at it. https://github.com/ingydotnet/pegex-pm/blob/master/examples/lib/RPN.pm

So overall, this second solution was a bit more code, but also feels more solid on several levels.

Pegex strives to be the nicest and most reusable way to write new parsers. Operator precedence parsers are a necessary part of parsing mathematical expressions and computer languages. This tutorial showed you 2 ways to do it. As the demands for Pegex grow, we may see even more ways to do it.

syntax highlighting: