pod/Marpa_R2.pod - metacpan.org

# Copyright 2014 Jeffrey Kegler
# This file is part of Marpa::R2.  Marpa::R2 is free software: you can
# redistribute it and/or modify it under the terms of the GNU Lesser
# General Public License as published by the Free Software Foundation,
# either version 3 of the License, or (at your option) any later version.
#
# Marpa::R2 is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
# Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser
# General Public License along with Marpa::R2.  If not, see
# http://www.gnu.org/licenses/.

=head1 NAME

Marpa::R2 - Release 2 of Marpa

=head1 Synopsis

=for Marpa::R2::Display
name: Landing page synopsis
normalize-whitespace: 1

    use Marpa::R2;

    my $dsl = <<'END_OF_DSL';
    :default ::= action => [name,values]
    lexeme default = latm => 1

    Expression ::= Term action => ::first
    Term ::=
          Factor action => ::first
        | Term '+' Term action => do_add
    Factor ::=
          Number action => ::first
        | Factor '*' Factor action => do_multiply
    Number ~ digits
    digits ~ [\d]+
    :discard ~ whitespace
    whitespace ~ [\s]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );
    my $recce = Marpa::R2::Scanless::R->new(
        { grammar => $grammar, semantics_package => 'My_Actions' } );
    my $input = '42 * 1 + 7';
    $recce->read( \$input );

    my $value_ref = $recce->value;
    my $value = $value_ref ? ${$value_ref} : 'No Parse';

    sub My_Actions::do_add {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 + $t2;
    }

    sub My_Actions::do_multiply {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 * $t2;
    }

=for Marpa::R2::Display::End

=head1 Description

=head2 Overview

Marpa parses any language whose
grammar can be written in BNF.
That includes recursive grammars,
ambiguous grammars, infinitely ambiguous grammars and
grammars with useless or empty productions.
Marpa does both left- and right-recursion in linear time --
in fact if a grammar is in any class currently
in practical use, Marpa will parse it in linear time.

This document centers around a short tutorial
of the Scanless interface (SLIF).
This is the interface most suitable for beginners.
The SLIF is the most suitable interface for most advanced uses as well.

=head1 A simple calculator

The synopsis shows the code for an extremely simple calculator.
It handles only
addition and multiplication of integers.
The sections which follow explain, line by line, how it works.
The explanation will assume that the reader
understands BNF and the basics of grammars -- what rules are,
what symbols are, what the start symbol of a grammar is, etc.

=head2 Marpa::R2::Scanless::G::new

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    my $dsl = <<'END_OF_DSL';
    :default ::= action => [name,values]
    lexeme default = latm => 1
    
    Expression ::= Term action => ::first
    Term ::=
          Factor action => ::first
        | Term '+' Term action => do_add
    Factor ::=
          Number action => ::first
        | Factor '*' Factor action => do_multiply
    Number ~ digits
    digits ~ [\d]+
    :discard ~ whitespace
    whitespace ~ [\s]+
    END_OF_DSL

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$dsl } );

=for Marpa::R2::Display::End

The code first creates a new SLIF grammar.
SLIF grammars are
L<C<Marpa::R2::Scanless:G>|Marpa::R2::Scanless::G> objects.
They are created
with the
L<Marpa::R2::Scanless:G::new|Marpa::R2::Scanless::G/"new">
constructor.
The arguments to
L<Marpa::R2::Scanless::G::new|Marpa::R2::Scanless::G/"new">
are references to
hashes of named arguments.
In the key/value pairs of these hashes,
the
hash key
is the
name of the argument,
and the
hash value
is the
value of the named argument.

In the example, there is only one named argument to the SLIF grammar constructor:
C<source>.
The value of C<source> must be a reference to a string in the
L<SLIF's domain-specific language|Marpa::R2::Scanless::DSL>
(DSL).
In this example, the DSL consists of several rules
and pseudo-rules.

=head2 The default pseudo-rule

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    :default ::= action => [name,values]
    lexeme default = latm => 1

=for Marpa::R2::Display::End

These two lines set useful defaults.
The first sets a default semantics,
one which is especially useful for development.
This is a finished script, so the default semantics
is not used much.
We'll talk about this more when we discuss
L<semantics at the end|/"Semantics">.

The second line sets the longest acceptable tokens match (LATM)
style of lexing,
which is what you'll almost always want.
It is not the default for historical reasons, so your scripts will
almost always start with this line.

=head2 A G1 rule

Next follows a G1, or structural rule.
Structural rules are the kinds of rules typically seen in BNF --
they describe the symbols which provide the structure of the grammar,
but leave out details of whitespace.
The SLIF also handles the lexical details in this example,
but it does it via L0 rules,
which we will see shortly.

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    Expression ::= Term action => ::first

=for Marpa::R2::Display::End

As is normal for BNF rules, this rule consists of 
a left hand side symbol ("C<Expression>"),
the BNF operator ("C<::=>")
and a series of right hand side (RHS) symbols.
There is always exactly one left hand side (LHS)
symbol.
There may be any number of RHS symbols.
In the case of an empty rule,
the number of RHS symbols would be zero.
In this rule, there is one RHS
symbol, "C<Term>".

The BNF operator ("C<::=>") is what makes this rule
a G1 (structural) rule.
Later we will see lexical rules, which will use
the match operator ("C<~>").

After the rule is an adverb:
C<< action => ::first >>.
We'll explain the purpose of the C<action>
adverbs when we discuss
L<semantics|/"Semantics">

=head2 More complicated G1 rules

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    Term ::=
          Factor action => ::first
        | Term '+' Term action => do_add

=for Marpa::R2::Display::End

This rule says that a C<Term> may be one of two
alternatives: either a C<Factor>
or two C<Term>'s separated by an addition operator.
Immediately following is another G1 rule defining
a C<Factor>.  It is very similar in form
to the one for C<Term>.

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    Factor ::=
          Number action => ::first
        | Factor '*' Factor action => do_multiply

=for Marpa::R2::Display::End

=head2 L0 rules

The structural rules define the high-level structure of the grammar,
and ignore details of whitespace, comments, etc.
Now we look at how the low-level, lexical issues are handled.
This very simple calculator language does not allow comments,
but it does define whitespace.

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

          :discard ~ whitespace
          whitespace ~ [\s]+

=for Marpa::R2::Display::End

The C<:discard> rule is a pseudo-rule, which tells Marpa to use whatever
it matches to separate G1 symbols,
but otherwise to ignore it -- to "discard" it.
C<whitespace> is defined in the next rule as a sequence of one or more spaces.

Note the match operator ("C<~>") in the rule defining whitespace.
It tells Marpa that this rule is lexical and
should be interpreted exactly as written, character
by character.

The C<whitespace> rule is a special kind of rule in two respects.
First, its RHS is followed by a quantifier ("C<+>"), which makes it
a sequence rule.
Aside from the quantifier,
sequence rules may only have a single symbol or character class on their RHS.
The plus quantifier ("C<+>") means a sequence of one or more items.
The star quantifier ("C<*>") is also allowed, and it indicates a sequence of
zero or more items.

The whitespace items are defined by a character class: C<[\s]>.
Marpa supports the same character classes, and the same character class syntax,
as Perl does.

The next pair of L0 rules define the C<Number> symbol

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

          Number ~ digits
          digits ~ [\d]+

=for Marpa::R2::Display::End

The above two rules say that a C<Number> is a sequence of one or more digits.
C<Number> is a lexeme -- a G1 symbol which is defined and recognized at the lexical
(L0) level.
In this example, there are three other lexemes:
C<whitespace>, and the addition and multiplication operators.

We've already looked at
the C<whitespace> lexeme, which will be discarded without being seen by G1.
The addition and multiplication operators were defined with single quoted strings
in the G1 rules.  As a reminder, here's the rule for C<Term> again:

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    Term ::=
          Factor action => ::first
        | Term '+' Term action => do_add

=for Marpa::R2::Display::End

In the above rule, the single-quoted string C<'+'> implicitly defines a L0 lexeme.
Something similar happens with the C<'*'> string in the rule for a C<Factor>.

The SLIF's lexer mostly "does what you mean".
While the input is being read, it looks for all lexemes defined in the DSL.
Almost always, you'll want Marpa to look only for tokens that are
actually acceptable to the parse.
Telling Marpa to do so was the purpose of this line:

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    lexeme default = latm => 1

=for Marpa::R2::Display::End

LATM means "longest acceptable tokens match".
(LATM is not the default for historical reasons.)

Among the acceptable tokens,
Marpa looks for longest matches -- if multiple tokens match,
the longest match is the winner.
Marpa tolerates ambiguity,
so one feature special to Marpa is that LATM is a longest acceptable B<tokens> match --
if more than one token is longest, all of them are considered in the parse.
The logic of SLIF lexing is described with more precision in the
L<SLIF overview document|Marpa::R2::Scanless>.

=head2 Marpa::R2::Scanless::R::new

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    my $recce = Marpa::R2::Scanless::R->new(
        { grammar => $grammar, semantics_package => 'My_Actions' } );

=for Marpa::R2::Display::End

C<Marpa::R2::Scanless::R::new> creates a new SLIF recognizer.
Its arguments are references to hashes of named arguments.
In this example the first named argument is
the required argument: "C<grammar>".
The value of the
C<grammar>
named argument must be a Marpa::R2 SLIF
grammar.

The second argument is optional, but you will use it frequently.
The "C<semantics_package>" named argument tells Marpa in which Perl package to
look for the closures implementing the semantics for this grammar.
We will talk more about this below.

=head2 Marpa::R2::Scanless::R::read

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    my $input = '42 * 1 + 7';
    $recce->read( \$input );

=for Marpa::R2::Display::End

To parse a string,
we use 
the C<Marpa::R2::Scanless::R::read()> method.
In its simplest form,
as here,
the C<Marpa::R2::Scanless::R::read()> takes a reference
to a string containing the input stream as its argument.

=head2 Marpa::R2::Scanless::R::value

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    my $value_ref = $recce->value;
    my $value = $value_ref ? ${$value_ref} : 'No Parse';

=for Marpa::R2::Display::End

The C<Marpa::R2::Scanless::R::value()> method returns
a reference to the parse result's value,
if there was a parse result.
If there was no parse result,
C<Marpa::R2::Scanless::R::value()>
returns
C<undef>.

We have yet to describe how the Marpa SLIF computes the value
of a parse.
In fact,
up to this point,
we have been skipping everything that had to do
with semantics.
Now it is time to go back to those features.

=head2 Semantics

The value of the parse result, as returned via the C<value()> method,
is determined by the parse's B<semantics>.
Marpa's semantics are the traditional ones:
The input is seen as a tree which takes its structure from the G1 rules.
(This is why the G1 rules are called structural.)
The value of the parse results from repeatedly evaluating nodes of this
tree, starting at the bottom, with the results of child nodes made available
to their parent node when the parent node is evaluated.

Parse trees are usually drawn upside-down
with their root at the top,
and their "leaves" at the bottom.
In Marpa::R2's SLIF, the "leaves" are the symbols
that the G1 (structural) rules share with the L0 (lexical)
rules.
The symbols shared by L0 and G1
are those lexemes which are not
discarded.
In this example, the lexemes visible to G1 are
C<Number> and two operators
which are specified
with a quoted string:
"C<+>"
and
"C<*>".

Marpa assigns values to the nodes of the tree,
starting with the leaves.
Marpa's "leaves" will always be L0 symbols,
and their value by default is the literal
value at their location in the input stream.
In the case of the two operators described
by quoted string,
the value is that quoted string.
That is, the value of 'C<+>' is
'C<+>',
and the value of 'C<*>' is
'C<*>'.
The value of C<Number> will be the portion
of the input that matched the C<[\d]+>
pattern.

Starting with the values for leaves,
Marpa::R2 moves recursively
"up" the tree to its root,
assigning a value to each node of the tree
based on the value of its child nodes.
Each non-leaf node corresponds to a G1 rule,
and the children of the non-leaf node
correspond to the RHS symbols
of the rule.
When the non-leaf node is valued, its value becomes
the value of its LHS symbol,
and this value will become
the value of
a RHS symbol of another node with one
exception.

The one exception, the node with a LHS symbol
that does not become a RHS
symbol,
is the value of the top (or "root") node.
The value of the top node becomes the
value of the parse,
and this is the parse result value
to which the C<value()> method
returns a reference.

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    :default ::= action => [name,values]

=for Marpa::R2::Display::End

Each non-leaf node determines its value with an action.
The default pseudo-rule allows you to specify the default action.
(It is a pseudo-rule because its LHS, "C<:default>", is a pseudo-symbol,
not a real one.)
Often actions are Perl functions, which in this context are
called Perl semantic closures.

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    my $recce = Marpa::R2::Scanless::R->new(
        { grammar => $grammar, semantics_package => 'My_Actions' } );

=for Marpa::R2::Display::End

Above we saw the C<semantics_package> named argument used when
constructing the SLIF recognizer.
As we noted, this specifies the package that is used
to find the Perl semantic closures.

In this example the default semantics,
as specified by the C<default_action> named argument,
come from a "array descriptor" named "C<[name,values]>".
This indicates that, by default, the value of a rule is to be
a reference to an array consisting of the rule's name,
followed by the values of its children.

In this case,
the semantics is not actually used,
and you would usually change it to something more convenient
for your application.
But "C<[name,values]>" is an excellent starting point when you're
first developing a DSL and, since this code is intended as a
template, we've kept it.
For more about array descriptors, see the L<semantics
document|Marpa::R2::Semantics/"Array descriptor actions">

The other way we specify semantics in this example
is by using an
C<action> adverb
for a RHS alternative.
We've seen the C<action> adverb several times,
but skipped over it.
Now it is time to look at it.

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    Term ::=
          Factor action => ::first
        | Term '+' Term action => do_add
    Factor ::=
          Number action => ::first
        | Factor '*' Factor action => do_multiply

=for Marpa::R2::Display::End

The "C<::first>" action indicates that the value of a rule is to be
the value of its first child,
that is, the value corresponding to the first symbol of the rule's RHS.
(In the case of an empty rule, the value would be a Perl C<undef>).
(The initial double colon indicates a reserved action.)

The action for the second RHS alternative defining C<Term> is C<do_add>,
and the action for the second RHS alternative defining C<Factor> is C<do_multiply>.
To implement these actions, we need to "resolve" their names --
map the action names into 
the Perl closures which actually carry out the semantics.

The C<semantics_package> specified the package where we can find the actions:
"C<My_Actions>".
So, to resolve the C<do_multiply> action,
Marpa looks for a closure whose fully qualified name is C<My_Actions::do_multiply>,
which it finds:

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    sub My_Actions::do_multiply {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 * $t2;
    }

=for Marpa::R2::Display::End

The C<do_add> action is resolved
to a Perl semantic closure
in much the same way:

=for Marpa::R2::Display
name: Landing page synopsis
partial: 1
normalize-whitespace: 1

    sub My_Actions::do_add {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 + $t2;
    }

=for Marpa::R2::Display::End

The Perl semantic closures are callbacks.
They are called as each node in a parse tree is evaluated.

Each Perl semantic closure is called with one or more arguments.
The first argument to a value action is always a per-parse-tree
object, which the callbacks can use as a scratchpad.
In this example, the per-parse-tree object is not used.
The remaining arguments will be the values of the node's "children" --
in other words, the values computed for each of its RHS symbols, in order.
If the action is for an empty rule,
the per-parse-tree object will be its only argument.

Every value action is expected to return a value.
With one exception,
this value is passed up to a parent node
as an argument.
The exception is the value for the start rule.
The return value for the start rule becomes
the parse result.

=head1 Tainted data

Marpa::R2 exists to allow its input to alter execution in
flexible and powerful ways.
Marpa should not be used with untrusted input.
In Perl' s taint mode, it is a fatal error to use Marpa's SLIF
interface
with a tainted grammar, a tainted input string,
or tainted token values.

=head1 Threads

When used in a thread-safe Perl, Marpa::R2 should be thread-safe,
with one important restriction:
All Marpa objects that share the same grammar must be created
and used within a single thread.

This restriction may be lifted someday, but
in practice it does not seem onerous.
Note that 
you can use
the same grammar in different
threads by creating grammars
that are exact
copies of each other,
one grammar per thread.

=head1 The Marpa:: namespace

The C<Marpa::> top-level namespace is reserved.
For extensions to Marpa,
one appropriate place is the C<MarpaX::> namespace.
This practice helps avoid namespace collisions,
and follows a CPAN standard, as exemplified by
the
C<DBIx::>
C<LWPx::>
and
C<MooseX::>
which are for extensions of, respectively,
DBI, LWP and Moose.

=head1 Other documents

This document gives a semi-tutorial overview of Marpa's Scanless interface (SLIF).
For more details about the SLIF,
there is an
L<overview|Marpa::R2::Scanless>,
and pages describing
L<its DSL|Marpa::R2::Scanless::DSL>,
L<its grammar methods|Marpa::R2::Scanless::G>,
and
L<its recognizer methods|Marpa::R2::Scanless::R>.

Marpa has two other interfaces.
L<The thin
interface|Marpa::R2::Advanced::Thin>
provides direct access to the underlying
Libmarpa C library.
Of the Perl interfaces to Marpa,
the thin interface is the most low-level.
The thin interface offers efficient access to the full power of the
Marpa parse engine,
but it requires the application to do a lot of the work itself.

Now discouraged,
the L<named argument inteface (NAIF)|Marpa::R2::NAIF>
was Marpa::R2's first interface.
It is a more traditional,
middle level interface which uses Perl calls
instead of a DSL.

L<Marpa::R2::Vocabulary> is intended as a quick refresher in
parsing terminology,
emphasizing how the standard terms are used
in the Marpa context.
Marpa's standard semantics are fully described in the
L<Marpa::R2::Semantics> document.
Techniques for tracing and for debugging your Marpa grammars
are described in the
L<Marpa::R2::Tracing> document and the
L<Marpa::R2::Progress> document.
For those with a theoretical bent,
my sources, and other useful references, are described in
L<Marpa::R2::Advanced::Bibliography>.

=head1 Author

Jeffrey Kegler

=head2 Why is it called "Marpa"?

Marpa is the name of the greatest of the Tibetan "translators".
In his time (the 11th century AD) Indian Buddhism was
at its height.
Marpa's generation of scholars was devoted
to producing Tibetan versions of Buddhism's Sanskrit scriptures.
Marpa became the greatest of them,
and today is known as Marpa Lotsawa: "Marpa the Translator".

=head2 Blatant plug

Marpa is a character in my novel, B<The God Proof>.
B<The God
Proof> centers around Kurt GE<ouml>del's proof of God's existence.
Yes, I<that> Kurt GE<ouml>del, and yes, he really did work out a
God Proof (it's in his I<Collected Works>, Vol. 3, pp. 403-404).
B<The God Proof> is available
as a free download (L<http://www.lulu.com/content/933192>).
It can be purchased in print form at Amazon.com:
L<http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355>.

=head1 Support

L<Marpa::R2> comes without warranty.
Support is provided
on a volunteer basis
through the standard mechanisms for CPAN modules.
L<The Support document|Marpa::R2::Support> has details.

=head1 Copyright and License

=for Marpa::R2::Display
ignore: 1

  Copyright 2014 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.

=for Marpa::R2::Display::End

=cut

# Local Variables:
#   mode: cperl
#   cperl-indent-level: 4
#   fill-column: 100
# End:
# vim: expandtab shiftwidth=4:
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)