The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Marpa::R3::Semantics - How Marpa::R3 evaluates parses

Synopsis

    use Marpa::R3;

    my $grammar = Marpa::R3::Grammar->new(
        {   bless_package => 'My_Nodes',
            source        => \(<<'END_OF_SOURCE'),
    :default ::= action => [values] bless => ::lhs
    lexeme default = action => [ start, length, value ]
        bless => ::name

    :start ::= Script
    Script ::= Expression+ separator => comma
    comma ~ [,]
    Expression ::=
        Number bless => primary
        | '(' Expression ')' bless => paren assoc => group
       || Expression '**' Expression bless => exponentiate assoc => right
       || Expression '*' Expression bless => multiply
        | Expression '/' Expression bless => divide
       || Expression '+' Expression bless => add
        | Expression '-' Expression bless => subtract

    Number ~ [\d]+
    :discard ~ whitespace
    whitespace ~ [\s]+
    # allow comments
    :discard ~ <hash comment>
    <hash comment> ~ <terminated hash comment> | <unterminated
       final hash comment>
    <terminated hash comment> ~ '#' <hash comment body> <vertical space char>
    <unterminated final hash comment> ~ '#' <hash comment body>
    <hash comment body> ~ <hash comment char>*
    <vertical space char> ~ [\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
    <hash comment char> ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
    END_OF_SOURCE
        }
    );


    my $recce = Marpa::R3::Recognizer->new( { grammar => $grammar } );

    my $input = '42*2+7/3, 42*(2+7)/3, 2**7-3, 2**(7-3)';
    $recce->read(\$input);
    my $value_ref = $recce->value();
    die "No parse was found\n" if not defined $value_ref;

    # Result will be something like "86.33... 126 125 16"
    # depending on the floating point precision
    my $result = ${$value_ref}->doit();

    package My_Nodes;

    sub My_Nodes::primary::doit { return $_[0]->[0]->doit() }
    sub My_Nodes::Number::doit  { return $_[0]->[2] }
    sub My_Nodes::paren::doit   { my ($self) = @_; $self->[1]->doit() }

    sub My_Nodes::add::doit {
        my ($self) = @_;
        $self->[0]->doit() + $self->[2]->doit();
    }

    sub My_Nodes::subtract::doit {
        my ($self) = @_;
        $self->[0]->doit() - $self->[2]->doit();
    }

    sub My_Nodes::multiply::doit {
        my ($self) = @_;
        $self->[0]->doit() * $self->[2]->doit();
    }

    sub My_Nodes::divide::doit {
        my ($self) = @_;
        $self->[0]->doit() / $self->[2]->doit();
    }

    sub My_Nodes::exponentiate::doit {
        my ($self) = @_;
        $self->[0]->doit()**$self->[2]->doit();
    }

    sub My_Nodes::Script::doit {
        my ($self) = @_;
        return join q{ }, map { $_->doit() } @{$self};
    }

About this document

This document describes the semantics for Marpa::R3.

What is semantics?

A parser is an algorithm that takes a string of symbols (tokens or characters) and finds a structure in it. Traditionally, that structure is a tree.

Rarely is an application interested only in the tree. Usually the idea is that the string "means" something: the idea is that the string has a semantics. Traditionally and most often, the tree is an intermediate step in producing a value, a value which represents the "meaning" or "semantics" of the string.

"Evaluating" a tree means finding its semantics. The rest of this document describes Marpa's methods for evaluating trees. Those of you who have dealt with other traditional parsers, such as yacc and bison, will find Marpa's approach familiar.

Instances

At the start of evaluation, semantics is associated with instances of productions or of lexemes. An instance is an occurrence in terms of G1 locations. Every instance has two locations: a start location and an end location.

A production is the LHS of a rule, together with one of its RHS alternatives. It corresponds to a BNF production. Unless a rule is a prioritized rule, it has exactly one production.

Prioritized rules very often only have one production, in which case they are called trivial prioritized rules. But prioritized rules may have many productions.

Nodes

In a parse tree, nodes are points where the tree branches or terminates. Tree terminations are also called terminals or "leaves".

Every production instance in a Marpa parse is represented by a branch point (or "node") in the tree. The topmost node of a tree is its "root node". (Trees are easiest to draw upside down, so traditionally in programming, the top of a tree is its root.)

A node, or branch point, "branches" into zero or more "child nodes". The node just above a child node, the one from which the child node branches out, is called its parent node.

If the node is for a non-quantified production instance, the parent node is the LHS of the production, and the child nodes are the RHS of the production. If the node is for a quantified rule, the parent node is the LHS of the quantified rule, and the child nodes are the items of the sequence of symbols on the right hand side. If the node is for a lexeme, the node represents the lexeme's symbol and there will be no child nodes.

A parent node can have zero or more children. Productions with zero children are nulled production instances, and are "leaf nodes". Leaf nodes are also called terminals. In Marpa's parse trees, every terminal is either a lexeme or a nulled production instance.

In Marpa, evaluation only takes place within the structural (G1) subgrammar, and the descriptions of the behaviors of production and lexeme instances below applies only to the G1 subgrammar. L0 productions and terminal symbols do not become nodes in the parse tree, and are never evaluated.

The order of node evaluation

The nodes of a Marpa parse tree are evaluated recursively, left-to-right and bottom-up. This means that, when a parent node is evaluated, the values of all child nodes are known and available for use by the semantics. The final value of a parse is the value of the top node of the parse tree.

Parse trees

The calls of the value() method by a recognizer produce a series of zero or more parses trees, called a parse series. A recognizer will have only one parse series, unless it calls the series_restart() method.

There may be zero parses in a parse series, because there may be no valid parse of a virtual input. There may be more than one parse in a parse series, because Marpa allows ambiguous parsing. Full details about the life cycle of a Marpa recognizer, including a full treatment of parse series can be found in another document.

Nulled subtrees

A nulled subtree is a subtree of the parse tree formed by a nulled node and its direct and indirect child nodes. (All these child nodes will also be nulled nodes.) Before evaluation, Marpa prunes all nulled subtrees back to their topmost nulled node. Of all the ways of dealing with nulled subtrees, this is the simplest and Marpa's users have found it a natural approach. More detail on the semantics of nulled symbols and subtrees can be found in a separate document.

Actions and how Marpa finds them

The way in which Marpa::R3 finds the value of a node is called that node's action. Actions can be explicit or implicit. An explicit action is one that is explicitly specified by the application, in one of the ways to be described below. A node's implicit action is the one it performs if it has no explicit action.

Lexeme actions

The implicit action for a lexeme is to return its literal value in the input stream, as a string. An explicit default action name for lexemes may be set using the the lexeme default statement. A lexeme action cannot be a Perl closure action -- it must be one of the built-in actions that are appropriate for lexemes.

Production actions

The implicit action for a production instance is that specified by the action descriptor [name,values]. Array descriptors are described in detail below. Peeking ahead, when a production's array descriptor is [name,values], that production returns a Perl array of n+1 elements, where n is the length of the RHS alternative. The first element is the "name" of the production. The remaining n elements are the values of the production's RHS children, in lexical order.

An explicit action for a RHS alternative can be specified using the action adverb for the its RHS alternative. A default explicit action for RHS alternatives can be specified with a default pseudo-rule.

Nulled symbol actions

As mentioned, nulled subtrees are pruned back to their topmost symbol. Lexemes are never nulled, so a nulled symbol is always the LHS of a production, and the action is determined from the production, as just described.

A complication arises if the symbol appears on the LHS of more than one nullable production. Because the symbol is nulled, the input is no help in determining which production to use. The production whose semantics are used for a nulled symbol is determined as follows:

  • If all nullable productions have the same semantics, that semantics is used.

  • If one of the nullable productions is empty (that is, has a zero-length RHS), then the empty production's semantics are used.

  • In the remaining case, two or more of the productions have different action names, but none of the alternatives has a zero-length RHS. When this happens, Marpa throws an exception. One easy way to fix the issue, is to add an empty rule with the intended semantics.

In determining whether the semantics of two nullable productions are "the same", the blessing is taken into account. Two productions are considered to have different semantics if they are blessed differently. Marpa::R3's null semantics are described in more detail in a separate document.

Blessings

Part of a production's or lexeme's action may be a blessing. A blessing is the name of a Perl package. In the case of a production evaluation closure, the argument containing its child values will be blessed into that package.

Not all actions are production evaluation closures. An action may be, for example, an array descriptor action. In cases where the action is not a production evaluation closure, the value of the action will be blessed into that package.

Only Perl objects pointed to by references can be blessed. It is a fatal error to try to use a blessing with an inappropriate action.

Implicitly (that is, if no blessing was explicitly specified), an action is not blessed. The implicit action itself cannot be blessed -- an attempt to do so is a fatal error.

Explicit blessings are made using the bless adverb. The bless adverb is allowed

  • for RHS alternatives;

  • for lexemes;

  • for the default lexeme statement;

  • and for the default pseudo-rule.

An L0 RHS alternative cannot have a bless adverb.

The value of a bless adverb is called a blessing. If the blessing is a Perl word (a string of alphanumerics or underscores), the name of the class will be formed by prepending the value of the bless_package named argument, followed by a double colon ("::").

If the blessing begins with a double colon ("::"), it is a reserved blessing. The reserved blessings are as follows:

::undef

The RHS alternatives or lexemes will not be blessed. When this document states that a RHS alternative or lexeme has a blessing of ::undef, it means exactly the same thing as when it states that a RHS alternative or lexeme will not be blessed. For both RHS alternatives and lexemes, the implicit blessing is ::undef.

::lhs

The RHS alternative is blessed into a class whose name is based on the LHS of the RHS alternative. A blessing of ::lhs is not allowed for a lexeme.

The class will be the name of the LHS with whitespace changed to an underscore. (As a reminder, the whitespace in symbol names will have been normalized, with leading and trailing whitespace removed, and all other whitespace sequences changed to a single ASCII space.) When a ::lhs blessing value applies to a production, it is a fatal error if the LHS contains anything other than alphanumerics and whitespace. In particular, the LHS cannot already contain an underscore ("_"). The ::lhs blessing is most useful in a default pseudo-rule.

::name

The lexeme is blessed into a class whose name is based on the name of the lexeme. The ::name blessing is not allowed for a RHS alternative.

The class is derived from the symbol name in the same way, and subject to the same restrictions, as described above for deriving a class name from the LHS of a production. The ::name reserved blessing is most useful in the lexeme default statement.

If any production or lexeme of a grammar has a blessing other than ::undef, a bless_package is required, and failure to specify one results in a fatal error.

Explicit actions

There are four kinds of explicit action names:

  • Array descriptors

  • Reserved action names

  • Perl identifiers

  • Perl names

An explicit action is either a built-in action or a Perl closure action. Array descriptors and reserved action names are built-in actions. The other actions are Perl closure actions.

Array descriptor actions

    lexeme default = action => [ start, length, value ]
        bless => ::name

If an action is enclosed in square brackets, it is an array descriptor, and the value of the lexeme or production will be an array. Inside the array descriptor is a comma separated list of zero or more array item descriptors. The array item descriptors are keywords that describe how the array is to be filled out.

If the array descriptor is an empty pair of square brackets ("[]"), then there are zero array item descriptors, and the value will be an empty array. Otherwise the array item descriptors are interpreted as lists and those lists are used to fill out the array.

g1length

The g1length array item descriptor puts a single-element list into the array. That one element will be the length of the production or lexeme instance, in G1 locations.

g1start

The g1start array item descriptor puts a single-element list into the array. That one element will be the G1 start location of the production or lexeme instance. Together the g1length and g1start array item descriptors describe a G1 location span.

Typical applications will prefer to use the start and length array item descriptors, which report their results in terms of physical input stream locations, instead of G1 locations. G1 locations are useful in special cases, for example with application which do not scan monotonically forward in the physical input, but instead jump backwards in it. G1 locations are described in detail in another document.

length

The length array item descriptor puts a single-element list into the array. That one element will be the length of the production or lexeme instance. Length is in characters.

lhs

The lhs array item descriptor puts a single-element list into the array. That one element will be the LHS symbol ID of the production. Because of historical reasons, for a lexeme instance, it will the symbol ID, but for a nulling symbol it will be a Perl undef.

name

The name array item descriptor puts a single-element list into the array. This will always be a string. For a production whose name is defined, that one element will be the production name. For an unnamed production, it will be the name of the LHS symbol. For a lexeme, it will be the symbol name of the lexeme. For a nulling symbol it will be the name of that symbol.

production_id

The production_id array item descriptor puts a single-element list into the array. For a production, that one element will be the production ID. In other cases, that one element will be a Perl undef.

start

The start array item descriptor puts a single-element list into the array. That one element will be the start location of the production or lexeme instance. The start location is an offset in the input string. The elements of the length and start item descriptors are defined such that the end location is always start location plus length.

symbol

The symbol array item descriptor puts a single-element list into the array. This will always be the name of a symbol. For a production, it will be the name of the LHS symbol. For a lexeme, it will be the symbol name of the lexeme. For a nulling symbol it will be the name of that symbol.

value

For a production, the value array item descriptor puts a list of zero or more elements into the array. The list will contain the values of the production's children, in left-to-right order.

For a lexeme, the value array item descriptor puts a single-element list into the array. That one element will be a list containing a single element, the token value of the lexeme.

values

The value and values array item descriptors are synonyms, and may be used interchangeably for both productions alternatives and lexemes.

Example

The array item descriptors fill out the array in the order in which they appear in the array descriptor. For example, if we are dealing with a production, and the array descriptor is "[ start, length, value ]", then the return value is an reference to an array, whose length will vary, but which will contain at least two elements. The first element will be the start location in the input string of this production, and the second will be its length. The remaining elements will be the values of the production instance's RHS children, in lexical order. If the production instance is nulled, the array will contain only two elements: start location and length.

Reserved action names

If the action value begins with a double colon ("::"), it is a reserved action. The following are recognized:

  • ::array

    ::array is equivalent to [values]. This means that, for both lexeme and production instances, the actions [values], [value] and ::array will do exactly the same thing.

  • ::first

    The value of the production instance is that of the production instance's first child. If there is no such child, the value is a Perl undef. It is a fatal error if a RHS alternative with a ::first action is blessed. It is also a fatal error to use a ::first action with a lexeme.

  • ::undef

    The value of the production or lexeme instance will be a Perl undef. It is a fatal error if a RHS alternative with an ::undef action is blessed.

Perl identifiers as action names

An action name is considered to be a Perl identifier, if it is a sequence of one or more alphanumerics and underscores. If the action name is a Perl identifier, it is treated as the name of a Perl variable. To successfully resolve to actions, Perl identifiers must be resolved to Perl names, as described below.

Perl names as action names

For this purpose, a Perl name is a series of two or more Perl identifiers separated by double colons ("::"). Note that, by this definition, a Perl name cannot start with a double colon. Action names starting with double colons are always treated as reserved action names.

Action names which are Perl names by this definition are treated as if they were fully qualified Perl names. Fully qualified Perl names are resolved to variables in Perl's namespace, as described below.

The semantics package

To resolve Perl identifiers to Perl names, a semantics package must be defined. The semantics package can be defined using the grammar's semantics_package named argument.

If the user wants the Perl variables implementing the semantics in the main namespace, she can specify "main" as the semantics package. This is fine for small scripts and applications. For a large project, it is usually good practice to keep Perl variables intended for use by Marpa's semantics in their own namespace.

Resolving Perl identifiers to Perl names

A Perl identifier is resolved to a Perl name by prepending the semantic package, followed by a double colon ("::"). For a Perl identifier to resolve successfully to a Perl name, a semantics package must be defined.

For example, if the action name is "some_var", the action name will be regarded as a Perl identifer. If the semantics package is "My_Actions", Marpa will convert the action name to "My_Actions::some_var", and hand it on for processing as a fully qualified Perl name.

Resolving Perl names to Perl variables

Once Marpa has a fully qualified Perl name, it looks in Perl's symbol tables for a Perl subroutine with that name. If Marpa finds a Perl subroutine with that fully qualified Perl name, the action name is resolved to that subroutine, which then becomes a production evaluation closure.

Executing production evaluation closures

A production evaluation closure action is always called in scalar context, and its return value will be used as the value of its node. A production evaluation closure always has exactly two arguments:

  • The first argument is the per-parse object.

  • The second argument is a reference to an array containing the values of the production's child nodes, in lexical order. If the production is nulled, the array will contain zero elements.

Quantified production nodes

Everything just said about production nodes applies to nodes for quantified productions. But there is a difference between quantified productions and others, and it a big one if you are writing a production evaluation closure.

In other productions, the right hand side is fixed in length, and therefore the number of child nodes is known in advance. This is not the case with a quantified production. The production evaluation closure for a quantified production must be capable of dealing with a variable number of child nodes.

Action context

    sub do_S {
        my ($per_parse_object) = @_;
        my $production_id         = $Marpa::R3::Context::production_id;
        my $slg             = $Marpa::R3::Context::grammar;
        my ( $lhs, @rhs ) =
            map { $slg->symbol_display_form($_) } $slg->production_expand($production_id);
        $per_parse_object->{text} =
              "production $production_id: $lhs ::= "
            . ( join q{ }, @rhs ) . "\n"
            . "locations: "
            . ( join q{-}, Marpa::R3::Context::g1_range() ) . "\n";
        return $per_parse_object;
    } ## end sub do_S

In addition to the per-parse argument and their child values, production evaluation closures also have access to context variables.

  • $Marpa::R3::Context::grammar is set to the grammar being parsed.

  • $Marpa::R3::Context::recognizer is set to the recognizer that did this parse.

  • $Marpa::R3::Context::valuer is set to the valuer that is doing this evaluation.

  • $Marpa::R3::Context::production_id is the ID of the current production. Given the production ID, an application can find its LHS and RHS symbols using the grammar's production_expand() method.

  • Marpa::R3::Context::g1_range() returns the start and end G1 locations of the current production instance.

  • Marpa::R3::Context::g1_span() returns the start and length of the current production instance. The start is a G1 location, and the length is in G1 locations.

Bailing out of parse evaluation

    my $bail_message = "This is a bail out message!";

    sub do_bail_with_message_if_A {
        my ($action_object, $values) = @_;
        my ($terminal) = @{$values};
        Marpa::R3::Context::bail($bail_message) if $terminal eq 'A';
    }

    sub do_bail_with_object_if_A {
        my ($action_object, $values) = @_;
        my ($terminal) = @{$values};
        Marpa::R3::Context::bail([$bail_message]) if $terminal eq 'A';
    }

The Marpa::R3::Context::bail() static method is used to "bail out" of the evaluation of a parse tree. It will cause an exception to be thrown. If its first and only argument is a reference, that reference is the exception object. Otherwise, an exception message is created by converting the method's arguments to strings, concatenating them, and prepending them with a message indicating the file and line number at which the Marpa::R3::Context::bail() method was called.

Visibility of Perl object actions

Resolution from a Perl name to a Perl function takes place when the grammar is created by the grammar's new() method. The action functions are called during the Evaluation Phase, by the by the recognizer's value() method. More details about timing of action resolution and call are given in a separate document.

The per-parse argument

The first argument of every production evaluation closure is the per-parse argument. This is initialized

  • To the argument to the recognizer's value() method, if that argument is defined.

  • Otherwise, to an empty hashref.

The per-parse argument is destroyed once the evaluation of the parse tree is finished. Between creation and destruction, the per-parse argument is not touched by Marpa's internals -- it is reserved for use by the application.

The primary way of passing data while evaluating a parse tree is purely functional -- results from child nodes are passed up to parent nodes. Applications can use the per-parse argument for data which does not conveniently fit the functional model. Symbol tables are one common example of data that is best handled outside the functional model.

Parse order

If a parse is ambiguous, all parses are returned, with no duplication. By default, the order is arbitrary, but it is also possible to control the order. Details are in the document on parse order.

Infinite loops

Grammars with infinite loops (cycles) are generally regarded as useless in practical applications. Due to lack of interest, Marpa::R3 does not currently support them, although Libmarpa itself, Marpa's thin interface and the NAIF all do. Those interested in knowing more can look at the document on the NAIF's support of infinitely ambiguous grammars.

COPYRIGHT AND LICENSE

  Marpa::R3 is Copyright (C) 2018, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.