The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

Name

Marpa::R3::Glade - Low-level interface to Marpa's Abstract Syntax Forests (ASF's)

Synopsis

  my $grammar = Marpa::R3::Grammar->new(
      {   source => \(<<'END_OF_SOURCE'),
  :start ::= pair
  pair ::= duple | item item
  duple ::= item item
  item ::= Hesperus | Phosphorus
  Hesperus ::= 'a'
  Phosphorus ::= 'a'
  END_OF_SOURCE
      }
  );

  my $recce = Marpa::R3::Recognizer->new( { grammar => $grammar } );
  $recce->read( \'aa' );
  my $asf = Marpa::R3::ASF2->new( { recognizer => $recce } );
  die 'No ASF' if not defined $asf;
  my $output_as_array = asf_to_basic_tree($asf);
  my $actual_output   = array_display($output_as_array);

The code for asf_to_basic_tree() represents a user-supplied call using the interface described below. An full example of asf_to_basic_tree(), which constructs a Perl array "tree", is given below. array_display() displays the tree in a compact form. The code for it is also given below. The return value of array_display() is as follows:

    Glade 2 has 2 symches
      Glade 2, Symch 0, pair ::= duple
          Glade 6, duple ::= item item
              Glade 8 has 2 symches
                Glade 8, Symch 0, item ::= Hesperus
                    Glade 1, Hesperus ::= 'a'
                        Glade 14, Symbol 'a': "a"
                Glade 8, Symch 1, item ::= Phosphorus
                    Glade 13, Phosphorus ::= 'a'
                        Glade 17, Symbol 'a': "a"
              Glade 7 has 2 symches
                Glade 7, Symch 0, item ::= Phosphorus
                    Glade 9, Phosphorus ::= 'a'
                        Glade 22, Symbol 'a': "a"
                Glade 7, Symch 1, item ::= Hesperus
                    Glade 21, Hesperus ::= 'a'
                        Glade 25, Symbol 'a': "a"
      Glade 2, Symch 1, pair ::= item item
          Glade 8 revisited
          Glade 7 revisited

This interface is EXPERIMENTAL

This interface uses Marpa::R3's undocumented G1 level calls. It is experimental and will probably be replaced.

About this document

This document describes the low-level interface to Marpa's abstract syntax forests (ASF's). It assumes that you are already familiar with the high-level interface. This low-level interface allows the maximum flexiblity in building the forest, but requires the application to do much of the work.

Ambiguity: partitions versus symches

An abstract syntax forest (ASF) is similar to an abstract syntax tree (AST), but it has an additional ability -- it can represent an ambiguous parse. Ambiguity in a parse can come in two forms, and Marpa's ASF's treat the distinction as important. An ambiguity can be a symbolic choice (a symch), or a partition choice. Symbolic choices are the kind of ambiguity that springs first to mind -- a choice between rules, or a choice between a rule and token. Partition choices involve only one rule, but the RHS symbols of that rule divide the input up ("partition it") in different ways. I'll give examples below.

Symches and partitions are treated separately, because they behave very differently:

  • Symches are less common than partitions.

  • Factorings are frequently not of interest; symches are almost always of major interest.

  • Symches usually have just a few alternatives; the possible number of partitions easily grows into the thousands.

  • In the worst case, the number of symches is a constant that depends on size of the grammar. In the worst case, the number of partitions grows exponentially with the length of the string being partitioned.

  • The constant limiting the number of symches will almost always be of manageable size. The number of partitions can grow without limit.

An example of a symch

Hesperus is Venus's traditional name as an evening star, and Phosphorus (aka Lucifer) is its traditional name as a morning star. For the grammar,

    :start ::= planet
    planet ::= hesperus
    planet ::= phosphorus
    hesperus ::= venus
    phosphorus ::= venus
    venus ~ 'venus'

and the input string 'venus', the forest would look like

    Symbol #0 planet has 2 symches
      Symch #0.0
      GL2 Rule 0: planet ::= hesperus
        GL3 Rule 2: hesperus ::= venus
          GL4 Symbol venus: "venus"
      Symch #0.1
      GL2 Rule 1: planet ::= phosphorus
        GL5 Rule 3: phosphorus ::= venus
          GL6 Symbol venus: "venus"

Notice the tags of the form "GLn", where n is an integer. These identify the glade. Glades will be described in detail below.

The rules allow the string 'venus' to be parsed as either one of two planets: 'hesperus' or 'phosphorus', depending on whether rule 0 or rule 1 is used. The choice, at glade 2, between rules 0 and 1, is a symch.

An example of a partition

For the grammar,

    :start ::= top
    top ::= b b
    b ::= a a
    b ::= a
    a ~ 'a'

and the input 'aaa', a successful parse will always have two b's. Of these two b's one will always be short, deriving a string of length 1: 'a'. The other will always be long, deriving a string of length 2: 'aa'. But they can be in either order, which means that the two b's can divide up the input stream in two different ways: long string first; or short string first.

These two different ways of dividing the input stream using the rule

    top ::= b b

are called a partition. Here's Marpa's dump of the forest:

    GL2 Rule 0: top ::= b b
      Factoring #0
        GL3 Rule 2: b ::= a
          GL4 Symbol a: "a"
        GL5 Rule 1: b ::= a a
          GL6 Symbol a: "a"
          GL7 Symbol a: "a"
      Factoring #1
        GL8 Rule 1: b ::= a a
          GL9 Symbol a: "a"
          GL10 Symbol a: "a"
        GL11 Rule 2: b ::= a
          GL12 Symbol a: "a"

The structure of a forest

The terminology I use for ASF's pictures an ASF as a forest on a mountain. This mountain forest has glades, and there are paths between the glades. The term "glade" comes from the idea of a glade as a distinct place in a forest that is open to light.

The paths between glades have a direction -- they are always thought of as running one-way: downhill. If a path connects two glades, the one uphill is called an upglade and the one downhill is called a downglade.

There is a glade at the top of mountain called the "peak". The peak has no upglades.

The glade hierarchy

The glade hierarchy is a DAG ("directed acyclic graph") whose nodes are glades. DAGs are often called "trees", but the term tree is confusingly overloaded in this context. Internally, glade nodes have an 3-level internal structure made up of

  • one or more symches, which contain

  • zero or more partitions, which contain

  • one or more downglades.

When discussing symches and partitions inside the internal structure of a glade G, G will be called the containing glade.

If G is a glade, the downglades in the internal structure will be the child nodes of G in the DAG of glades. There may not be any downglades, in which case G will be a terminal in the DAG graph of glades.

Glades

Each glade node represents an instance of a symbol in one of the possible parse trees. This means that each glade has a symbol (called the "glade symbol"), and an "input span". An input span is an input start location, and a length in characters. Because it has a start location and a length, a span also specifies an end location in the input.

Each glade can be seen as a DAG of symches and partitions. This is called the internal DAG of the glade.

If each glade in the DAG of glades is expanded, so that each internal glade DAG becomes a sub-tree in the glade DAG, the result is the expanded glade DAG. In this context, the expanded glade DAG if often called the expanded DAG.

Symches

Every glade contains one or more symches. If a glade has only one symch, that symch is said to be unique. A symch is either a token symch or a rule symch. For a token symch, the glade symbol is the token symbol. For a rule symch, the glade symbol is the LHS of the rule.

Token symches

At most one of the symches in a glade can be a token symch. Token symches have no children. In the internal DAG of a glade, they are terminals.

Token symches are always unique. A token symch is a terminal in the internal DAG of its glade. A token symch is also a terminal in the expanded glade DAG.

Rule symches

There can be many rule symches in a glade -- one for every rule with the glade symbol on its LHS. As the name suggests, every rule symch has a rule associated with it. In every rule symch, the LHS of its rule will always be the same as the glade symbol of the glade that contains the rule symch.

Partitions

Every rule symch has one or more partitions. Each partition is an ordered set of cells, one for every symbol on the RHS of the rule. Each cell is mapped to a span of G1 locations.

A rule symch is said to contain its child partitions. In the parsing literature, the traditional term for a partition, is a "factoring". The term "factoring" is meant to suggest that it is a way of "dividing up" the input span of the containing glade among its RHS symbols. We prefer the term "partition" because it is less overloaded.

If a rule symch has only one partition, that partition is said to be unique. Because the number of partition can get out of hand, partitions may be omitted. A symch which omits partitions is said to be truncated. By default, every symch is truncated down to its first 42 partitions.

Downglades

Every partition has one or more cells. Each cell corresponds to a downglade. These cell-downglades are the children of the partition. The number of cells will be L, where L is the length of the RHS of the parent symch's rule. (The parent symch of a partition will always be a rule symch.)

Each cell is associated with a symbol and a G1 span. The symbol of the nth cell is the nth RHS symbol of the parent symch's rule. Since each cell has a symbol and a G1 span, each cell is a symbol instance. The symbol instance of a cell is the symbol instance of its associated downglade.

Within a glade, symches, partitions and downglades are tracked using zero-based indexes. The internal DAG of a glade has a 3-level structure -- symches, then partitiions, then downglades. Therefore, within a given containing glade G, every downglade can be uniquely identified as ($symch_ix, $partition_ix, $downglade_ix), where $symch_ix is a zero-based symch index, $partition_ix is a zero-based partition index, and $downglade_ix is a zero-based downglade index.

The glade ID

Each glade has a glade ID. This can be relied on to be a non-negative integer. A glade ID may be zero. Glade ID's are obtained from the "peak()" and "factoring_downglades()" methods.

Glade IDs are assigned dynamically as the ASF is traversed. This means the ID of a glade may be different, if the ASF is traversed differently. Programmers need to be aware that, while glade IDs can be relied on as constant unique identifiers with an ASF instance, they cannot be compared across ASF instances, even when those ASF's are based on the same parse tree.

Techniques for traversing ASF's

Memoization

When traversing a forest, you should take steps to avoid traversing the same glades twice. You can do this by memoizing the result of each glade, perhaps using its glade ID to index an array. When a glade is visited, the array can be checked to see if its result has been memoized. If so, the memoized result should be used.

This memoization eliminates the need to revisit the downglades of an already visited glade. Repeated subtraversals can happen when two glades share the same downglades, something that occurs frequently in ASF's.

Memoization does not eliminate multiple visits to a glade, but it does eliminate re-traversal of the glades downhill from it. In practice, the improvement in speed can be stunning. It will often be the difference between a program which is unuseably slow even for very small inputs, and one which is extremely fast even for large inputs.

The example in this POD includes a memoization scheme which is very simple, but adequate for most purposes. The main logic of its memoization is shown here.

        my ( $asf, $glade, $seen ) = @_;
        return bless ["Glade $glade revisited"], 'My_Revisit'
            if $seen->[$glade];
        $seen->[$glade] = 1;

Putting memoization in one of the very first drafts of your code will save you time and trouble.

Forest method

peak()

    my $peak = $asf->peak();

Returns the glade ID of the peak. This may be zero. All failures are thrown as exceptions.

Glade methods

glade_literal()

        my $literal = $asf->glade_literal($glade);

Returns the literal substring of the input associated with the glade. Every glade is associated with a span -- a start location in the input, and a length. On failure, throws an exception.

The literal is determined by the range. This works as expected if your application reads the input characters one-by-one, in order and without gaps except for characters that are normally discarded, such as whitespace. (We will call applications which read in this fashion, monotonic.) Most applications are monotonic, and yours is, unless you've taken special pains to make it otherwise. Computation of literal substrings for non-monotonic applications is addressed in "Literals and G1 spans" in Marpa::R3::Recognizer.

glade_g1_span()

    my ( $glade_start, $glade_length ) = $glade->g1_span();

Returns the G1 span of the input associated with the glade. Every glade is associated with a span -- a start location in the input, and a length. On failure, throws an exception.

glade_symch_count()

    my $symch_count = $asf->glade_symch_count($glade);

Requires a glade ID as its only argument. Returns the number of symches contained in the glade specified by the argument. On failure, throws an exception.

g1_glade_symbol_id()

    my $symbol_id    = $asf->g1_glade_symbol_id($glade);
    my $display_form = $grammar->g1_symbol_display_form($symbol_id);

Requires a glade ID as its only argument. Returns the symbol ID of the "glade symbol" for the glade specified by the argument. On failure, throws an exception.

Symch methods

g1_symch_rule_id()

    my $g1_rule_id = $asf->g1_symch_rule_id( $glade, $symch_ix );

Requires two arguments: a glade ID and a zero-based symch index. These specify a symch. If the symch specified is a rule symch, returns the rule ID. If it is a token symch, returns -1.

Returns a Perl undef, if the glade exists, but the symch index is too high. On other failure, throws an exception.

symch_is_truncated()

[ To be written. ]

symch_factoring_count()

    my $factoring_count =
        $asf->symch_factoring_count( $glade, $symch_ix );

Requires two arguments: a glade ID and a zero-based symch index. These specify a symch. Returns the count of partitions if the specified symch is a rule symch. This count will always be one or greater. Returns zero if the specified symch is a token symch.

Returns a Perl undef, if the glade exists, but the symch index is too high. On other failure, throws an exception.

Factoring methods

factoring_downglades()

    my $downglades =
        $asf->factoring_downglades( $glade, $symch_ix,
        $factor_ix );

Requires three arguments: a glade ID, the zero-based index of a symch and the zero-based index of a partition. These specify a partition. On success, returns a reference to an array. The array contains the glade IDs of the downglades in the partition specified.

Returns a Perl undef, if the glade and symch exist, but the partition index is too high. On other failure, throws an exception. In particular, exceptions are thrown if the symch is for a token; and if the glade exists, but the symch index is too high.

Methods for reporting ambiguity

    if ( $valuer->ambiguity_level() > 1 ) {
        my $asf = Marpa::R3::ASF2->new( { recognizer => $recce } );
        die 'No ASF' if not defined $asf;
        my $ambiguities = Marpa::R3::Internal_ASF2::ambiguities($asf);

        # Only report the first two
        my @ambiguities = grep {defined} @{$ambiguities}[ 0 .. 1 ];

        $actual_value = 'Application grammar is ambiguous';
        $actual_result =
            Marpa::R3::Internal_ASF2::ambiguities_show( $asf, \@ambiguities );
        last PROCESSING;
    }

ambiguities()

    my $ambiguities = Marpa::R3::Internal_ASF2::ambiguities($asf);

Returns a reference to an array of ambiguity reports in the ASF. The first and only argument must be an ASF object. The array returned will be be zero length if the parse was not ambiguous. Ambiguity reports are as described below.

While the ambiguities() method can be called to determine whether or not ambiguities exist, it is the more expensive way to do it. The $valuer->ambiguity_level() method tests an already-existing boolean and is therefore extremely fast. If you are simply testing for ambiguity, use the ambiguity_level() method instead. If you can save time when you know that a parse is unambiguous, you may want to test for ambiguity with the ambiguity_level() method before calling the ambiguities() method.

ambiguities_show()

  $actual_result =
    Marpa::R3::Internal_ASF2::ambiguities_show( $asf, \@ambiguities );

Returns a string which contains a description of the ambiguities in its arguments. Takes two arguments, both required. The first is an ASF, and the second is a reference to an array of ambiguities, in the format returned by the ambiguities() method.

Major applications will often have their own customized ambiguity formatting routine, one which can formulate error messages based, not just on the names of the rules and symbols, but on knowledge of the role that the rules and symbols play in the application. This method is intended for applications which do not have their own customized ambiguity handling. For those which do, it can be used as a fallback for handling those reports that the customized method does not recognize or that do not need special handling. The format of the returned string and is subject to change, and is intended for reading by humans only.

Ambiguity reports

The ambiguity reports returned by the ambiguities() method are of two kinds: symch reports and partition reports. Only ambiguities uppermost on a path are reported -- in other words, an ambiguity is not reported if it is downhill. Within glade, all of the partitions are considered downhill from the symches, so that if a glade has a symch ambiguity, there will be no partition ambiguity reports for that glade.

Typically, when there is more than one kind of ambiguity in an input span, only one is of real interest, and symch ambiguities are usually of more interest than partitions. And if one ambiguity is uphill from another, the downhill ambiguity is usually a side effect of the uphill one and of little interest.

Symch reports

A symch report is issued whenever, in a top-down traversal of the ASF, a glade is encountered which has more than one symch. A symch report takes the form

   [ 'symch', $glade ]

where $glade is the ID of the glade with the symch ambiguity. With this and the accessor methods in this document, an application can report full details of the symch ambiguity.

Factoring reports

A partition report identifies a sequence of RHS symbols which has more than one partition. Factoring reports identify not just a rule, but specific subsequences within the RHS which are multiply partitioned -- multipartitioned stretches. Marpa goes down to the symbol level, because a Marpa RHS can be very long. Marpa's sequence rules, especially, can have long stretches where the symbols are in sync with each other, broken by other stretches where they are out of sync. (A detailed definition of multipartitioned stretches is below.)

A partition report takes the form

    [ 'partition', $glade, $symch_ix, $rhs_ix1, $factor_ix2, $rhs_ix2 ];

where $glade is the ID of the glade with the partition ambiguity, and $symch_ix is the index of the symch involved. The multipartitioned stretch is described by two "identifying downglades". Both downglades are at the beginning of the stretch, and both will have the same input start location. The identifying downglades will differ in length.

The first of the two identifying factors has a partition index of 0, and its downglade index is $rhs_ix1. The second identifying factor has a partition index of $factor_ix2, and its downglade index is $rhs_ix2.

The identifying downglades will usually be enough for error reporting, which is the usual application of these reports. A multipartitioned stretch can be extremely large. Full details of it can be found by following up on the information in the partition report using the accessor methods described in this document.

The code for the synopsis

The asf_to_basic_tree() code

  sub asf_to_basic_tree {
      my ( $asf, $glade ) = @_;
      my $peak = $asf->peak();
      return glade_to_basic_tree( $asf, $peak, [] );
  } ## end sub asf_to_basic_tree

  sub glade_to_basic_tree {
      my ( $asf, $glade, $seen ) = @_;
      return bless ["Glade $glade revisited"], 'My_Revisit'
          if $seen->[$glade];
      $seen->[$glade] = 1;
      my $grammar     = $asf->grammar();
      my @symches     = ();
      my $symch_count = $asf->glade_symch_count($glade);
      SYMCH: for ( my $symch_ix = 0; $symch_ix < $symch_count; $symch_ix++ ) {
          my $g1_rule_id = $asf->g1_symch_rule_id( $glade, $symch_ix );
          if ( $g1_rule_id < 0 ) {
              my $literal      = $asf->glade_literal($glade);
              my $symbol_id    = $asf->g1_glade_symbol_id($glade);
              my $display_form = $grammar->g1_symbol_display_form($symbol_id);
              push @symches,
                  bless [qq{Glade $glade, Symbol $display_form: "$literal"}],
                  'My_Token';
              next SYMCH;
          } ## end if ( $g1_rule_id < 0 )

          # ignore any truncation of the partitions
          my $factoring_count =
              $asf->symch_factoring_count( $glade, $symch_ix );
          my @symch_description = ("Glade $glade");
          push @symch_description, "Symch $symch_ix" if $symch_count > 1;
          push @symch_description, $grammar->g1_rule_show($g1_rule_id);
          my $symch_description = join q{, }, @symch_description;

          my @factorings = ($symch_description);
          for (
              my $factor_ix = 0;
              $factor_ix < $factoring_count;
              $factor_ix++
              )
          {
              my $downglades =
                  $asf->factoring_downglades( $glade, $symch_ix,
                  $factor_ix );
              push @factorings,
                  bless [ map { glade_to_basic_tree( $asf, $_, $seen ) }
                      @{$downglades} ], 'My_Rule';
          } ## end for ( my $factor_ix = 0; $factor_ix < $factoring_count...)
          if ( $factoring_count > 1 ) {
              push @symches,
                  bless [
                  "Glade $glade, symch $symch_ix has $factoring_count partitions",
                  @factorings
                  ],
                  'My_Factorings';
              next SYMCH;
          } ## end if ( $factoring_count > 1 )
          push @symches, bless [ @factorings[ 0, 1 ] ], 'My_Factorings';
      } ## end SYMCH: for ( my $symch_ix = 0; $symch_ix < $symch_count; ...)
      return bless [ "Glade $glade has $symch_count symches", @symches ],
          'My_Symches'
          if $symch_count > 1;
      return $symches[0];
  } ## end sub glade_to_basic_tree

The array_display() code

Because of the blessings in this example, a standard dump of the output array is too cluttered for comfortable reading. The following code displays the output from asf_to_basic_tree() in a more compact form. Note that this code makes no use of Marpa, and works for all Perl arrays. It is included for completeness, and as a simple example of array traversal.

    sub array_display {
        my ($array) = @_;
        my ( undef, @lines ) = @{ array_lines_display($array) };
        my $text = q{};
        for my $line (@lines) {
            my ( $indent, $body ) = @{$line};
            $indent -= 6;
            $text .= ( q{ } x $indent ) . $body . "\n";
        }
        return $text;
    } ## end sub array_display

    sub array_lines_display {
        my ($array) = @_;
        my $reftype = Scalar::Util::reftype($array) // '!undef!';
        return [ [ 0, $array ] ] if $reftype ne 'ARRAY';
        my @lines = ();
        ELEMENT: for my $element ( @{$array} ) {
            for my $line ( @{ array_lines_display($element) } ) {
                my ( $indent, $body ) = @{$line};
                push @lines, [ $indent + 2, $body ];
            }
        } ## end ELEMENT: for my $element ( @{$array} )
        return \@lines;
    } ## end sub array_lines_display

Details

This section contains details not essential to understanding the main topic of this document. These details include restatements of what is said above in more formal terms, and formal statements of concepts that have been left to the intuition. Some readers find these details helpful, while others find them distracting. Segregating these details here serves the needs of both.

An alternative way of defining glade terminology

Here's a way of defining some of the above terms which is less intuitive, but more precise. First, define the glade length from glades A to glade B in an ASF as the number of glades on the shortest path from A to B, not including glade A. (Recall that paths are directional.) If there is no path between glades A and B, the glade length is undefined. Glade B is a downglade of glade A, and glade A is an upglade of glade B, if and only if the glade length from A to B is 1.

A glade A is uphill with respect to glade B, and a glade B is downhill with respect to glade A, if and only if the glade length from A to B is defined.

A peak of an ASF is a node without upglades. By construction of the ASF, there is only one peak.

Maximum symches per glade

Above, the point is made that the number of symches in a glade, even in the worst case, is a very manageable number. For a particular case, it is not hard to work out the exact maximum. Here are the details.

There can be at most one token symch, and if there is a token symch, it will be the only symch. There can be only one rule symch for every rule. In addition, all rules in a glade must have the glade symbol as their LHS. Therefore, the maximum number of symches in a glade is the number of rules with the glade symbol on their LHS.

Alignment and input stream locations

Because G1 spans map to input stream spans, symch location sets that are aligned or synced in G1 terms will be aligned or synced from the input stream point of view as well. But, if the input is not monotonic, the opposite is not necessarily the case. Because several G1 locations can share an input stream location, symch location sets that seem aligned or synced from the input stream point of view may not be considered aligned or synced from the G1 location point.

Alignment and RHS indexes

Symch location sets that are aligned from the G1 location point are not necessarily aligned from the point of the RHS indexes. Non-aligned RHS indexes are particularly likely for long sequence rules, where the RHS is a long string containing many repetitions of a single symbol.

COPYRIGHT AND LICENSE

  Marpa::R3 is Copyright (C) 2018, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.