The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Marpa::R2::Semantics::Rank - How ranks are computed

Synopsis

    my $source = <<'END_OF_SOURCE';
      :discard ~ ws; ws ~ [\s]+
      :default ::= action => ::array

      Top ::= List action => main::group
      List ::= Item3 rank => 3
      List ::= Item2 rank => 2
      List ::= Item1 rank => 1
      List ::= List Item3 rank => 3
      List ::= List Item2 rank => 2
      List ::= List Item1 rank => 1
      Item3 ::= VAR '=' VAR action => main::concat
      Item2 ::= VAR '='     action => main::concat
      Item1 ::= VAR         action => main::concat
      VAR ~ [\w]+

    END_OF_SOURCE

    my @tests = (
        [ 'a',                 '(a)', ],
        [ 'a = b',             '(a=b)', ],
        [ 'a = b = c',         '(a=)(b=c)', ],
        [ 'a = b = c = d',     '(a=)(b=)(c=d)', ],
        [ 'a = b c = d',       '(a=b)(c=d)' ],
        [ 'a = b c = d e =',   '(a=b)(c=d)(e=)' ],
        [ 'a = b c = d e',     '(a=b)(c=d)(e)' ],
        [ 'a = b c = d e = f', '(a=b)(c=d)(e=f)' ],
    );

    my $grammar = Marpa::R2::Scanless::G->new( { source => \$source } );
    for my $test (@tests) {
        my ( $input, $output ) = @{$test};
        my $recce = Marpa::R2::Scanless::R->new(
            {
                grammar        => $grammar,
                ranking_method => 'high_rule_only'
            }
        );
        $recce->read( \$input );
        my $value_ref = $recce->value();
        if ( not defined $value_ref ) {
            die 'No parse';
        }
        push @results, ${$value_ref};
    }
    sub flatten {
        my ($array) = @_;

        # say STDERR 'flatten arg: ', Data::Dumper::Dumper($array);
        my $ref = ref $array;
        return [$array] if $ref ne 'ARRAY';
        my @flat = ();
      ELEMENT: for my $element ( @{$array} ) {
            my $ref = ref $element;
            if ( $ref ne 'ARRAY' ) {
                push @flat, $element;
                next ELEMENT;
            }
            my $flat_piece = flatten($element);
            push @flat, @{$flat_piece};
        }
        return \@flat;
    }

    sub concat {
        my ( $pp, @args ) = @_;

        # say STDERR 'concat: ', Data::Dumper::Dumper(\@args);
        my $flat = flatten( \@args );
        return join '', @{$flat};
    }

    sub group {
        my ( $pp, @args ) = @_;

        # say STDERR 'comma_sep args: ', Data::Dumper::Dumper(\@args);
        my $flat = flatten( \@args );
        return join '', map { +'(' . $_ . ')'; } @{$flat};
    }

Description

This document describes rule ranking. Rule ranking plays a role in parse ordering, which is described in a separate document.

Lexeme locations and fenceposts

The lexeme locations are an ordered set of sets of lexemes. Numbering of lexeme locations is 0-based.

It is also useful to use an idea of locations between lexeme locations -- This is the classic "fencepost" issue -- sometimes you want to count sections of fence, and in other cases it is more convenient to count fenceposts.

Let the lexeme locations be from 0 to N. If I is greater than 1 and less than N, then lexeme fencepost I is before lexeme location I and after lexeme location I-1. lexeme fencepost 0 is before lexeme location 0. lexeme fencepost N is after lexeme location N

The above implies that

  • If there are N lexeme locations, there are N+1 lexeme fenceposts.

  • lexeme location I is always between lexeme fencepost I and lexeme fencepost I+1.

Lexeme locations and fenceposts are closely related to G1 locations.

Dotted Rules

To understand this document, it is important to understand what a dotted rule is. An acquaintance with dotted rules is also important in understanding Marpa's progress reports. Dotted rules are thoroughly described in the progress report documentation. This section repeats the main ideas from the perspective of this document.

Recall that a rule is a LHS (left hand side) and a RHS (right hand side). The LHS is always exactly one symbol. The RHS is zero or more symbols. Consider the following example of a rule, given in the syntax of Marpa's DSL.

    S ::= A B C

Dotted rules are used to track the progress of a parse through a rule. They consist of a rule and a dot position, which marks the point in the RHS which scanning has reached. It is called the dot position because, traditionally, the position is represented by a dot. The symbol before the dot is called the predot symbol.

The following is an example of a dotted rule. (The dot of a dotted rule is not part of Marpa's DSL but, when it is useful for illustration, we will use it in the notation in this document.)

    S ::= A B . C

In this rule, B is the predot symbol.

When the dot is after the last symbol of the RHS, the dotted rule is called a completion. Here is the completion for the above rule:

    S ::= A B C .

In this completion example, the symbol C is the predot symbol.

When the dot is before the first symbol of the RHS, the rule is called a prediction. Here is the prediction of the rule we've been using for our examples:

    S ::= . A B C

In predictions, there is no predot symbol.

Choicepoints

Informally a choicepoint is a place where the parser can decide among one or more parse choices. A choicepoint can be thought of either a set of parse choices, or as the tuple of the properties which all the parse choices for the choicepoint must have in common.

For a choicepoint to work, all the choices must have enough in common that each of them could be replaced with any other. Thought of as a tuple of properties, a choicepoint is a triple: (dp, start, current). In this triple,

  • dp is a dotted rule. The predot symbol of dp is the predot symbol of the choicepoint -- if dp is a prediction, there is no predot symbol. The rule of dp is the rule of the choicepoint. The dot position of dp is the dot position of the choicepoint.

  • start is the lexeme location where the dotted rule begins. start is sometimes called the origin of the choicepoint.

  • current is the lexeme location corresponding to the dot in dp. current is sometimes called the current location of the choicepoint.

The choicepoint is a prediction choicepoint if dp is a prediction. The choicepoint is a token choicepoint if it is not a prediction choicepoint and the predot symbol of dp is a token symbol. The choicepoint is a rule choicepoint if it is not a prediction choicepoint and the predot symbol of dp is the LHS of a rule. (Token symbols are never the LHS of a rule, and vice versa.)

Parse choices

As mentioned, a choicepoint can be seen as a set of one or more parse choices. From the point of view of the individual parse trees, the traversal is top-down and left-to-right.

Often, there is only one parse choice. When there is only one parse choice, the choicepoint is said to be trivial. Prediction and token choicepoints are always trivial. If all of the choicepoints of a parse are trivial, the parse is unambiguous.

Every rule choicepoint, and therefore every non-trivial choicepoint, has a set of parse choices associated with it. For a rule choicepoint, each parse choice is a duple: (predecessor, cause), where cause and predecesor are also choicepoints. The first element of the duple is the predecessor choicepoint, or predecessor, of the parse choice. The second element of the duple is the cause choicepoint, or cause of the parse choice.

Let res be a rule choicepoint and let (pred, cuz) be one of the parse choices of res. Then we say that res is the result of cuz and pred; and we say that cuz is one of the causes of res.

The predecessor of a parse choice represents a portion of the parse that "leads up to" the cause. The predecessor of a parse choice plays no role in ranking decisions, and this document will mostly ignore predecessors.

Causes

Since it is a choicepoint, a cause choicepoint must also be a triple. Let the result of cuz be res. Let the triple for for cuz be (cuz-dp, cuz-origin, cuz-current). Let the triple for for res be (res-dp, res-origin, res-current).

If res is the result of cuz, then

  • cuz-current is always the same as res-current. In other words, the current location of cuz is the same as the current location of res.

  • cuz-origin is before res-current. In other words, the origin of cuz is before the current location of res.

  • cuz-origin is at or after res-origin. In other words, the origin of cuz is at or after the origin of res. While it is not always properly between res-origin and res-current, cuz-origin is always in the range bounded by res-origin on one side, and by res-current on the other. cuz-origin can therefore be thought of as "in the middle" of the parse choice. For this reason, cuz-origin is often called the middle location of the parse choice.

  • The dotted rule of cuz will be a completion.

  • The LHS of the dotted rule of cuz will be the predot symbol of res.

Summing up the above, the causes of a rule choicepoint vary only by middle position and rule RHS. The causes of a rule choicepoint always share the same current lexeme location and rule LHS, and their dotted rules are always completions.

Rank is by Cause

The rank of a parse choice is that of its cause. The rank of a cause is the rank of its rule. Therefore, the rank of a parse choice is the rank of the rule of its cause. That last sentence is the most important sentence of this document, so we will repeat it:

    The rank of a parse choice is
    the rank of the rule of its cause.

Motivation

By the definition of the previous section, the rule of the choicepoint itself plays no direct role in ranking. Instead, the ranking is based on the rules of one of the choicepoint's child nodes. At first, this may seem unnecessarily roundabout, and even counter-intuitive, but on reflection, it will make sense.

Ranking cannot be based directly on the rank of the choicepoint's rule because every parse choice for that choicepoint shares the choicepoint's rule. We have not examined the internal structure of predecessors, but ranking is not based on them for the same reason: the rule of a predecessor is always the same as the rule of its result choicepoint, and therefore the rule of its predecessor is the same for every parse choice of a choicepoint.

On the other hand, the causes of parse choices can and often do differ in their rules. Therefore ranking is based on the rules of the causes of a choicepoint.

Note that ranking is only by direct causes. It might reasonably be asked, why not, at least in the case of a tie, look at causes of causes. Or why not resolve ties by looking at causes of predecessors?

Marpa's built-in rule ranking was chosen to be the most powerful system that could be implemented with effectively zero cost. Ranking by direct causes uses only information that is quickly and directly available, so that its runtime cost is probably not measurable.

Complexity was also an issue. If indirect causes are taken into account, we would need to specify which causes, and under what circumstances they are used. Only causes of causes? Or causes of predecessors? Depth-first, or breadth-first? Only in case of ties, or using a more complex metric? To arbitrary depth, or using a traversal that is cut off at some point?

Ranking by direct causes only means the answer to all of these questions is as simple as it can be. Once we get into analyzing the synopsis grammar in detail, the importance of simplicity will become clear.

To be sure, there are apps whose requirements justify extra overhand and extra complexity. For these apps, Marpa's ASF's allow full generality in ranking.

Examples

Our examples in this document will look at the ranked grammar in the synopsis, and at minor variations of it.

Longest highest, version 1

We will first consider the example as it is given in the synopsis:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item3 rank => 3
  List ::= Item2 rank => 2
  List ::= Item1 rank => 1
  List ::= List Item3 rank => 3
  List ::= List Item2 rank => 2
  List ::= List Item1 rank => 1
  Item3 ::= VAR '=' VAR action => main::concat
  Item2 ::= VAR '='     action => main::concat
  Item1 ::= VAR         action => main::concat
  VAR ~ [\w]+

Longest highest ranking

The DSL in the synopsis ranks its items "longest highest". Here "items" are represented by the symbols, <Item3>, <Item2> and <Item1>. The "longest" choice is considered to be the one with the most lexemes. Working this idea out for this grammar, we see that the items should rank, from highest to lowest: <Item3>, <Item2> and <Item1>.

The non-trivial choicepoints

Several examples and their results are shown in the synopsis. If you study the grammar, you will see that the non-trivial choicepoints will be for the dotted rules:

    Top ::= List .
    List ::= List . Item3
    List ::= List . Item2
    List ::= List . Item1

The above are all of the potential dotted rules for result choicepoints.

The cause choicepoints

In all of these dotted rules, the predot symbol is <List>. Recall that cause choicepoints are always completions. Since the predot symbol of all the non-trivial choicepoints is <List>, the dotted rule of a cause of the non-trivial choicepoints may be any of

    List ::= Item1 .
    List ::= Item2 .
    List ::= Item3 .
    List ::= List Item1 .
    List ::= List Item2 .
    List ::= List Item3 .

In fact, if we study the grammar more closely, we will see that the only possible ambiguity is in the sequence of items, and that ambiguities always take the form "(v=)(v)" versus "(v=v)". Therefore the only causes of non-trivial parse choices are

    List ::= Item3 .
    List ::= Item1 .
    List ::= List Item3 .
    List ::= List Item1 .

The first non-trivial choicepoints

Further study of the grammar shows that the first non-trivial choice must be between two parse choices with these cause choicepoints:

    List ::= Item3 .
    List ::= Item1 .

The rule List ::= Item3 has rank 3, and it obviously outranks the rule List ::= Item1, which has rank 1. Our example uses high_rank_only ranking, and our example therefore will leave only this choice:

    List ::= Item3 .

With only one choice left, the resulting choicepoint becomes trivial, and, as defined above, that remaining choice is "longest", and therefore the correct one for the "longest highest" ranking.

The second and subsequent non-trivial choicepoints

We've looked at only the first non-trivial choice for our example code. Again examining the grammar, we see that the second and subsequent non-trivial choices will all be between parse choices whose causes have these two dotted rules:

    List ::= List Item3 .
    List ::= List Item1 .

The rule List ::= List Item3 has rank 3, and it obviously outranks the rule List ::= List Item1, which has rank 1. Our example uses high_rank_only ranking, and our example therefore will leave only this choice:

    List ::= List Item3 .

Conclusion

We have now shown that our example will reduce all choicepoints to a single choice, one which is consistent with "longest highest" ranking. Since all choicepoints are reduced to a single choice, the ranked grammar in unambiguous.

This analysis made a lot of unstated assumptions. Below, there is a "Proof of correctness". It deals with this same example, but proceeds much more carefully.

Shortest highest, version 1

Here we see the grammar of the synopsis, reworked for a "shortest highest" ranking. "Shortest highest" is the reverse of "longest highest".

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item3 rank => 1
  List ::= Item2 rank => 2
  List ::= Item1 rank => 3
  List ::= List Item3 rank => 1
  List ::= List Item2 rank => 2
  List ::= List Item1 rank => 3
  Item3 ::= VAR '=' VAR action => main::concat
  Item2 ::= VAR '='     action => main::concat
  Item1 ::= VAR         action => main::concat
  VAR ~ [\w]+

Here are what the results will look like for "shortest highest".

    my @tests = (
        [ 'a',                 '(a)', ],
        [ 'a = b',             '(a=)(b)', ],
        [ 'a = b = c',         '(a=)(b=)(c)', ],
        [ 'a = b = c = d',     '(a=)(b=)(c=)(d)', ],
        [ 'a = b c = d',       '(a=)(b)(c=)(d)' ],
        [ 'a = b c = d e =',   '(a=)(b)(c=)(d)(e=)' ],
        [ 'a = b c = d e',     '(a=)(b)(c=)(d)(e)' ],
        [ 'a = b c = d e = f', '(a=)(b)(c=)(d)(e=)(f)' ],
    );

The reader who wants an example of a ranking scheme to work out for themselves may find this one suitable. The reasoning will very similar to that for the "longest highest" example, just above.

Longest highest, version 2

The previous examples have shown the rule involved in parse ranking in "spelled out" form. In fact, a more compact form of the grammar can be used, as shown below for "longest highest" ranking.

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item rank => 1
  List ::= List Item rank => 0
  Item ::= VAR '=' VAR rank => 3 action => main::concat
  Item ::= VAR '='     rank => 2 action => main::concat
  Item ::= VAR         rank => 1 action => main::concat
  VAR ~ [\w]+

Shortest highest, version 2

This is the grammar for "shortest highest", in compact form:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top ::= List action => main::group
  List ::= Item rank => 0
  List ::= List Item rank => 1
  Item ::= VAR '=' VAR rank => 1 action => main::concat
  Item ::= VAR '='     rank => 2 action => main::concat
  Item ::= VAR         rank => 3 action => main::concat
  VAR ~ [\w]+

Again, the reader looking for simple examples to work out for themselves may want to rework the argument given above for the "spelled out" examples for the compact examples.

Alternatives to Ranking

Reimplementation as pure BNF

It is generally better, when possible, to write a language as BNF, instead of using ranking. The advantage of using BNF is that you can more readily determine exactly what language it is that you are parsing: Ranked grammars make look easier to analyze at first glance, but the more you look at them the more tricky you realize they are.

These "pure BNF" reimplementations rely on an observation used in the detailed proof of the ranked BNF example: The parse string becomes easier to analyze when you see it as a sequence of "var-bounded" substrings, where "var-bounded" means the substring is delimited by the start of the string, the end of the string, or the fencepost between two <VAR> lexemes.

Longest highest as pure BNF

Here is the "longest highest" example, reimplemented as BNF:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top            ::= Max_Boundeds action => main::group
  Top            ::= Max_Boundeds Unbounded action => main::group
  Top            ::= Unbounded action => main::group
  Max_Boundeds   ::= Max_Bounded+
  Max_Bounded    ::= Eq_Finals Var_Final3
  Max_Bounded    ::= Var_Final
  Unbounded      ::= Eq_Finals
  Eq_Finals      ::= Eq_Final+
  Var_Final      ::= Var_Final3 | Var_Final1
  Var_Final3     ::= VAR '=' VAR action => main::concat
  Eq_Final       ::= VAR '='     action => main::concat
  Var_Final1     ::= VAR         action => main::concat
  VAR ~ [\w]+

Shortest highest as pure BNF

We can also reimplement the "shortest highest" example as BNF. One of the advantages of a BNF (re)implementation, is that it often clarifies the grammar. For example in this case, we note that the DSL rule

  Var_Final3     ::= VAR '=' VAR action => main::concat

is, in fact, never used. We therefore omit it:

  :discard ~ ws; ws ~ [\s]+
  :default ::= action => ::array

  Top            ::= Max_Boundeds action => main::group
  Top            ::= Max_Boundeds Unbounded action => main::group
  Top            ::= Unbounded action => main::group
  Max_Boundeds   ::= Max_Bounded+
  Max_Bounded    ::= Eq_Finals Var_Final
  Max_Bounded    ::= Var_Final
  Unbounded      ::= Eq_Finals
  Eq_Finals      ::= Eq_Final+
  Eq_Final       ::= VAR '='     action => main::concat
  Var_Final      ::= VAR         action => main::concat
  VAR ~ [\w]+

Abstract Syntax Forests (ASFs)

Ranking can also be implemented using Marpa's abstract syntax forests. ASFs are likely to be less efficient than the built in ranking, but they are a more general and powerful solution.

Comparison with PEG

For those familiar with PEG, Marpa's ranking may seem familiar. In fact, Marpa's ranking can be seen as a "better PEG".

A PEG specification looks like a BNF grammar. It is a common misconception that a PEG specification implements the same language that it would if interpreted as pure, unranked BNF. This is the case only when the grammar is LL(1) -- in other words, rarely in practical use.

Typically, a PEG implementer thinks their problem out in BNF, perhaps ordering the PEG rules to resolve a few of the choices, and then twiddles the PEG specification until the test suite passes. This results in a parser whose behavior is to a large extent unknown.

Marpa with ranking allows a safer form of PEG-style parsing. Both Marpa's DSL and PEG allow the implementer to specify any context-free grammar, but Marpa parses all context-free grammars correctly, while PEG only correctly parses a small subset of the context-free grammars and fakes the rest.

In Marpa, the implementer of a ranked grammar can work systematically. He can first write a "superset grammar", and then "sculpt" it using ranks until he has exactly the language he wants. At all points, the superset grammar will act as a "safety net". That is, ranking or no ranking, Marpa returns no parse trees that are not specified by the BNF grammar.

This "sculpted superset" is more likely to result in a satisfactory solution than the PEG approach. The PEG specification, re-interpreted as pure BNF, will usually only parse a subset of what the implementer needs, and the implementer must use ranks to "stretch" the grammar to describe what he has in mind.

Ranking is not as predictable as specifying BNF. The "safety net" provided by Marpa's "sculpted superset" guards against over-liberal parsing. This is important: over-liberal parsing can be a security loophole, and bugs caused by over-liberal parsing have been the topic of security advisories.

Determining the exact language parsed by a ranked grammar in PEG is extremely hard -- a small specialist literature has looked at this subject. Most implementers don't bother trying -- when the test suite passes, they consider the PEG implementation complete.

Ranking's effect is more straightforward in Marpa. Above, we showed very informally that the example in our synopsis does what it claims to do -- that the parses returned are actually, and only, those desired. A more formal proof is given below.

Since Marpa parses all context-free grammars, and Marpa parses most unambiguous grammars in linear time, Marpa often parses your language efficiently when it is implemented as pure BNF, without rule ranking. As one example, the ranked grammar in the synopsis can be reimplemented as pure unranked BNF. Where it is possible to convert your ranked grammar to a pure BNF grammar, that will usually be the best and safest choice.

Details

This section contains additional explanations, not essential to understanding the rest of this document. Often they are formal or mathematical. While some people find these helpful, others find them distracting, which is why they are segregated here.

Earley parsing

In this section we assume that the reader is comfortable reading and analyzing BNF. We will often use the basic theorem of Earley parsing. As a reminder, the theorem states that an instance of a dotted rule is present in a parse, if and only if that instance is valid for the input so far. More formally:

(EARLEY): Let G be a grammar. Let Syms be the set of symbols in G and let exactly one of the symbols in Syms be distinguished as the "start symbol". Call the "start symbol", <Start>.

Let Terms be a subset of Syms called the "terminals" of G. Let W be a string of symbols from Terms. Let W[i] be the i'th terminal of W, so that the first terminal of W is W[1].

Where alpha and beta are sequences of symbols in Syms, let alpha => beta mean that alpha derives beta in grammar G in zero or more steps.

In this theorem, let alpha, beta, gamma, delta be sequences of zero or more symbols in Syms.

Theorem: These two statement sets are equivalent:

Statement set 1 is true if and only if we have all of the following:

1a. eitem is an instance of a dotted rule, which we will treat as a triple: dp, origin, current. In the literature, a triple of this form is also called an "Earley item".

1b. dp is the dotted rule <A> ::= beta . gamma>.

Statement set 2 is true if and only if we have all of the following:

2a. alpha => W[1] ... W[origin]

2b. beta => W[origin+1] .. W[current]

2c. <Start> => alpha <A> delta

The proof is omitted. It can be found in Jay Earley's thesis, in his original paper, and in Aho and Ullmann's 1972 textbook.

Proof of correctness

Let G be the grammar in the Marpa DSL synopsis, Let W be a string, and let ps-pure be the set of parses allowed by G, treated as if it was pure unranked BNF. Let ps-ranked be the set of parses allowed by the G after ranking is taken into account.

We will say string W is valid if and only if ps-pure is non-empty. We will say that a parse set is "unambiguous" if and only if it contains at most one parse.

In the less formal discussion above, we took an intuitive approach to the idea of "longest highest", one which made the unstated assumption that optimizing for "longest highest" locally would also optimize globally. Here we make our idea of "longest highest" explicit and justify it.

What "longest highest" means locally is reasonably obvious -- if item3 has more lexemes than item1, then item3 is "longer" than item1. It is less clear what it means globally. Items might overlap, so that optimizing locally might make the parse as a whole suboptimal.

We will define a "longest highest" parse of W as one of the parses of W with the fewest "items". This is based on the observations that W is fixed in length; and that, if items are longer, fewer of them will fit into W.

More formally, let p be a parse, and let ps be a parse set. Let Items(p) be the number of items in p. Let Items(ps), be the minimum number of items of any parse in ps. Then, if lh is a parse in ps and if, for every parse p in parse set ps, Items(lh) <= Items(p), we say that lh is a "longest highest" parse.

To conveniently represent token strings we will represent them as literal strings where "v" stands for a <VAR> lexemes and equal signs represent themselves. For example, we might represent the token string

    <VAR> = <VAR>

as the string

    v=v

Theorem: If W is a valid string, ps-ranked contains exactly one parse, and that parse is the "longest highest" parse.

Proof: Recall that a Marpa high_rank_only parse first parses according to the pure unranked BNF, and then prunes the parse using ranking.

Part 1: We will first examine ps-pure, the parse set which results from the pure BNF phase.

(1) "Item": An "item" will means an instance of one of the symbols Item1, Item2 or Item3. When we want to show a token string's division into items, we will enclose them in parentheses. For example "(v=)(v=)(v)" will indicate that the token string is divided into 3 items; and "(v=)(v=v)" will represent the same token string divided into 2 items.

(2) "Factoring": Analyzing the grammar G, we see that it is a sequence of items, and that G is ambiguous for a string W if and only if that string divides into items in more than one way. We will call a division of W into items a "factoring".

(3) "Independence": Let seq1 and seq2 be two factorings of W into items. Let first and last be two lexeme locations such that first <= last.

Let sub1 be a subsequence of one or more items of seq1 that starts at first and ends at last. Similarly, let sub2 be a subsequence of one or more items of seq2 that starts at first and ends at last. Let seq3 be the result of replacing sub1 in seq1 with sub2. Then, because G is a context free grammar, seq3 is also a factoring of W.

As an example, Let W be v=vv=v, let seq1 be (v=v)(v=v) and let seq2 be (v=)(v)(v=)(v). Let first be 4 and last be 6, so that sub1 is (v=v) and sub2 is (v=)(v). Then seq3 is (v=v)(v=)(v). We claimed that seq3 must be a factoring of W, and we can see that our claim is correct.

(4): Recall the discussion of lexeme fenceposts. We will say that a lexeme fencepost fp is a "factoring barrier" if it is not properly contained in any item. More formally, let First(it1) be first lexeme location of an item it1, and let Last(it1) be last lexeme location of it1. Let i be a lexeme fencepost i such that, for every item it, either i <= First(it) or i > Last(it). Then lexeme fencepost i is a factoring barrier.

(5): Where |W| is the length of W, we can see from (4) that fencepost |W| is a factoring barrier.

(6): Also from (4) we see that lexeme fencepost 0 is a factoring barrier.

(7): Let a "subfactoring" be a sequence of items bounded by factoring barriers.

(8): Let a "minimal subfactoring" be a subfactoring which does not properly contain any factoring barriers. From (5) and (6) we can see that W can be divided into one or more minimal subfactorings.

(9): We can see from G that two VARs never occur together in the same item. This means that wherever we do find two VARs in a row, the lexeme fencepost between them is a factoring barrier.

(10): No item begins with an equal sign ("="). Every item begins with a <VAR> token.

(11): Every minimal subfactoring starts with a <VAR> token. This is because every subfactoring starts with an item, and from (10).

(12): Every minimal subfactoring start with a <VAR> token and alternates equal signs and <VAR> lexemes: that is, it has the pattern "v=v=v= ...". This follows from (9) and (11).

(13): We will call a minimal subfactoring "var-bounded" or simply "bounded" if it starts and ends with a <VAR> token. By (12), this means that it has the pattern "v=v=v= ... v".

(14): We will call a minimal subfactoring "var-unbounded" or simply "unbounded" if it is not var-bounded. By (12), this means that it has the pattern "v=v= ... v=".

(15): If a <Item1> or <Item3> occurs in a minimal subfactoring, it is the last item in that subfactoring.

To show (15), assume for a reductio ad absurdam there is a minimal subfactoring with a non-final <Item1> or <Item3>. Call that subfactoring f, and call that item, it1.

(15a): Both <Item1> and <Item3> end with a <VAR> token, so that it1 ends in a <VAR> token.

(15b): Since it1, by assumption for the reductio, is non-final, there is a next item. Call the next item it2.

(15c): By (10), all items start with <VAR> lexemes, so that it2 starts with a <VAR> token.

(15d): From (15a) and (15c), we know that it1 ends with a <VAR> token and it2 starts with a <VAR> token, so that there are two <VAR> lexemes at the fencepost between it1 and it2.

(15e): From (9) and (15d), there is a factoring barrier at the fencepost between it1 and it2.

(15f): All items, including it1 and it2 and of non-zero length so that, from (15e), we know that the fencepost between it1 and it2 is properly inside subfactoring f.

(15g): From (15f) we see that there is a factoring barrier properly inside f. But f is assumed for the reductio to be minimal and therefore cannot properly contain a factoring barrier.

(15h): (15g) shows the reductio, and allows us to conclude that, if f contains an <Item1> or an <Item3>, that item must be the final item. This is what we needed to show for (15).

(16): By (3), every minimal subfactoring is independent -- in other words, no ambiguity crosses factoring barriers. So we narrow our consideration of ambiguities to two cases: ambiguities in unbounded minimal subfactorings, and ambiguities in bounded minimal subfactorings.

(17): Every unbounded minimal subfactoring is unambiguous. To show (17), let u be an unbounded minimal subfactoring. It follows from the definition of unbounded minimal subfactoring (14), that u ends in an equal sign. Both <Item1> and <Item3> end in <VAR> lexemes, so that no <Item1> or <Item3> can be the last item in u. Therefore, by (15), there is no <Item1> or <Item3> item in u. This means that every item in u is an <Item2> item. This in turn means that u is unambiguous, which shows (17).

(18): Every bounded minimal subfactoring parses either as a sequence of Item2's followed by an Item1; or as as a sequence of Item2's followed by an Item3. This follows from (1), (13) and (15).

(19): From (16), (17) and (18) it follows that every ambiguity between subfactorings is between bounded subfactorings whose divisions into items take the two forms

"(v=) ... (v=) (v)"

and

"(v=) ... (v=v)"

Part 2: We have now shown how pure BNF parsing works for the grammar in the synopsis. We next show the consequences of ranking.

(20): A completed instance of List ::= Item1 can only be found at lexeme location 1. We know this from G and (EARLEY).

(21): A completed instance of List ::= List Item1 can only be found at lexeme location 2 or after. We know this from G and (EARLEY).

(22) At most one dotted rule with Item1 as its last item appears in a cause at any choicepoint. We know this because all choices of a choicepoint must be completions with the same current location, and from (20) and (21).

(23): A completed instance of List ::= Item3 can only be found at lexeme location 3. We know this from G and (EARLEY).

(24): A completed instance of List ::= List Item3 can only be found at lexeme location 4 or after. We know this from G and (EARLEY).

(25) At most one dotted rule with Item3 as its last item appears in a cause at any choicepoint. We know this because all choices of a choicepoint must be completions with the same current location, and from (23) and (24).

(26): Consider the ambiguity shown in (19) and let current be the lexeme fencepost at its end. An item also ends at current so from G and (EARLEY), we know that there is at least one dotted rule instance whose predot symbol is List and whose current location is current. The choicepoint can have one of these dotted rules:

     Top ::= List .
     List ::= List . Item3
     List ::= List . Item2
     List ::= List . Item1

with location 0 as its origin and current as its current location. There may be more than one choicepoint fulfilling these criteria.

Of the choicepoints fulfilling the criteria, we arbitrarily pick one. We call this choicepoint result. We will see as we proceed that, while the choicepoints may have different dotted rules, our choice of one of them for result makes no difference in the ranking logic. The only properties of the result that we will use are dot position, predot symbol, and current location. These properties are the same regardless of which choicepoint we picked for result.

(27): Recall that the "middle location" of a choice is always the origin of its cause. From (26) we see that, for every possible rule in result, the predot symbol is the first symbol of the dotted rule. This means that the origin of the causes of result must be the same as the origin of result. From (26) we also know that the origin of result is always location 0. Therefore the origin of every one of the causes of result must be location 0. Therefore every one of the causes of result has the same middle location.

(28): From (26) and (27) we know that, if cuz is a cause of result, it has an origin at location 0; its current location is current; and its LHS is List. cuz may be any one of

     List ::= Item3 .
     List ::= Item2 .
     List ::= Item1 .
     List ::= List Item3 .
     List ::= List Item2 .
     List ::= List Item1 .

(29): From (19) and (EARLEY) we know that dotted rules ending in Item2 are not valid choices at location current.

(30): From (19), (22), (28) and (EARLEY) we know that exactly one of the following two dotted rules is the cause of a parse choice for result:

     List ::= Item1 .
     List ::= List Item1 .

(31): From (19), (25), (28) and (EARLEY) we know that exactly one of the following two dotted rules is the cause of a parse choice for result:

     List ::= Item3 .
     List ::= List Item3 .

(32): From (28), (29), (30) and (31) we know that we will have exactly two choices for result; that the final symbol of the cause in one will be Item1; and that the final symbol of the cause in the other will be Item3. From the grammar in the synopsis, we know that the cause ending in Item1 will have rank 1; and that the cause ending in Item3 will have rank 3. From this we see, since the ranking method is high_rank_only, that the choice whose final symbol is Item3 will be the only one kept in P-ranked. We further see from (19) that this choice will have fewer items than the alternative.

(33): From (19) we know that there is at most one ambiguity per minimal subfactoring. From (16) we know that ambiguity resolutions of a minimal subfactoring are independent of the ambiguity resolutions in other minimal subfactorings. From (32) we know that every subfactoring in P-ranked has exactly one factoring, and that that factoring is the one with the fewest items. Therefore, P-ranked will contain exactly one parse, and that that parse will be "longest highest".

QED.

Copyright and License

  Copyright 2022 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.