
Name

Marpa::R3::Ext_Scan - External scanning

Synopsis

    my @pause_location;
    my $recce = Marpa::R3::Scanless::R->new(
        {
            grammar        => $parser->{grammar},
            event_handlers => {
                'before lstring' => sub () {
                    ( undef, undef, undef, @pause_location ) = @_;
                    'pause';
                },

            }
        }
    );
    my $length = length $string;
    for (
        my $pos = $recce->read( \$string );
        $pos < $length;
        $pos = $recce->resume()
        )
    {
        my $start = $pause_location[1];
        my $length = $pause_location[2];
        my $value = substr $string, $start + 1, $length - 2;
        $value = decode_string($value) if -1 != index $value, '\\';
        $recce->lexeme_read_block( 'lstring', $value, undef, $start, $length ) // die;
    } ## end for ( my $pos = $recce->read( \$string ); $pos < $length...)
    my $per_parse_arg = bless {}, 'MarpaX::JSON::Actions';
    my $value_ref = $recce->value($per_parse_arg);
    return ${$value_ref};

About this document

This page describes external scanning. By default, Marpa::R3 scans based on the L0 grammar in its DSL. This DSL-driven scanning is called internal scanning.

But many applications find it useful or necessary to do their own scanning in procedural code. In Marpa::R3 this is called external scanning. External scanning can be used as a replacement for internal scanning. Marpa::R3 also allows an application to switch back and forth between internal and external scanning.

Tokens

In external scanning, the app controls tokenization directly. External scanning might also be called one-by-one scanning because the app feeds tokens to Marpa::R3 one at a time. This differs from internal scanning, in which Marpa::R3 tokenizes a string for the app.

Every token must have three things associated with it:

  1. A symbol name, which is required. The symbol name must be the name of a lexeme in both the L0 and G1 grammars. The symbol name tells the parser which symbol represents this token to the Marpa semantics. The symbol name, in other words, connects the token to the grammar.

  2. A symbol value or value, which may be undefined. The value of the token is also seen by the semantics.

  3. A literal equivalent, which is required and must be a span in the input. The literal equivalent of a token is not directly visible to the semantics, although in Marpa::R3 it can always be accessed, if desired. The literal equivalent is needed for the messages produced by tracing, debugging, error reporting, etc. If more than one token is accepted at a G1 location -- which can happen if tokens are ambiguous -- all of the tokens must have the same literal equivalent.
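The three parts of a token, and the rule that ambiguous tokens must share one literal equivalent, can be modeled in plain Perl. This is only an illustrative sketch: make_token(), same_literal_equivalent() and the hash keys are hypothetical names and are not part of the Marpa::R3 API.

```perl
use strict;
use warnings;

# Hypothetical model of a token: a required symbol name, an optional
# value, and a required literal equivalent given as a block span.
sub make_token {
    my ( $symbol_name, $value, $block_id, $offset, $length ) = @_;
    die "symbol name is required\n" if not defined $symbol_name;
    die "literal equivalent is required\n"
        if grep { not defined } $block_id, $offset, $length;
    return {
        symbol_name => $symbol_name,
        value       => $value,                             # may be undef
        span        => [ $block_id, $offset, $length ],    # literal equivalent
    };
}

# Returns 1 if every token in the list has the same literal equivalent,
# as required when several tokens are accepted at one G1 location.
sub same_literal_equivalent {
    my @tokens = @_;
    return 1 if @tokens < 2;
    my $first = join( q{,}, @{ $tokens[0]{span} } );
    for my $token ( @tokens[ 1 .. $#tokens ] ) {
        return 0 if $first ne join( q{,}, @{ $token->{span} } );
    }
    return 1;
}
```

Two tokens built with the same block ID, offset and length pass the check. Changing any one of the three makes it fail.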

Completion and non-completion methods

If a method might complete external scanning at a G1 location, that method is called an external scanning completion method, or just a completion method. Any other external scanning method is called a non-completion method.

There are only two non-completion external scanning methods: lexeme_alternative() and lexeme_alternative_literal(). These are low-level methods which prepare a list of tokens for the lexeme_complete() completion method.

Completion method details

The external scanning completion methods have almost all of their behaviors in common. For convenience, therefore, the usual behaviors of the completion methods are described in this section, and exceptions to these behaviors are noted in the descriptions of the individual methods.

External scanning completion can succeed or fail. If external scanning completion fails, the failure may be hard or soft. The only soft failure that can occur in external scanning completion is the rejection of a token.

Block location

Every external scanning completion must have a valid block span, unless that completion results in a hard failure. How that valid block span is specified varies by method. For the purposes of this section, let that block span be <$block_id, $offset, $length>. Also for the purposes of this section, we will define eolexeme, or "end of lexeme", as $offset + $length.

  • If external scanning completion succeeds and no event occurs, the current block is set to $block_id. The current offset is set to eolexeme. The current eoread of the current block will not be changed.

  • If external scanning completion succeeds and an event occurs, the current block is set to $block_id. The current offset is set to the event location. The event location will be the same as eolexeme, unless the event was a pre-lexeme event. The current eoread of the current block will not be changed.

  • This is a special case of the immediately preceding case. If external scanning completion succeeds and a pre-lexeme event occurs, the current block is set to $block_id. The current offset is set to the event location, which will be the same as $offset. The current eoread of the current block will not be changed.

  • If an external scanning completion method rejects a token, then the external scanning completion results in a soft failure. In this case the current block data remains unchanged.

  • Any failure in external scanning completion, other than token rejection, is a hard failure. In the case of a hard failure, no guarantee is made about the current block data. Marpa::R3 will attempt to leave the current block data valid and pointing to an "error location" -- that is, a location as relevant as possible to the error.
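The rules above for the current offset can be restated as a small plain-Perl function. This is a sketch of the documented behavior, not Marpa::R3 code: predicted_offset() and its outcome strings are hypothetical names chosen for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: given the current offset before the call, the
# offset and length of the block span, and the outcome of an external
# scanning completion, return the current offset the rules predict.
# Returns undef for a hard failure, where no guarantee is made.
sub predicted_offset {
    my ( $current_offset, $offset, $length, $outcome ) = @_;
    my $eolexeme = $offset + $length;    # "end of lexeme"
    return $eolexeme       if $outcome eq 'success';             # no event
    return $eolexeme       if $outcome eq 'event';               # not a pre-lexeme event
    return $offset         if $outcome eq 'pre-lexeme event';    # start of lexeme
    return $current_offset if $outcome eq 'rejected';            # soft failure: unchanged
    return;                                                      # hard failure
}
```

For a span with offset 10 and length 5, success leaves the current offset at 15, while a pre-lexeme event leaves it at 10.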

G1 location

If external scanning completion succeeds, and a pre-lexeme event does not occur, a token is read and Marpa::R3 advances the current G1 location by one. The token just read will start at the previous G1 location and end at the new current G1 location. The G1 location of the token will be considered to be the new current G1 location.

If external scanning completion succeeds, and a pre-lexeme event does occur, no token is read. The current G1 location will remain where it was before external scanning.

If external scanning completion has a soft failure, no token is read. The current G1 location will remain where it was before external scanning.

If external scanning completion has a hard failure, no guarantee is made about the current G1 location. Marpa::R3 will attempt to leave it valid and unchanged.

Event handlers

Parse events may occur during external scanning completion. The event handlers will see a G1 location and an event location as described above.

Mixing internal and external scanning

External scanning can be mixed with internal scanning to get the best of both. An application can terminate the internal scanning of the read() method early, if it has defined a parse event and that parse event triggers. Afterwards, internal scanning can be resumed with the resume() method. For details, see the description of resume(), as well as the separate document for events.

High-level mutators

Most applications doing external scanning will want to use the high-level methods. The $recce->lexeme_read_string() method allows the reading of a string, where the string is both the literal equivalent of the input, and its value for semantics. The $recce->lexeme_read_literal() method is similar, but the string is specified as a block span.

The $recce->lexeme_read_block() method is the most general of the high-level external scanning methods. lexeme_read_block() allows the app to specify the literal equivalent and the value separately.

lexeme_read_block()

    my $ok = $recce->lexeme_read_block($symbol_name, $value,
        $main_block, $start_of_lexeme, $lexeme_length);
    die qq{Parser rejected token "$symbol_name" at position $start_of_lexeme, before "},
      $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
          if not defined $ok;

lexeme_read_block() is the basic method for external scanning. It takes five arguments, only the first of which is required. Call them, in order, $symbol_name, $value, $block_id, $offset, and $length.

The $symbol_name argument is the name of the symbol to scan. The $value argument will be the value of the token. If $value is missing or undefined, the value of the token will be a Perl undef. The $block_id, $offset, and $length arguments are the literal equivalent of the token, as a block span. lexeme_read_block() is an external scanning completion method and details of its behavior are as described above.

Return values: On success, lexeme_read_block() returns the new current offset. Soft failure occurs if and only if the token was rejected. On soft failure, lexeme_read_block() returns a Perl undef. Other failures are thrown as exceptions.
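The three outcomes described above (success, soft failure, hard failure) can be told apart with a small wrapper. The following is a minimal sketch, assuming $recce is a Marpa::R3 recognizer. read_token() and the keys of the hash it returns are hypothetical names used for illustration only.

```perl
use strict;
use warnings;

# Sketch: classify the outcome of lexeme_read_block().
# The arguments after $recce are passed through unchanged:
# $symbol_name, $value, $block_id, $offset, $length.
sub read_token {
    my ( $recce, @token_args ) = @_;
    my $new_offset;

    # The trailing 1 lets us tell an exception (hard failure) apart
    # from an undef return value (soft failure).
    my $ok = eval { $new_offset = $recce->lexeme_read_block(@token_args); 1 };
    return { status => 'hard failure', error => $@ } if not $ok;
    return { status => 'rejected' } if not defined $new_offset;
    return { status => 'ok', offset => $new_offset };
}
```

A caller can then branch on the status field, for example trying a different symbol name when the status is 'rejected' and reporting the error text on a hard failure.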

    $recce->lexeme_read_block( $symbol_name, $value, $block_id, $offset, $length )

is roughly equivalent to

    sub read_block_equivalent {
        my ( $recce, $symbol_name, $value, $block_id, $offset, $length ) = @_;
        return if not defined $recce->lexeme_alternative( $symbol_name, $value );
        return $recce->lexeme_complete( $block_id, $offset, $length );
    }

lexeme_read_literal()

    my $ok = $recce->lexeme_read_literal($symbol_name, $main_block, $start_of_lexeme, $lexeme_length);
    die qq{Parser rejected token "$symbol_name" at position $start_of_lexeme, before "},
       $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
           if not defined $ok;

lexeme_read_literal() takes four arguments, only the first of which is required. Call them, in order, $symbol_name, $block_id, $offset, and $length. The $symbol_name argument is the name of the symbol to scan. The $block_id, $offset, and $length arguments are the literal equivalent of the token, as a block span. The value of the token will be the same as its literal equivalent. lexeme_read_literal() is an external scanning completion method and details of its behavior are as described above.

    $recce->lexeme_read_literal( $symbol_name, $block_id, $offset, $length )

is roughly equivalent to

    sub read_literal_equivalent_hi {
        my ( $recce, $symbol_name, $block_id, $offset, $length ) = @_;
        my $value = $recce->literal( $block_id, $offset, $length );
        return $recce->lexeme_read_block( $symbol_name, $value, $block_id, $offset, $length );
    }

In terms of low-level external scanning methods, the above is roughly equivalent to

    sub read_literal_equivalent_lo {
        my ( $recce, $symbol_name, $block_id, $offset, $length ) = @_;
        return if not defined $recce->lexeme_alternative_literal( $symbol_name );
        return $recce->lexeme_complete( $block_id, $offset, $length );
    }

lexeme_read_string()

    my $ok = $recce->lexeme_read_string( $symbol_name, $lexeme );
    die qq{Parser rejected token "$symbol_name" at position $start_of_lexeme, before "},
      $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
         if not defined $ok;

The lexeme_read_string() method takes two arguments, both required. Call them, in order, $symbol_name and $string. $symbol_name is the symbol name of a token to be read. $string is a string which becomes both the value of the token and its literal equivalent. lexeme_read_string() is an external scanning completion method and, with two important exceptions, the details of its behavior are as described above.

The first difference is that, on success, lexeme_read_string() creates a new input text block, using $string as its text. We'll call this block the "per-string block". The literal equivalent of the token will be the per-string block, starting at offset 0 and ending at eoblock.

The second difference is that, after a successful call to lexeme_read_string(), the per-string block does not become the new current block. The current block data after a call to lexeme_read_string() will be the same as it was before the call to lexeme_read_string().

For most purposes, then, the per-string block is invisible to the app that called lexeme_read_string(). Apps which trace or keep track of the details of the input text blocks may notice the additional block. Also, event handlers which trigger during the lexeme_read_string() method will see the per-string block.

Return values: On success, lexeme_read_string() returns the new current offset. Soft failure occurs if and only if the token was rejected. On soft failure, lexeme_read_string() returns a Perl undef. Other failures are thrown as exceptions.

    $recce->lexeme_read_string($symbol, $string)

is roughly equivalent to

    sub read_string_equivalent_hi {
        my ( $recce, $symbol_name, $string ) = @_;
        my ($save_block) = $recce->block_progress();
        my $new_block = $recce->block_new( \$string );
        my $return_value = $recce->lexeme_read_literal( $symbol_name, $new_block );
        $recce->block_set($save_block);
        return $return_value;
    }

lexeme_read_string() is not designed for very long values of $string. For efficiency with long strings, use the equivalent in terms of lexeme_read_literal(), as just shown. lexeme_read_literal() sets the value of the token to a span of an input text block, while lexeme_read_string() sets the value of the token to a string. Marpa::R3 optimizes token values when they are literals in its input text blocks.

In terms of low-level external scanning methods, lexeme_read_string() is roughly equivalent to

    sub read_string_equivalent_lo {
        my ($recce, $symbol_name, $string) = @_;
        my ($save_block) = $recce->block_progress();
        my $lexeme_block = $recce->block_new( \$string );
        return if not defined $recce->lexeme_alternative( $symbol_name, $string );
        my $return_value = $recce->lexeme_complete( $lexeme_block );
        $recce->block_set($save_block);
        return $return_value;
    }

The example just above shows the value of the token being set to a string in the lexeme_alternative() call. As mentioned, this is not efficient for very long strings.

Low-level mutators

This section documents the low-level external scanning methods. The low-level mutators allow some advanced techniques, notably the reading of ambiguous tokens. Most applications will want to use the high-level methods instead.

lexeme_alternative()

    my $ok = $recce->lexeme_alternative( $symbol_name, $value );
    if (not defined $ok) {
        my $literal = $recce->literal( $block_id, $offset, $length );
        die qq{Parser rejected symbol named "$symbol_name" },
            qq{at position $offset, before lexeme "$literal"};
    }

lexeme_alternative() is one of the low-level methods of the external scanner. Most applications will prefer the simpler lexeme_read_string(), lexeme_read_literal() and lexeme_read_block() methods.

lexeme_alternative() takes up to two arguments. Call them, in order, $symbol_name and $value. $symbol_name is required and must be the name of a symbol to be read at the current location. $value is optional, and specifies the value of the symbol. If $value is missing, the value of the symbol will be a Perl undef.

The lexeme_alternative() method is a non-completion method -- it adds to the list of accepted tokens. To be read by the parser, this list of accepted tokens must be completed by a later call to the lexeme_complete() method. By making two or more calls of the lexeme_alternative() method before the next call of lexeme_complete(), an app can read an ambiguous token. (An ambiguous token is one which can be more than one symbol.)

When the recognizer starts, the list of accepted tokens is empty. The list of accepted tokens is cleared whenever lexeme_complete() is called. It is a fatal error if a high-level scanning method is called while the list of accepted tokens is non-empty.

lexeme_alternative() has a soft failure if it rejects $symbol_name. All other failures are hard failures.

Return values: Returns undef if the token was rejected. On success, returns a value reserved for future use. The value on success will not necessarily be a Perl true, so that apps testing for rejection must test for a Perl undef explicitly. All other failures are thrown as exceptions.
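Reading an ambiguous token with the low-level methods might look like the following sketch. It assumes $recce is a Marpa::R3 recognizer and uses only the lexeme_alternative() and lexeme_complete() calls described in this document. The wrapper read_ambiguous_token() and the shape of its $alternatives argument (a reference to an array of [ $symbol_name, $value ] pairs) are hypothetical.

```perl
use strict;
use warnings;

# Sketch: offer several alternatives for one literal, then complete.
# Returns the result of lexeme_complete(), or undef if every
# alternative was rejected.
sub read_ambiguous_token {
    my ( $recce, $alternatives, $block_id, $offset, $length ) = @_;
    my $accepted = 0;
    for my $alternative ( @{$alternatives} ) {
        my ( $symbol_name, $value ) = @{$alternative};

        # Test for undef explicitly: the value returned on success
        # is not necessarily a Perl true.
        $accepted++
            if defined $recce->lexeme_alternative( $symbol_name, $value );
    }

    # Calling lexeme_complete() with an empty list of accepted tokens
    # would be a hard failure, so stop here instead.
    return if $accepted == 0;
    return $recce->lexeme_complete( $block_id, $offset, $length );
}
```

All accepted alternatives share the single block span passed to lexeme_complete(), which matches the requirement that ambiguous tokens have the same literal equivalent.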

lexeme_alternative_literal()

    my $ok = $recce->lexeme_alternative_literal($symbol_name);
    die qq{Parser rejected token "$symbol_name" at position $start_of_lexeme, before "},
        $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
            if not defined $ok;

lexeme_alternative_literal() is one of the low-level methods of the external scanner. Most applications will prefer the simpler lexeme_read_string(), lexeme_read_literal() and lexeme_read_block() methods.

lexeme_alternative_literal() takes only one, required, argument. lexeme_alternative_literal() and lexeme_alternative() differ from each other only in their arguments, and in how they set the value of the token. For a token read by lexeme_alternative_literal(), the value of the token will be the same as its literal equivalent. This literal equivalent will be set by the next call to lexeme_complete(). Otherwise, lexeme_alternative_literal() behaves in exactly the same way as lexeme_alternative().

lexeme_complete()

    my $new_offset = $recce->lexeme_complete( $block_id, $offset, $length );

lexeme_complete() is one of the low-level methods of the external scanner. Most applications will prefer the simpler lexeme_read_string(), lexeme_read_literal() and lexeme_read_block() methods.

Use of the low-level methods allows the reading of ambiguous tokens. lexeme_complete() completes the reading of a set of tokens specified by one or more calls of the lexeme_alternative() method.

The lexeme_complete() method accepts three optional arguments. Call them, in order, $block_id, $offset and $length. These are treated as a block span. The block span is used to set the literal equivalent for the set of alternative tokens completed by the lexeme_complete() call.

lexeme_complete() is an external scanning completion method and, with one important difference, the details of its behavior are as described above. The difference is that token rejection never occurs in lexeme_complete(). lexeme_complete() relies on the app to have built a list of accepted tokens using the lexeme_alternative() or lexeme_alternative_literal() calls.

It is a hard failure if lexeme_complete() is called but the list of tokens accepted by lexeme_alternative() or lexeme_alternative_literal() methods is empty. All failures in lexeme_complete() are hard failures.

Return values: On success, lexeme_complete() returns the new current location. Failures are always thrown as exceptions.

COPYRIGHT AND LICENSE

  Marpa::R3 is Copyright (C) 2017, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.