pod/Ext_Scan.pod - metacpan.org

# Marpa::R3 is Copyright (C) 2017, Jeffrey Kegler.
#
# This module is free software; you can redistribute it and/or modify it
# under the same terms as Perl 5.10.1. For more details, see the full text
# of the licenses in the directory LICENSES.
#
# This program is distributed in the hope that it will be
# useful, but it is provided "as is" and without any express
# or implied warranties. For details, see the full text of
# of the licenses in the directory LICENSES.

=head1 Name

Marpa::R3::Ext_Scan - External scanning

=head1 Synopsis

=for Marpa::R3::Display
name: recognizer read/resume synopsis
partial: 1
normalize-whitespace: 1

    my @pause_location;
    my $recce = Marpa::R3::Recognizer->new(
        {
            grammar        => $parser->{grammar},
            event_handlers => {
                'before lstring' => sub () {
                    ( undef, undef, undef, @pause_location ) = @_;
                    'pause';
                },

            }
        }
    );
    my $length = length $string;
    for (
        my $pos = $recce->read( \$string );
        $pos < $length;
        $pos = $recce->resume()
        )
    {
        my $start = $pause_location[1];
        my $length = $pause_location[2];
        my $value = substr $string, $start + 1, $length - 2;
        $value = decode_string($value) if -1 != index $value, '\\';
        $recce->lexeme_read_block( 'lstring', $value, undef, $start, $length ) // die;
    } ## end for ( my $pos = $recce->read( \$string ); $pos < $length...)
    my $per_parse_arg = bless {}, 'MarpaX::JSON::Actions';
    my $value_ref = $recce->value($per_parse_arg);
    return ${$value_ref};

=for Marpa::R3::Display::End

=head1 About this document

This page describes B<external scanning>.
By default, Marpa::R3 scans based on the L0 grammar in
its DSL.
This DSL-driven scanning is called B<internal scanning>.

But
many applications find it useful or necessary to
do their own scanning in procedural code.
In Marpa::R3 this is called B<external scanning>.
External scanning can be used as a replacement for internal scanning.
Marpa::R3 also allows application to switch back and
forth between internal and external scanning.

=head1 Tokens

In external scanning, the app controls tokenization directly.
External scanning might also be called one-by-one scanning because,
in external scanning, the app feeds tokens to Marpa::R3 one-by-one.
This differs from internal scanning -- in internal scanning
Marpa::R3 tokenizes a string for the app.

Every token must have three things associated with it:

=over 4

=item 1.

A B<symbol name>, which is required.
The symbol name must be the name of a lexeme in
both the L0 and G1 grammars.
The symbol name
tells the parser which symbol represents this token to
the Marpa semantics.
The symbol name, in other words,
connects the token to the grammar.

=item 2.

A B<symbol value> or B<value>, which may be undefined.
The
value of the token is also seen by the semantics.

=item 3.

A B<literal equivalent>, which is required
and must be a span in the input.
The literal equivalent of a token is not directly visible
to the semantics,
although in Marpa::R3 it can always be accessed,
if desired.
The literal equivalent is needed for
the messages produced by
tracing, debugging, error reporting, etc.
If more than one token is accepted at a G1 location --
which can happen if tokens are ambiguous --
all of the tokens must have the same literal equivalent.

=back

=head1 Completion and non-completion methods

If a method might complete external scanning at a G1 location,
that method is called an external scanning completion method,
or just a B<completion method>.
Any other external scanning method is called a non-completion
method.

There are only two non-completion external scanning methods:
C<lexeme_alternative()>
and
C<lexeme_alternative_literal()>.
These are low-level methods which prepare a list of tokens
for the C<lexeme_complete()> completion method.

=head1 Completion method details

The external scanning completion methods
have almost all of their behaviors in common.
For convenience, therefore, the usual behaviors
of the completion methods are described in the section,
and exceptions to these behaviors are noted
in the descriptions of the individual methods.

External scanning completion can succeed or fail.
If external scanning completion fails, the failure may be hard
or soft.
The only soft failure that can occur in external scanning completion
is the rejection of a token.

=head2 Block location

Every external scanning completion must have a valid block span,
unless that completion results in a hard failure.
How that valid block span is specified varies by method.
For the purposes of this section, let that
block span be C<< <$block_id, $offset, $length> >>.
Also for the purposes of this section,
we will define B<eolexeme>,
or "end of lexeme",
as C<$offset> + C<$length>.

=over 4

=item *

If external scanning completion succeeds and no event occurs,
the current block is set to C<$block_id>.
The current offset is set to eolexeme.
The current eoread of the current block will
not be changed.

=item *

If external scanning completion succeeds and an event occurs,
the current block is set to C<$block_id>.
The current offset is set to the event location.
The event location will be the same as eolexeme,
unless the event was a pre-lexeme event.
The current eoread of the current block will
not be changed.

=item *

This is a special case of the immediately preceding case.
If external scanning completion succeeds and a pre-lexeme event occurs,
the current block is set to C<$block_id>.
The current offset is set to the event location,
which will be the same as C<$offset>.
The current eoread of the current block will
not be changed.

=item *

If an external scanning completion method rejects a token,
then the external scanning completion results in a soft failure.
In this case the current block data
remains unchanged.

=item *

Any failure in external scanning completion,
other than token rejection,
is a hard failure.
In the case of a hard failure,
no guarantee is made about the current block data.
Marpa::R3 will attempt to leave the current block data valid
and pointing to an "error location" -- that is, a location
as relevant as possible to the error.

=back

=head2 G1 location

If external scanning completion succeeds,
and a pre-lexeme event does not occur,
a token is read and
Marpa::R3 advances
the current G1 location by one.
The token just read will start at the previous G1 location
and end at the new current G1 location.
The G1 location of the token will be considered to be
the new current G1 location.

If external scanning completion succeeds,
and a pre-lexeme event does occur,
no token is read.
The current G1 location will remain where it was
before external scanning.

If external scanning completion has a soft failure,
no token is read.
The current G1 location will remain where it was
before external scanning.

If external scanning completion has a hard failure,
no guarantee is made about the current G1 location.
Marpa::R3 will attempt to leave it valid and unchanged.

=head2 Event handlers

L<Parse events|Marpa::R3::Event>
may occur during external scanning completion.
The event handlers will see
a G1 location and an event location as described
above.

=head1 Mixing internal and external scanning

External scanning can be mixed with internal scanning
to get the best of both.
An application can
terminate the internal scanning of
the C<read()> method early, if it has defined an parse event
and that parse event triggers.
Afterwards,
internal scanning can be resumed with the L<resume()|/"resume()"> method.
For details,
see L<the description of C<resume()>|Marpa::R3::Recognizer/"resume()">,
as well as
L<the separate document for events|Marpa::R3::Event>.

=head1 High-level mutators

Most applications doing external scanning
will want to use the high-level methods.
The L<C<< $recce->lexeme_read_string() >> method|/"lexeme_read_string()">
allows the reading of a string, where the string is both
the literal equivalent of the input, and its value for semantics.
The L<C<< $recce->lexeme_read_literal() >> method|/"lexeme_read_literal()">
is similar, but the string is specified as a block span.

The L<C<< $recce->lexeme_read_block() >> method|/"lexeme_read_block()">
is the most general of the high-level external scanning methods.
C<lexeme_read_block()> allows the app to specify the literal equivalent
and the value separately.

=head2 lexeme_read_block()

=for Marpa::R3::Display
name: recognizer lexeme_read_block() synopsis
partial: 1
normalize-whitespace: 1

    my $ok = $recce->lexeme_read_block($symbol_name, $value,
        $main_block, $start_of_lexeme, $lexeme_length);
    die qq{Parser rejected token "$long_name" at position $start_of_lexeme, before "},
      $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
          if not defined $ok;

=for Marpa::R3::Display::End

C<lexeme_read_block()> is the basic method for external scanning.
It takes five arguments, only the first of which is required.
Call them, in order,
C<$symbol_name>,
C<$value>,
C<$block_id>,
C<$offset>,
and
C<$length>.

The C<$symbol_name> argument is the name of the symbol
to scan.
The C<$value> argument will be the value of the token.
If C<$value> is missing or undefined,
the value of the token will be a Perl C<undef>.
The C<$block_id>, C<$offset>, and C<$length> arguments
are the literal equivalent of the token, as a 
L<block span|/"Block spans">.
C<lexeme_read_block()> is an external scanning completion
method and details of its behavior are
L<as described above|"Completion method details">.

B<Return values>:
On success, C<lexeme_read_block()> returns the new current offset.
Soft failure occurs if and only if
the token was rejected.
On soft failure,
C<lexeme_read_block()> returns a Perl C<undef>.
Other failures are thrown as exceptions.

=for Marpa::R3::Display
ignore: 1

    $recce->lexeme_read_block($symbol, $start, $length, $value)

=for Marpa::R3::Display::End

is roughly equivalent to

=for Marpa::R3::Display
name: recognizer lexeme_read_block() low-level equivalent
normalize-whitespace: 1

    sub read_block_equivalent {
        my ( $recce, $symbol_name, $value, $block_id, $offset, $length ) = @_;
        return if not defined $recce->lexeme_alternative( $symbol_name, $value );
        return $recce->lexeme_complete( $block_id, $offset, $length );
    }

=for Marpa::R3::Display::End

=head2 lexeme_read_literal()

=for Marpa::R3::Display
name: recognizer lexeme_read_literal() synopsis
partial: 1
normalize-whitespace: 1

    my $ok = $recce->lexeme_read_literal($symbol_name, $main_block, $start_of_lexeme, $lexeme_length);
    die qq{Parser rejected token "$long_name" at position $start_of_lexeme, before "},
       $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
           if not defined $ok;

=for Marpa::R3::Display::End

C<lexeme_read_literal()>
takes four arguments, only the first of which is required.
Call them, in order,
C<$symbol_name>,
C<$block_id>,
C<$offset>,
and
C<$length>.
The C<$symbol_name> argument is the name of the symbol
to scan.
The C<$block_id>, C<$offset>, and C<$length> arguments
are the literal equivalent of the token, as a 
L<block span|/"Block spans">.
The value of the token will be the same
as its literal equivalent.
C<lexeme_read_literal()> is an external scanning completion
method and details of its behavior are
L<as described above|"Completion method details">.

=for Marpa::R3::Display
ignore: 1

    $recce->lexeme_read_literal($symbol, $start, $length, $value)

=for Marpa::R3::Display::End

is roughly equivalent to

=for Marpa::R3::Display
name: recognizer lexeme_read_literal() high-level equivalent
normalize-whitespace: 1

    sub read_literal_equivalent_hi {
        my ( $recce, $symbol_name, $block_id, $offset, $length ) = @_;
        my $value = $recce->literal( $block_id, $offset, $length );
        return $recce->lexeme_read_block( $symbol_name, $value, $block_id, $offset, $length );
    }

=for Marpa::R3::Display::End

In terms of low-level external scanning methods,
the above is roughly equivalent to

=for Marpa::R3::Display
name: recognizer lexeme_read_literal() low-level equivalent
normalize-whitespace: 1

    sub read_literal_equivalent_lo {
        my ( $recce, $symbol_name, $block_id, $offset, $length ) = @_;
        return if not defined $recce->lexeme_alternative_literal( $symbol_name );
        return $recce->lexeme_complete( $block_id, $offset, $length );
    }

=for Marpa::R3::Display::End

=head2 lexeme_read_string()

=for Marpa::R3::Display
name: recognizer lexeme_read_string() synopsis
partial: 1
normalize-whitespace: 1

    my $ok = $recce->lexeme_read_string( $symbol_name, $lexeme );
    die qq{Parser rejected token "$long_name" at position $start_of_lexeme, before "},
      $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
         if not defined $ok;

=for Marpa::R3::Display::End

The C<lexeme_read_string()> method takes 2 arguments, both required.
Call them, in order, C<$symbol_name> and C<$string>.
C<$symbol_name>
is the symbol name of a token
to be read.
C<$string>
is a string which becomes both the value of the token
and its literal equivalent.
C<lexeme_read_literal()> is an external scanning completion
method and, with two important exceptions,
the details of its behavior are
L<as described above|"Completion method details">.

The first difference is that,
on success,
C<lexeme_read_string()>
creates a new input text block,
using C<$string> as its text.
We'll call this block the "per-string block".
The literal equivalent of the token
will be the per-string block, starting at offset 0 and ending at eoblock.

The second difference is that,
after a successful call to C<lexeme_read_string()>,
the per-string block does B<not> become the new current block.
The current block data after a call to
C<lexeme_read_string()>
will be the same as it was before
the call to C<lexeme_read_string()>.

For most purposes, then,
the per-string block
is invisible to the
app that called
C<lexeme_read_string()>.
Apps which trace or keep track of the details of the input text blocks
may notice the additional block.
Also, event handlers
which
trigger during the C<lexeme_read_string()> method
will see the per-string block.

B<Return values>:
On success, C<lexeme_read_block()> returns the new current offset.
Soft failure occurs if and only if
the token was rejected.
On soft failure,
C<lexeme_read_block()> returns a Perl C<undef>.
Other failures are thrown as exceptions.

=for Marpa::R3::Display
ignore: 1

    $recce->lexeme_read_string($symbol, $string)

=for Marpa::R3::Display::End

is roughly equivalent to

=for Marpa::R3::Display
name: recognizer lexeme_read_string() high-level equivalent
normalize-whitespace: 1

    sub read_string_equivalent_hi {
        my ( $recce, $symbol_name, $string ) = @_;
        my ($save_block) = $recce->block_progress();
        my $new_block = $recce->block_new( \$string );
        my $return_value = $recce->lexeme_read_literal( $symbol_name, $new_block );
        $recce->block_set($save_block);
        return $return_value;
    }

=for Marpa::R3::Display::End

C<lexeme_read_string()> is not designed for
very long values of C<$string>.
For efficiency with long strings,
use the equivalent in terms of C<lexeme_read_literal()>, as just shown.
C<lexeme_read_literal()> sets the value of the token to a span
of an input text block,
while C<lexeme_read_string()> sets the value of the token to a string.
Marpa::R3 optimizes token values when they are literals in its
input text blocks.

In terms of low-level external scanning methods,
C<lexeme_read_string()> is roughly equivalent to

=for Marpa::R3::Display
name: recognizer lexeme_read_string() low-level equivalent
normalize-whitespace: 1

    sub read_string_equivalent_lo {
        my ($recce, $symbol_name, $string) = @_;
        my ($save_block) = $recce->block_progress();
        my $lexeme_block = $recce->block_new( \$string );
        return if not defined $recce->lexeme_alternative( $symbol_name, $string );
        my $return_value = $recce->lexeme_complete( $lexeme_block );
        $recce->block_set($save_block);
        return $return_value;
    }

=for Marpa::R3::Display::End

The example just above shows the value of the token being set to a string
in the C<lexeme_alternative()> call.
As mentioned, this is not efficient for very long strings.

=head1 Low-level mutators

This section documents the low-level external scanning methods.
The low-level mutators allows some advanced techniques,
notably the reading of ambiguous tokens.
Most applications will want to use
L<the high-level methods|High-level mutators>
instead.

=head2 lexeme_alternative()

=for Marpa::R3::Display
name: recognizer lexeme_alternative() synopsis
partial: 1
normalize-whitespace: 1

    my $ok = $recce->lexeme_alternative( $symbol_name, $value );
    if (not defined $ok) {
        my $literal = $recce->literal( $block_id, $offset, $length );
        die qq{Parser rejected symbol named "$symbol_name" },
            qq{at position $offset, before lexeme "$literal"};
    }

=for Marpa::R3::Display::End

C<lexeme_alternative()> is one of the low-level methods of the external scanner.
Most applications
will prefer the simpler
L<C<lexeme_read_string()>|/"lexeme_read_string()">,
L<C<lexeme_read_literal()>|/"lexeme_read_literal()">
and L<C<lexeme_read_block()>|/"lexeme_read_block()"> methods.

C<lexeme_alternative()> takes up to two arguments.
Call them, in order, C<$symbol_name> and C<$value>.
C<$symbol_name> is required
and must be the name of a symbol to be read
at the current location.
C<$value> is optional,
and specifies the value of the symbol.
If C<$value> is missing, the value of the symbol
will be a Perl C<undef>.

The C<lexeme_alternative()> method is a non-completion method -- it adds to
the list of accepted tokens.
To be read by the parser, this list of accepted tokens
must be completed by a later call
to the C<lexeme_complete()> method.
By making two or more calls
of the C<lexeme_alternative()> method
before the next call of C<lexeme_complete()>,
an app can read an
ambiguous token.
(An ambiguous token is one which can be more than one symbol.)

When the recognizer starts,
the list of accepted tokens is empty.
The list of accepted tokens is cleared whenever C<lexeme_complete()>
is called.
It is a fatal error
if a high level scanning method is called while the list of accepted
tokens is non-empty,

C<lexeme_alternative()> has a soft failure if it rejects 
C<$symbol_name>.
All other failures are hard failures.

B<Return values>:
Returns C<undef> if the token was rejected.
On success, returns a value reserved for future use.
The value on success will not necessarily be a Perl true,
so that apps testing for rejection must test for a Perl C<undef> explicitly.
Failures are thrown as exceptions.

=head2 lexeme_alternative_literal()

=for Marpa::R3::Display
name: recognizer lexeme_alternative_literal() synopsis
partial: 1
normalize-whitespace: 1

    my $ok = $recce->lexeme_alternative_literal($symbol_name);
    die qq{Parser rejected token "$long_name" at position $start_of_lexeme, before "},
        $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
            if not defined $ok;

=for Marpa::R3::Display::End

C<lexeme_alternative_literal()> is one of the low-level methods of the external scanner.
Most applications
will prefer the simpler
L<C<lexeme_read_string()>|/"lexeme_read_string()">,
L<C<lexeme_read_literal()>|/"lexeme_read_literal()">
and L<C<lexeme_read_block()>|/"lexeme_read_block()"> methods.

C<lexeme_alternative_literal()> takes only one, required, argument.
C<lexeme_alternative_literal()> and
C<lexeme_alternative()> differ from each other
only in their arguments,
and in how they set the value of the token.
For a token read by C<lexeme_alternative_literal()>,
the value of the token will be the same as its literal equivalent.
This literal equivalent
will be set by the next call to 
C<lexeme_complete()>.
Otherwise, C<lexeme_alternative_literal()> behaves in
exactly the same way as
C<lexeme_alternative()>.

=head2 lexeme_complete()

=for Marpa::R3::Display
name: recognizer lexeme_complete() synopsis
partial: 1
normalize-whitespace: 1

    my $new_offset = $recce->lexeme_complete( $block_id, $offset, $length );

=for Marpa::R3::Display::End

C<lexeme_complete()> is one of the low-level methods of the external scanner.
Most applications
will prefer the simpler
L<C<lexeme_read_string()>|/"lexeme_read_string()">,
L<C<lexeme_read_literal()>|/"lexeme_read_literal()">
and L<C<lexeme_read_block()>|/"lexeme_read_block()"> methods.

Use of the low-level methods allows
the reading of ambiguous tokens.
C<lexeme_complete()>
completes the reading of a set of tokens specified by
one or more calls of the L<C<lexeme_alternative()>|/"lexeme_alternative()">
method.

The C<lexeme_complete()> method
accepts three optional arguments.
Call them, in order,
C<$block_id>, C<$offset> and C<$length>.
These are treated as a
L<block span|Marpa::R3::Recognizer/"Block spans">.
The block span is used to set the literal equivalent for the
set of alternative tokens completed by the
C<lexeme_complete()> call.

C<lexeme_read_literal()> is an external scanning completion
method and, with one important difference,
the details of its behavior are
L<as described above|"Completion method details">.
The difference is that token rejection never occurs
in C<lexeme_complete()>.
C<lexeme_complete()> relies on the app to have built a list
of accepted tokens using the
C<lexeme_alternative()>
or
C<lexeme_alternative_literal()>
calls.

It is a hard failure if C<lexeme_complete()> is called
but the list of tokens accepted by
C<lexeme_alternative()>
or
C<lexeme_alternative_literal()>
methods is empty.
All failures in C<lexeme_complete()> are hard failures.

B<Return values:>
On success, C<lexeme_complete()>
returns the new current location.
Failure is always thrown.

=head1 COPYRIGHT AND LICENSE

=for Marpa::R3::Display
ignore: 1

  Marpa::R3 is Copyright (C) 2017, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.

=for Marpa::R3::Display::End

=cut

# vim: expandtab shiftwidth=4:
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)