# Marpa::R3 is Copyright (C) 2018, Jeffrey Kegler.
#
# This module is free software; you can redistribute it and/or modify it
# under the same terms as Perl 5.10.1. For more details, see the full text
# of the licenses in the directory LICENSES.
#
# This program is distributed in the hope that it will be
# useful, but it is provided "as is" and without any express
# or implied warranties. For details, see the full text
# of the licenses in the directory LICENSES.

=head1 Name

Marpa::R3::Recognizer - Recognizer objects

=head1 Synopsis

=for Marpa::R3::Display
name: Scanless recognizer synopsis
partial: 1
normalize-whitespace: 1

    my $recce = Marpa::R3::Recognizer->new( { grammar => $grammar } );
    my $self = bless { grammar => $grammar }, 'My_Actions';
    $self->{recce} = $recce;

    if ( not defined eval { $recce->read($p_input_string); 1 }
        )
    {
        ## Add last expression found, and rethrow
        my $eval_error = $EVAL_ERROR;
        chomp $eval_error;
        die $self->show_last_expression(), "\n", $eval_error, "\n";
    } ## end if ( not defined eval { $event_count = $recce->read...})

    my $value_ref = $recce->value( $self );
    if ( not defined $value_ref ) {
        die $self->show_last_expression(), "\n",
            "No parse was found, after reading the entire input\n";
    }

=for Marpa::R3::Display::End

=for Marpa::R3::Display
name: Scanless recognizer semantics
partial: 1
normalize-whitespace: 1

    package My_Actions;
    sub do_parens    { return $_[1]->[1] }
    sub do_add       { return $_[1]->[0] + $_[1]->[2] }
    sub do_subtract  { return $_[1]->[0] - $_[1]->[2] }
    sub do_multiply  { return $_[1]->[0] * $_[1]->[2] }
    sub do_divide    { return $_[1]->[0] / $_[1]->[2] }
    sub do_pow       { return $_[1]->[0]**$_[1]->[2] }
    sub do_first_arg { return $_[1]->[0] }
    sub do_script    { return join q{ }, @{$_[1]} }

=for Marpa::R3::Display::End

=head1 About this document

This page is the reference document for Marpa::R3's recognizer objects.
The Marpa::R3 DSL contains its own B<internal scanner>,
integrated into its syntax.
In this document, use of the
internal scanner is called
B<internal scanning>.

Many applications find it useful or necessary to
do their own scanning,
and Marpa::R3 allows this.
When an application
bypasses the internal scanner
and does its own scanning,
this document calls it
B<external scanning>.
External scanning can be used instead of internal scanning,
but Marpa::R3 also allows an application to switch back and
forth between internal and external scanning.
External scanning is described in
L<its own document|Marpa::R3::Ext_Scan>.

=head1 Block position

If an app uses the basic internal scanning method,
L<C<read()>|/"read()">,
and accepts its defaults,
the Marpa::R3 input
is very simple.
The input is provided as a pointer to a string.
Scanning starts at the first
position and continues to the last character of the string.

Many applications fit this model.
But many common practical applications
want to do one or more of the following:

=over 4

=item *
Stop reading the current file,
read from another one,
then resume reading the current file.
Preprocessors, including the one for the
C language, commonly need to do this.

=item *
Use "Include" files as just described,
but also nest them.
The C preprocessor does this.

=item *
Allow "here-docs".
Here-docs are similar to "include" files, except
they are not separate files,
but distant portions of the current file.
Perl does this.

=item *
Implement "if/else" preprocessor logic, skipping sections
of the current input.
The C preprocessor does this.

=item *
Allow externally scanned symbols that do not refer
to any text in the main input string.
For reasonable error messages to be generated,
some sort of literal equivalent of the symbol needs
to be available.

=back

To implement the above features,
an app must be able to switch
among multiple input strings,
and be able to jump forward
and backward within those input strings.
Marpa::R3's input model has these abilities.

Marpa::R3 allows multiple input strings.
In these documents, these input strings are
called B<input text blocks> or,
more often,
B<blocks>.
Marpa::R3 will convert all blocks into Unicode.
Characters in Marpa::R3 are
Unicode codepoints.
Numbering of characters is 0-based.

Reading is always done from the
B<current input text block>,
or
B<current block>.
When the recognizer is created, there is
no 
current input text block.
The L<C<read()>|/"read()">
method sets the current block implicitly.
Other methods allow the app
to create blocks
and to set the current block explicitly.

Location in the input is a duple: 
C<< <block_id, block_offset> >>,
where C<block_id> is a B<block ID>,
and C<block_offset> is a B<block offset>.
Each such duple uniquely identifies a
B<block location>.

A block ID uniquely identifies a block.
A block ID happens to be an integer,
but it should be treated as opaque.

The block offset is a 0-based integer.
By default, the next character read is
the character at the current block offset
of the current block.
When context makes it clear,
the block offset is called
simply the B<offset>.
All blocks, whether current or not,
maintain a B<current offset>.

An offset of I<L>,
where I<L> is the length of the block under
discussion,
is called
the B<end of the physical block>,
B<end of block>,
or B<eoblock>.
Eoblock represents the offset immediately after
the last character of the physical block.

In any block,
its eoblock is
the highest valid block offset.
When the current offset of a block is equal to its eoblock,
that block has been read all
the way to the end,
and there are no more characters to be read.
For
an app to read more characters when the current offset
of the current block
is at eoblock,
either the app must switch blocks,
or the app must move the block offset of the current block.

Every block maintains
a B<current end of read>.
The B<current end of read> is also
referred to as the block's B<end of read>
or B<eoread>.
Reading of a block ends when the next character read
would be the one at eoread.
In effect, eoread is a temporary eoblock.
It is often used
to tell the recognizer to read
only a portion of a block.

When a block is initialized, 
its current offset is 0 and
its eoread is equal to its eoblock.

In these documents,
unless stated otherwise,
the bare terms B<location>
and B<position> will mean 
block position as described in this section.
When these documents refer to the B<current block data>,
they will mean these 3 data items:

=over 4

=item *

the recognizer's current block setting;

=item *

the current offset of that current block; and

=item *

the current eoread of that current block.

=back

=head1 G1 location

In addition to block position,
Marpa::R3 apps sometimes have to deal with position
in terms of lexemes.
(Marpa::R3's lexemes are similar to what in other parsers
are called "tokens".)
Position in terms of lexemes
is called B<G1 Earley set location> or,
more often,
B<G1 location>.

G1 location is 0-based --
the first lexeme is at G1 location 0.
Every G1 location corresponds to at least one lexeme.
But Marpa::R3 allows ambiguous lexemes --
more than one lexeme can be read at a single G1 location.

G1 location can be ignored most of the time,
but it
is useful for tracing the G1 grammar
and for advanced techniques.

=head1 G1 fenceposts

Restating the definition of G1 location above
more pedantically,
the G1 locations are an ordered set
of sets of tokens.
G1 locations B<contain> sets of tokens.

It is sometimes useful to use an idea of locations
between G1 locations.
This is the classic "fencepost" issue -- sometimes
you want to count sections of fence,
and in other cases it is more convenient to count
fenceposts.

Let there be C<N> G1 locations, numbered from 0 to C<N-1>.
If C<I> is greater than 0 and less than C<N>,
G1 fencepost C<I> is before G1 location C<I>
and after G1 location C<I-1>.
G1 fencepost C<0> is before G1 location 0.
G1 fencepost C<N> is after G1 location C<N-1>.

The above implies that

=over 4

=item *

If there are C<N> G1 locations,
there are C<N+1> G1 fenceposts.

=item *

G1 location C<I> is always between
G1 fencepost C<I> and
G1 fencepost C<I+1>.

=back
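
As a minimal sketch of the two points above,
the fencepost arithmetic can be written in plain Perl.
The helper names C<fenceposts_of()> and C<fencepost_count()>
are hypothetical and are not part of the Marpa::R3 API.

    use strict;
    use warnings;

    # Hypothetical helpers, for illustration only.
    # G1 location $i lies between fencepost $i and fencepost $i + 1.
    sub fenceposts_of {
        my ($g1_location) = @_;
        return ( $g1_location, $g1_location + 1 );
    }

    # $n G1 locations are delimited by $n + 1 fenceposts.
    sub fencepost_count {
        my ($g1_location_count) = @_;
        return $g1_location_count + 1;
    }

    my ( $before, $after ) = fenceposts_of(2);
    print "G1 location 2 is between fenceposts $before and $after\n";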

=head1 Universes

In order to have a common language
for locations, spans, ranges and offsets,
the concept of a B<universe> of locations
is useful.
Each type of location has its own "universe".

For blocks, the universe is the ordered set of
characters in the input text block
under discussion,
and B<end of universe> is the last physical
character of the block.
This will be the character
at offset C<eoblock - 1>,
that is,
the character immediately before eoblock.
Eoread has no effect on end-of-universe.

For G1 locations, the universe is the ordered set of
G1 locations, and the B<end of universe> is
the last G1 location of that ordered set.

=head1 Negative offsets and positions

The negative offset -I<N> is the same as the
offset I<EOU> - I<N> + 1,
where I<EOU> is the positive
offset of the "end of universe" element.
Therefore, suppose that block 7 has a length of 42.
Since block offsets are 0-based,
the last character in block 7 will be at offset 41,
so that offset 41 is the "end of universe".
Therefore,

=over 4

=item *
Block position C<7, -1>
is the same as block position C<7, 41>.

=item *
Block position C<7, -2>
is the same as block position C<7, 40>.

=item *
Block position C<7, -42>
is the same as block position C<7, 0>.

=back
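
The conversion above can be sketched in a few lines of Perl.
The helper C<absolute_offset()> is a hypothetical name used
only for illustration and is not a Marpa::R3 method.

    use strict;
    use warnings;

    # Hypothetical helper, for illustration only.
    # $eou is the offset of the "end of universe" element.
    sub absolute_offset {
        my ( $offset, $eou ) = @_;
        return $offset if $offset >= 0;
        return $eou + $offset + 1;    # -N becomes EOU - N + 1
    }

    # Block 7 has length 42, so its end of universe is offset 41.
    my $eou = 42 - 1;
    for my $offset ( -1, -2, -42 ) {
        printf "offset %d is the same as offset %d\n",
            $offset, absolute_offset( $offset, $eou );
    }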

=head1 Ranges

In this document, we will refer to ordered subsets of
contiguous locations as either B<ranges> or B<spans>.
A range is an ordered set of contiguous locations
specified by start location and end location:
[I<S ... E>].
Alternatively, a range can be seen
as the string of locations
from offset I<S> to offset I<E>.
A range is inclusive, so that
it includes the location I<S>
as well as the location I<E>.
In a range of length 1,
I<S> will be equal to I<E>.

=head1 Spans

A span is an ordered set of contiguous locations
specified by start location and length:
[I<S, L>].
A span is a subset of a universe of locations,
as was
described above for ranges.

The range corresponding to the span
[I<S, L>] is [I<S ... (S+L)-1>].
The span corresponding to the range
[I<S ... E>] is [I<S, (E-S)+1>].
A span with
a negative length
is interpreted as if it was the range
with that same pair of values.

In general, spans are more convenient for programming.
But when fencepost issues are important,
spans require a lot of mental arithmetic,
and a discussion that uses ranges is easier to follow.

For the sake of some examples,
consider a 0-based input text block of length 100.

=over

=item *

The entire block is the range C<[0 ... -1]>
and the span C<[0, -1]>.

=item *

The first 42 characters of the input stream are the range C<[0 ... 41]>
and the span C<[0, 42]>.

=item *

The entire block, except for the last character,
is the range C<[0 ... -2]>
and the span C<[0, -2]>.

=item *

The substring consisting only of the last character is
the range C<[-1 ... -1]> and the span C<[-1, 1]>.

=item *

The substring which consists of the last 3 characters is
the range C<[-3 ... -1]>
and the span C<[-3,  3]>.

=item *

The substring which consists
of only the third-to-last character
is the range C<[-3 ... -3]>
and the span C<[-3, 1]>.

=back
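
The examples above can be checked with a small sketch
that converts between spans and ranges.
All of the helper names are hypothetical.
The sketch assumes that a negative span length is read as the
end of a range, as described above,
and it always returns non-negative offsets.

    use strict;
    use warnings;

    # Hypothetical helpers, for illustration only.
    # Convert a possibly negative offset to a non-negative one.
    # $eou is the offset of the last element of the universe.
    sub absolute_offset {
        my ( $offset, $eou ) = @_;
        return $offset >= 0 ? $offset : $eou + $offset + 1;
    }

    # Convert the span [$start, $length] to a range with
    # non-negative start and end offsets.
    sub span_to_range {
        my ( $start, $length, $eou ) = @_;
        my $s = absolute_offset( $start, $eou );

        # A negative length is read as the end of a range.
        return ( $s, absolute_offset( $length, $eou ) ) if $length < 0;
        return ( $s, $s + $length - 1 );
    }

    # Convert the range [$start ... $end] to a span with a
    # non-negative start and a non-negative length.
    sub range_to_span {
        my ( $start, $end, $eou ) = @_;
        my $s = absolute_offset( $start, $eou );
        my $e = absolute_offset( $end,   $eou );
        return ( $s, $e - $s + 1 );
    }

    my $eou = 100 - 1;    # a block of length 100
    for my $span ( [ 0, -1 ], [ 0, 42 ], [ -1, 1 ], [ -3, 3 ] ) {
        my ( $s, $e ) = span_to_range( @{$span}, $eou );
        print "span [$span->[0], $span->[1]] is range [$s ... $e]\n";
    }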

=head1 Block spans

A B<block span> is a span in terms of block location.
It is specified as a triple
C<< <$block_id, $offset, $length> >>.
Sometimes one or more elements of a block span are undefined.
(For example, this can happen when a block span is part of a set of arguments.)
If C<$block_id> is missing or undefined, the specified block is the current block.
If C<$offset> is missing or undefined,
the specified offset is the current offset of the specified block.
If C<$length> is missing or undefined,
the specified length is to the eoread of the
specified block.
If C<$length> is defined and positive,
but would result in a span that goes beyond eoread,
the specified length is truncated.
Block span is defined more formally
L<below|/"Block span details">.
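
The defaulting rules above can be sketched with a toy model of the
current block data.
The C<%state> hash and the C<resolve_block_span()> helper are
assumptions made for this illustration.
They are not part of the Marpa::R3 API,
and negative offsets and lengths are deliberately left out.

    use strict;
    use warnings;

    # A toy model of the current block data, for illustration only.
    # None of these names are part of the Marpa::R3 API.
    my %state = (
        current_block => 1,
        blocks        => {
            1 => { offset => 10, eoread => 42 },
            2 => { offset => 0,  eoread => 7 },
        },
    );

    # Fill in the missing elements of a block span.
    # Negative offsets and lengths are not handled in this sketch.
    sub resolve_block_span {
        my ( $state, $block_id, $offset, $length ) = @_;
        $block_id //= $state->{current_block};
        my $block = $state->{blocks}{$block_id};
        $offset //= $block->{offset};

        # A missing length runs to eoread; a long one is truncated.
        my $to_eoread = $block->{eoread} - $offset;
        $length = $to_eoread
            if not defined $length or $length > $to_eoread;
        return ( $block_id, $offset, $length );
    }

    # All three elements missing: current block, its offset, to eoread.
    print join( q{, }, resolve_block_span( \%state ) ), "\n";

    # A length which would run past eoread is truncated.
    print join( q{, }, resolve_block_span( \%state, 2, 3, 100 ) ), "\n";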

=head1 Internal scanning

Several Marpa::R3 method calls do internal scanning.
The most basic of these is
the L<C<read()>|/"read()"> method.

Internal scanning requires there to be a current block.
The current block will have a current offset and an eoread.
Internal scanning reads and parses the text of the current block,
starting from the current offset,
until one of the following occurs:

=over 4

=item *
The next character read would be at eoread.
In this case, current offset is equal to eoread.
The character at eoread is B<not> read.

=item *
A parse event is triggered.
In this case, the current offset is
set to the event location.

=item *
A soft failure occurs.
The current offset is set to the
location of the soft failure.

=item *
A hard failure occurs.
The current block data is undefined,
but Marpa::R3 attempts to set it to the
location of the hard failure.

=back

A special case of internal scanning
is when, at the start of internal scanning,
eoread is equal to the current offset.
This happens, for example, if the C<$length> argument
to the
L<read() method|/"read()">
is set to 0.
In this case, as the above rules imply,
internal scanning stops immediately,
leaving current offset unchanged.
This can be used as a technique to set a new input
text block without reading it,
perhaps in preparation for external scanning.
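
The stopping rules can be illustrated with a toy scanner
written in plain Perl.
This is a sketch under simplifying assumptions,
not the Marpa::R3 implementation:
C<new_block()> and C<toy_scan()> are hypothetical names,
an "event" is modeled as a character which stops the scan,
and failures are not modeled at all.

    use strict;
    use warnings;

    # A toy model of internal scanning, for illustration only.
    sub new_block {
        my ($text) = @_;

        # A new block starts at offset 0, with eoread equal to eoblock.
        return { text => $text, offset => 0, eoread => length $text };
    }

    sub toy_scan {
        my ( $block, $is_event ) = @_;
        while ( $block->{offset} < $block->{eoread} ) {
            my $char = substr $block->{text}, $block->{offset}, 1;
            last if $is_event->($char);    # stop at the "event" location
            $block->{offset}++;            # the character was read
        }
        return $block->{offset};
    }

    my $block             = new_block('abc;def');
    my $stop_at_semicolon = sub { $_[0] eq q{;} };

    # The scan stops at the semicolon, before reaching eoread.
    print toy_scan( $block, $stop_at_semicolon ), "\n";

    # With eoread equal to the current offset, nothing is read
    # and the current offset is unchanged.
    $block->{eoread} = $block->{offset};
    print toy_scan( $block, sub { 0 } ), "\n";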

=head1 Evaluation

After scanning, an app almost always wants to evaluate the parse.
This can be done using
L<the recognizer's C<value()>
method|/"value()">.
For advanced techniques, such as retrieving multiple values
from an ambiguous parse,
Marpa::R3's
L<valuer objects|Marpa::R3::Valuer>
can be used.

=head1 Recognizer settings

The B<recognizer settings> are the named arguments
accepted by
the recognizer's L<C<new()>|/"Constructor">,
and/or its L<C<set()>|/"set()"> method.

=head2 event_is_active

=for Marpa::R3::Display
name: recognizer event_is_active named arg synopsis
normalize-whitespace: 1

    my $slr = Marpa::R3::Recognizer->new(
        {
            grammar         => $grammar,
            event_is_active => { 'before c' => 1, 'after b' => 0 },
            event_handlers  => {
                'after a'  => $after_handler,
                'after b'  => $after_handler,
                'after c'  => $after_handler,
                'after d'  => $after_handler,
                'before a' => $before_handler,
                'before b' => $before_handler,
                'before c' => $before_handler,
                'before d' => $before_handler,
            }
        }
    );

=for Marpa::R3::Display::End

The C<event_is_active> recognizer setting
changes the activation setting of events.
Its value should be a reference to a hash,
in which the key of every entry is an event name,
and its value is either 0 or 1.
If the value is 1,
the event named in the hash key will be activated
when the recognizer starts.
If the value is 0,
the event named in the hash key will be inactive
when the recognizer starts.
The C<event_is_active> setting is only allowed
with the L<recognizer's C<new()> method|/"Constructor">.

The setting in the
C<event_is_active> hash
overrides the activation setting in the grammar.
The setting will be in effect
before events at earleme 0 are triggered,
and before any of the input stream is read.

The L<C<activate()> method|/"activate()">
can also be used to change an event's activation,
and that is usually preferable.
But
events at earleme 0
trigger during
the L<recognizer's C<new()> method|/"Constructor"> --
they cannot be affected
by calls of the C<activate()> method.
If an event is initialized to inactive
in the grammar,
the C<event_is_active> recognizer setting
is the only way
for a recognizer
to allow that event to be active
at earleme 0.
Similarly,
if an event is initialized to active
in the grammar,
the C<event_is_active> recognizer setting
is the only way
for a recognizer
to set that event
to be inactive
at earleme 0.

=head2 event_handlers

=for Marpa::R3::Display
name: event examples - basic
normalize-whitespace: 1

    @results = ();
    $recce   = Marpa::R3::Recognizer->new(
        {
            grammar        => $grammar1,
            event_handlers => {
                A => sub () { push @results, 'A'; 'ok' },
                B => sub () { push @results, 'B'; 'ok' },
                C => sub () { push @results, 'C'; 'ok' },
            }
        }
    );

=for Marpa::R3::Display::End

The C<event_handlers> recognizer setting
sets the callbacks which handle events.
Its value should be a reference to a hash,
in which the key of every entry is an event name,
and 
each value of the hash must be an anonymous function.
For an individual entry of this hash,
call the key C<ev>
and call its value C<func>.
Event C<ev> will be handled by calling function
C<func>.

If any event occurs which does not have a handler,
it is a hard failure.
The C<event_handlers> named argument is only allowed
with the L<recognizer's C<new()> method|/"Constructor">.
Full details of the use of event handlers,
with many examples,
are in
L<a separate document|Marpa::R3::Event>.

=head2 grammar

The value of the C<grammar> setting must be
a grammar object.
The C<new()> method is required to have
a C<grammar> setting.
The C<grammar> setting is only allowed
with the L<C<new()> method|/"Constructor">.
Once the recognizer is created, the grammar cannot be
changed.

=head2 too_many_earley_items

The C<too_many_earley_items> setting is optional,
and very few applications will need it.
If specified, it sets the B<Earley item warning threshold> to
a value other than its default.
If an Earley set becomes larger than the
Earley item warning threshold,
a recognizer event is generated,
and
a warning is printed to the trace file handle.

Marpa parses from any BNF,
even when the BNF describes a highly ambiguous grammar.
But highly ambiguous grammars can produce Earley sets large
enough to exceed the memory capacity of even the largest machines.
And parsing for less ambiguous grammars
can be impractically slow.

Ideally C<too_many_earley_items> is set at the
"point of no return" -- an Earley set size which,
when reached, indicates that
the Earley sets are destined to grow out of control.
By default, Marpa calculates
an Earley item warning threshold
for the G1 recognizer
based on the size of the
G1 grammar,
and for each L0 recognizer based on the size
of the L0 grammar.
These default thresholds will never be less than 100.
Marpa::R3's default is the result of considerable experience
and almost all users will be happy with it.

If the
Earley item warning threshold is changed from its default,
the change applies to both L0 and G1 -- currently
there is no way to set them separately.
If the Earley item warning threshold is set to 0,
no recognizer event is generated,
and
warnings about large Earley sets are turned off.
In fact,
Marpa::R3's default values are highly successful
in identifying the true "point of no return",
beyond which it is pointless to continue,
so that turning these warnings off will
rarely be something
that an application wants to do.

The C<too_many_earley_items> setting is allowed
by both
the recognizer's L<C<new()>|/"Constructor">
and its L<C<set()>|/"set()"> method.

=head2 trace_terminals

If non-zero, traces the lexemes --
those tokens passed from the L0 parser to
the G1 parser.
This recognizer setting is the best way to follow
what the L0 parser is doing,
and it is also very helpful for tracing the G1 parser.
The C<trace_terminals> setting is allowed
by both
the recognizer's L<C<new()>|/"Constructor">
and its L<C<set()>|/"set()"> method.

=head2 trace_values

The value of the C<trace_values> setting is a numeric trace level.
If the
numeric trace level is 1, Marpa prints tracing information as values
are computed in the evaluation stack.  A trace level of 0 turns
value tracing off, which is the default. Traces are written to the
trace file handle.
The C<trace_values> setting is allowed
by both
the recognizer's L<C<new()>|/"Constructor">
and its L<C<set()>|/"set()"> method.

=head2 trace_file_handle

The value is a file handle.
Trace output and warning messages
go to the trace file handle.
By default, the trace file handle is inherited from the
grammar.
The C<trace_file_handle> setting is allowed
by both
the recognizer's L<C<new()>|/"Constructor">
and its L<C<set()>|/"set()"> method.

=head1 Constructor

=for Marpa::R3::Display
name: Scanless recognizer synopsis
partial: 1
normalize-whitespace: 1

    my $recce = Marpa::R3::Recognizer->new( { grammar => $grammar } );

=for Marpa::R3::Display::End

The C<new()> method is the constructor for recognizers.
The arguments
to the C<new()> constructor must be one or more hashes of named arguments,
where each hash key is a recognizer setting.
The L<C<grammar>|/"grammar"> recognizer setting is required.
All other recognizer settings are optional.
For more on recognizer settings,
see
L<the section describing them|/"Recognizer settings">.

Parse events may occur during the recognizer's
C<new()> constructor.
Parse events are described in detail in
L<a separate document|Marpa::R3::Event>.

=head1 Mutators

=head2 activate()

=for Marpa::R3::Display
name: recognizer activate() method synopsis
partial: 1
normalize-whitespace: 1

        $recce->activate($_, 0) for @events;

=for Marpa::R3::Display::End

The C<activate()> method allows the recognizer to deactivate and reactivate
parse events.
Parse events are described in
L<a separate document|Marpa::R3::Event>.

The C<activate()> method takes two arguments.
The first is the name of an event, and the second (optional) argument is
0 or 1.
If the argument is 0, the event is deactivated.
If the argument is 1, the event is activated.
An argument of 1 is the default.
Since a recognizer starts with its events activated,
unless the grammar or the
L<C<event_is_active>|/"event_is_active"> setting says otherwise,
0 will probably be more common as the second argument to
C<activate()>.

Although they are not reported until the call of the
C<read()> method,
location 0 events are triggered in the recognizer's
constructor,
before the C<activate()> method can be called.
To deactivate
location zero events,
use the L<C<event_is_active>|/"event_is_active"> recognizer setting
in that recognizer's L<C<new()>|/"Constructor">.

The overhead imposed by events
can be reduced by using the C<activate()> method.
But making many calls to
the C<activate()> method purely for efficiency
purposes will be counter-productive.
Also, deactivated events continue to impose
a small overhead, so if an event is never used,
it should be commented out in the DSL.

B<Return values>:
The return value is reserved for future use.
Currently all failures are hard failures.
Hard failures are thrown.

=head2 lexeme_priority_set()

=for Marpa::R3::Display
name: recognizer lexeme_priority_set() synopsis
normalize-whitespace: 1

        $recce->lexeme_priority_set( 'prefix lexeme', -1 );

=for Marpa::R3::Display::End

Takes as its first argument the name of a lexeme
and changes the priority of that lexeme to the value
of its second argument.
Both arguments are required.

Changing the lexeme priority is a very flexible
technique.
It can, in effect, allow an application
to switch lexers.

B<Return values>:
On success, returns the old priority value.
Failure is thrown.

=head2 read()

=for Marpa::R3::Display
name: Scanless recognizer synopsis
partial: 1
normalize-whitespace: 1

    $recce->read($p_input_string);

=for Marpa::R3::Display::End

=for Marpa::R3::Display
name: recognizer read() synopsis
partial: 1
normalize-whitespace: 1

    $recce->read( \$string, 0, 0 );

=for Marpa::R3::Display::End

C<read()> is the basic method for internal scanning.
It creates a new block and scans it according to its
arguments.
C<read()> takes three arguments, only the first of which is required.
Call them, in order, C<$p_string>, C<$start>, C<$length>.

C<$p_string> is required
and must be a pointer to a string.
A new input text block is created with the string
pointed to by C<$p_string> as its text.
The current block is set to this new block.

The C<$start> argument is optional.
If C<$start> is defined and non-negative,
the current offset of the newly created block is set to
C<$start>.
If C<$start> is negative,
the current offset of the newly created block
is set to C<eoblock> + C<$start> + 1.
If C<$start> is not defined,
the current offset of the newly created block is set to 0.

The C<$length> argument is optional.
If C<$length> is defined and non-negative,
eoread is set to C<$start> + C<$length>.
If C<$length> is defined and negative,
eoread is set to C<eoblock> + C<$length> + 1.
If C<$length> is not defined, eoread is set to the eoblock
of the newly created, current block.

It is a hard failure
if C<$start> and C<$length> would produce an eoread which is after
eoblock.
It is also a hard failure
if C<$start> and C<$length>
would produce an eoread which is before the current offset.
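
The rules for C<$start> and C<$length> can be transcribed
into a short Perl sketch.
The helper C<read_window()> is hypothetical and is not how
Marpa::R3 implements C<read()>.
It also makes one assumption beyond the text:
a non-negative C<$length> is measured from the offset
that results from C<$start>,
which is 0 when C<$start> is not defined.

    use strict;
    use warnings;

    # Hypothetical helper, for illustration only.
    # Returns the current offset and eoread of the new block,
    # or dies where the text above describes a hard failure.
    sub read_window {
        my ( $eoblock, $start, $length ) = @_;

        my $offset =
              !defined $start ? 0
            : $start >= 0     ? $start
            :                   $eoblock + $start + 1;

        # Assumption: a non-negative $length counts from $offset.
        my $eoread =
              !defined $length ? $eoblock
            : $length >= 0     ? $offset + $length
            :                    $eoblock + $length + 1;

        die "eoread ($eoread) is after eoblock ($eoblock)\n"
            if $eoread > $eoblock;
        die "eoread ($eoread) is before the current offset ($offset)\n"
            if $eoread < $offset;
        return ( $offset, $eoread );
    }

    my $eoblock = 42;
    print join( q{, }, read_window($eoblock) ), "\n";             # defaults
    print join( q{, }, read_window( $eoblock, 0,  0 ) ), "\n";    # read nothing
    print join( q{, }, read_window( $eoblock, 10, -1 ) ), "\n";   # to eoblock

    # A window which would end after eoblock is a failure.
    my $ok = eval { read_window( $eoblock, 40, 5 ); 1 };
    print "failure: $@" if not $ok;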

Once C<read()> has created the new block,
it does internal scanning
L<as described above|/"Internal scanning">.

C<read()> may be called multiple times during a parse.
Each C<read()> creates a new
B<input text block>.
On each call to C<read()>,
the new input text block
becomes the current block.

B<Return values>:
If C<read()> terminates because
the current offset is equal to eoread,
C<read()> succeeds
and returns eoread.
If C<read()> terminates because
of a parse event,
it returns the new current offset,
which is the event location of the parse event.
Parse events are described in detail in
L<a separate document|Marpa::R3::Event>.
A return value of Perl C<undef>
is reserved for soft failures.
Hard failures are thrown.
Currently, all failures are hard failures.

The C<read()> method is
the approximate equivalent of

=for Marpa::R3::Display
name: Block level read() equivalent
normalize-whitespace: 1

    sub block_level_read {
        my ($recce, $p_string, $offset, $length) = @_;
        my $block_id = $recce->block_new($p_string);
        $recce->block_set($block_id);
        $recce->block_move($offset, $length);
        return $recce->block_read();
    }

=for Marpa::R3::Display::End

=head2 resume()

=for Marpa::R3::Display
name: recognizer read/resume synopsis
partial: 1
normalize-whitespace: 1

    $pos = $recce->resume()

=for Marpa::R3::Display::End

The C<resume()> method resumes
internal scanning.
The recognizer must already have a current block set.
C<resume()> does not change
the current block setting of the recognizer.

The C<resume()> method takes two optional arguments,
call them, in order,
C<$start> and C<$length>.
If C<$start> is defined and non-negative,
the current offset of the current block is set to
C<$start>.
If C<$start> is defined and negative,
the current offset of the current block
is set to C<eoblock> + C<$start> + 1.
If C<$start> is not defined,
the current offset of the current block
is left unchanged.

The C<$length> argument is optional.
If C<$length> is defined and non-negative,
eoread of the current block is set to C<$start> + C<$length>.
If C<$length> is defined and negative,
eoread of the current block is set to C<eoblock> + C<$length> + 1.
If C<$length> is not defined,
the eoread of the current block is left unchanged.

Once the current offset and eoread are set,
internal scanning takes place.
Internal scanning is described in detail
L<above|/"Internal scanning">.

Parse events may occur during the C<resume()>
method.
Parse events are described in detail in
L<a separate document|Marpa::R3::Event>.

The C<resume()> method is often used,
in conjunction with external scanning,
to resume internal scanning.
External scanning is described
L<in its own document|Marpa::R3::Ext_Scan>.

B<Return values>:
On success, C<resume()> returns
the new current location.
A return value of Perl C<undef>
is reserved for soft failures.
Hard failures are thrown.
Currently, all failures are hard failures.

The C<resume()> method is
the approximate equivalent of

=for Marpa::R3::Display
name: Block level resume() equivalent
normalize-whitespace: 1

    sub block_level_resume {
        my ($recce, $offset, $length) = @_;
        $recce->block_move( $offset, $length );
        return $recce->block_read();
    }

=for Marpa::R3::Display::End

=head2 set()

=for Marpa::R3::Display
name: recognizer set() synopsis
normalize-whitespace: 1

    my $trace_fh;
    $recce->set( { trace_file_handle => $trace_fh } ); 

=for Marpa::R3::Display::End

This method allows recognizer settings to be changed after a
recognizer is created.
The arguments to
C<set()> must be one or more hashes whose key-value pairs
are recognizer settings and their values.
The allowed recognizer settings are
L<described above|/"Recognizer settings">.

B<Return values>:
The return value is reserved for future use.
Hard failures are thrown.
Currently, all failures are hard failures.

=head2 value()

=for Marpa::R3::Display
name: Scanless recognizer synopsis
partial: 1
normalize-whitespace: 1

    my $value_ref = $recce->value( $self );

=for Marpa::R3::Display::End

The C<value()> method call is the basic method
for evaluating a parse.
It assumes the user wants a single parse value,
and wants to treat parse ambiguity as an error.
Apps which want to parse ambiguous grammars,
or which want to ignore ambiguity,
can do this by using L<a
valuer object|Marpa::R3::Valuer>.

The C<value()> method allows one optional argument.
Call this argument C<$self>.
If specified, C<$self>
explicitly specifies the per-parse argument for the
parse tree.
This per-parse argument can be a Perl scalar of any type,
but the most useful
type for a per-parse argument is a reference
(blessed or unblessed) to a hash or to an array.
The per-parse argument
is the first argument of all
Perl semantics closures.
When data does not conveniently fit into the bottom-up
flow of parse tree evaluation,
the per-parse argument
is useful for sharing data within
the tree.
Symbol tables are one example of the kind of data which parses often
require, but which it is not convenient to accumulate bottom-up.

The parse is evaluated with end-of-parse
at the current G1 location.
Setting end-of-parse is one of the advanced techniques
allowed when an app uses a
L<valuer object|Marpa::R3::Valuer>
instead of the recognizer's C<value()> call.

B<Return values>:
If C<value()> successfully produced a parse value,
C<value()> returns a reference to that value.
Successful parses can have a Perl C<undef> as a parse value,
in which case
C<value()> returns a reference to Perl C<undef>.
If there was no valid parse according to the grammar,
the C<value()> method returns C<undef>.
If the parse was ambiguous, an error is thrown.
The error message will describe the ambiguity in detail.
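
Because a successful parse can have Perl C<undef> as its value,
callers need to test whether the returned reference is defined,
and not whether the value it points to is defined.
The sketch below shows that distinction.
C<describe_value_ref()> is a hypothetical helper,
and the references passed to it stand in for return values
of C<value()>.

    use strict;
    use warnings;

    # Hypothetical helper, for illustration only: classify a
    # return value of the kind described above.
    sub describe_value_ref {
        my ($value_ref) = @_;
        return 'no parse' if not defined $value_ref;
        my $value = ${$value_ref};
        return 'parse value is undef' if not defined $value;
        return "parse value is $value";
    }

    my $undef_value;
    print describe_value_ref(undef),           "\n";    # no parse
    print describe_value_ref( \$undef_value ), "\n";    # a parse, value undef
    print describe_value_ref( \42 ),           "\n";    # a parse, value 42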

The C<value()> method is
the approximate equivalent of

=for Marpa::R3::Display
name: recognizer value() equivalent
normalize-whitespace: 1

    sub recce_value_equivalent {
        my ($recce, $per_parse_arg) = @_;
        my $valuer = Marpa::R3::Valuer->new( { recognizer => $recce } );
        my $ambiguity_level = $valuer->ambiguity_level();
        return if $ambiguity_level == 0;
        if ( $ambiguity_level != 1 ) {
            my $ambiguous_status = $valuer->ambiguous();
            die "Parse of the input is ambiguous\n", $ambiguous_status;
        }
        my $value_ref = $valuer->value($per_parse_arg);
        die '$valuer->value(): No parse', "\n" if not $value_ref;
        return $value_ref;
    }

=for Marpa::R3::Display::End

=head1 Block methods

The block methods allow control
of the input on the block level.

=head2 block_move ()

=for Marpa::R3::Display
name: block_move() synopsis
normalize-whitespace: 1

    $recce->block_move( 0, -1 );

=for Marpa::R3::Display::End

C<block_move()> takes 2 optional arguments:
Call them, in order, C<$offset> and C<$length>.

If C<$offset> is defined and non-negative,
C<block_move()>
sets the current offset of the current block
to C<$offset>.
If C<$offset> is defined and negative,
C<block_move()>
sets the current offset of the current block
to C<eoblock> + C<$offset> + 1.
If C<$offset> is omitted,
the current offset of the current block
is not changed.

If C<$length> is defined and non-negative,
C<block_move()>
sets eoread in the current block
to C<$offset> + C<$length>.
If C<$length> is defined and negative, eoread is
set to C<eoblock> + C<$length> + 1.
If C<$length> is not defined,
the eoread of the current block
is not changed.

B<Return values>:
The return value is reserved for future use.
Currently all failures are hard failures.
Hard failures are thrown.
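
The offset and eoread rules above can be restated as plain arithmetic.
The following is a minimal sketch, not Marpa::R3 code:
C<block_move_sketch()> and its arguments are hypothetical names
used only for illustration.
The sketch assumes that a non-negative C<$length> is added to the offset
as resolved by the rules above.

    # Hypothetical helper: models the block_move() rules on plain numbers.
    # $eoblock is the block length; $cur_offset and $cur_eoread are the
    # settings of the block before the call.
    sub block_move_sketch {
        my ( $eoblock, $cur_offset, $cur_eoread, $offset, $length ) = @_;
        my $new_offset =
              !defined $offset ? $cur_offset
            : $offset >= 0     ? $offset
            :                    $eoblock + $offset + 1;
        # Assumption: a non-negative length counts from the resolved offset.
        my $new_eoread =
              !defined $length ? $cur_eoread
            : $length >= 0     ? $new_offset + $length
            :                    $eoblock + $length + 1;
        return ( $new_offset, $new_eoread );
    }

Under these assumptions, for a block of 10 characters,
the arguments C<(0, -1)> of the synopsis resolve to
offset 0 and eoread 10, that is, to the whole block.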

=head2 block_new ()

=for Marpa::R3::Display
name: block_new() synopsis
normalize-whitespace: 1

    my $main_block_id = $recce->block_new(\"abc");

=for Marpa::R3::Display::End

Takes one, required, argument, call it C<$p_string>.
Creates a new block.
The text of the newly created block
is the string pointed to by
C<$p_string>.
Call the ID of the newly created block,
C<$new_block>.

C<block_new()> does not change the current block settings.
In particular, note that C<block_new()>
does B<not> set the current block of the recognizer
to block C<$new_block>.
This is convenient for those applications which need to
create, for later use, a series of input blocks
based on the contents of the current block.

B<Return values>:
Returns the block ID of the newly created block.
All failures are hard failures.
Hard failures are thrown.

=head2 block_read ()

=for Marpa::R3::Display
name: block_read() synopsis
normalize-whitespace: 1

    $recce->block_read();

=for Marpa::R3::Display::End

C<block_read()> is the low-level method for
internal scanning.
C<block_read()> does not accept any arguments.

C<block_read()>
starts internal scanning of the current block
at the current offset.
Internal scanning is described
L<above|/"Internal scanning">.

Parse events may occur during C<block_read()>.
Parse events are described in detail in
L<a separate document|Marpa::R3::Event>.

B<Return values>:
If C<block_read()> terminates because
it reached eoread,
it returns eoread,
which will also be the current offset.
If C<block_read()> terminates because
of a parse event,
it returns the event location.
All failures are hard failures.
Hard failures are thrown.

=head2 block_set ()

=for Marpa::R3::Display
name: block_set() synopsis
normalize-whitespace: 1

    $recce->block_set($main_block_id);

=for Marpa::R3::Display::End

C<block_set()> takes one, required, argument,
call it C<$block_id>.
C<block_set()> sets the current block to C<$block_id>.
C<block_set()> does not change the current offset or eoread
of any block.

B<Return values>:
The return value is reserved for future use.
Currently all failures are hard failures.
Hard failures are thrown.

=head2 block_progress ()

=for Marpa::R3::Display
name: block_progress() synopsis
normalize-whitespace: 1

    my ($block_id, $offset, $eoread) = $recce->block_progress( );

=for Marpa::R3::Display::End

=for Marpa::R3::Display
name: block_progress() synopsis 2
normalize-whitespace: 1

    ($block_id, $offset, $eoread) = $recce->block_progress( $main_block_id );

=for Marpa::R3::Display::End

C<block_progress()> takes one, optional, argument.
Call this argument C<$block_id>.
If C<$block_id> is defined,
the specified block is C<$block_id>.
If C<$block_id> is undefined or missing,
the specified block is the current block
of the recognizer.
It is a hard failure
if C<$block_id> is undefined or missing,
and there is no current block of the recognizer.

B<Return values>:
C<block_progress()> returns, in order,
the ID of the specified block;
the current offset of the specified block;
and the current eoread of the specified block.
All failures are hard failures.
Hard failures are thrown.
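
As an illustration of the return values,
the following sketch reports how many characters remain
to be read in a block.
The name C<block_remaining()> is hypothetical.
The sketch relies only on the three return values
of C<block_progress()> described above.

    # Hypothetical helper: characters left between the current offset
    # and the eoread of the specified block, or of the current block
    # if $block_id is undefined.
    sub block_remaining {
        my ( $recce, $block_id ) = @_;
        my ( undef, $offset, $eoread ) = $recce->block_progress($block_id);
        return $eoread - $offset;
    }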

=head1 Accessors

=head2 g1_to_block_first()

=for Marpa::R3::Display
name: Scanless g1_to_block_first() synopsis

        my ( $first_block, $first_offset ) =
            $recce->g1_to_block_first($g1_location);

=for Marpa::R3::Display::End

C<g1_to_block_first()> takes a G1 position as its
only argument.

B<Return values>:
C<g1_to_block_first()>
returns two values.
These are, in order,
the block index and the block offset
of the first block position that
corresponds to the G1 position.
All failures are hard failures.
Hard failures are thrown.

=head2 g1_to_block_last()

=for Marpa::R3::Display
name: Scanless g1_to_block_last() synopsis

        my ( $last_block, $last_offset ) =
            $recce->g1_to_block_last($g1_location);

=for Marpa::R3::Display::End

C<g1_to_block_last()> takes a G1 position as its
only argument.
C<g1_to_block_last()> returns the block position corresponding
to the G1 position.

The block position returned by C<g1_to_block_last()>
will be an "actual" position,
in the sense that there actually is a codepoint there.
For most apps,
the return values
of C<g1_to_block_first()>
and
C<g1_to_block_last()>
will be the elements of a
block range which corresponds exactly
to the G1 position.
But note that apps are allowed to use
inputs which
involve multiple blocks,
which are not continuous,
which do not move
forward monotonically,
and which may contain overlaps.
For these apps,
the return values of
C<g1_to_block_first()>
and
C<g1_to_block_last()>
will not be sufficient to
describe the block locations that correspond
to a G1 location.

B<Return values>:
C<g1_to_block_last()>
returns two values:
the block index and block offset
of the last block position that
corresponds to the G1 position.
All failures are hard failures.
Hard failures are thrown.
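
For apps whose input is read from a single block, moving forward monotonically,
the two accessors can be combined into a block span.
The sketch below is one way to do this.
C<g1_to_block_span()> is a hypothetical name,
and the sketch assumes that the last position is inclusive,
as the description of an "actual" position suggests.
It returns an empty list when the two positions do not form
a simple forward range within one block, because in that case,
as noted above, the two positions are not sufficient.

    # Hypothetical helper: block span (block ID, offset, length)
    # for a G1 position, or an empty list in the non-simple cases.
    sub g1_to_block_span {
        my ( $recce, $g1_location ) = @_;
        my ( $first_block, $first_offset ) =
            $recce->g1_to_block_first($g1_location);
        my ( $last_block, $last_offset ) =
            $recce->g1_to_block_last($g1_location);
        return if $first_block != $last_block;
        return if $last_offset < $first_offset;
        # Assumption: $last_offset is the offset of the last codepoint.
        return ( $first_block, $first_offset,
            $last_offset - $first_offset + 1 );
    }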

=head2 g1_literal()

=for Marpa::R3::Display
name: Scanless recognizer diagnostics
partial: 1
normalize-whitespace: 1

    my $last_expression = $recce->g1_literal( $g1_start, $g1_length );

=for Marpa::R3::Display::End

C<g1_literal()> takes two arguments,
call them C<$g1_start> and C<$g1_count>.
The two arguments are interpreted as a G1 span.
C<$g1_count> may be zero.

B<Return values>:
C<g1_literal()>
returns the substring of the input
stream corresponding to the G1 span.
If C<$g1_count> was zero,
the zero-length string will be returned.
All failures are hard failures.
Hard failures are thrown.

=head2 g1_pos()

=for Marpa::R3::Display
name: Scanless g1_pos() synopsis

    my $g1_pos = $recce->g1_pos();

=for Marpa::R3::Display::End

C<g1_pos()> does not take any arguments.

B<Return values>:
C<g1_pos()>
returns the current G1 location.
All failures are hard failures.
Hard failures are thrown.

=head2 exhausted()

=for Marpa::R3::Display
name: recognizer exhausted() synopsis

    my $exhausted_status = $recce->exhausted();

=for Marpa::R3::Display::End

The C<exhausted()> method does not take any arguments.
Parsing is said to be "exhausted" when the
recognizer will not accept any further input.
Marpa usually "does what you mean" in case of parse exhaustion,
but this method
allows the recognizer's exhaustion status to be discovered directly.
Parse exhaustion is discussed in detail in
L<a separate document|Marpa::R3::Exhaustion>.

B<Return values>:
On success,
C<exhausted()> returns a Perl true if parsing in a
recognizer is
exhausted, and a Perl false otherwise.
All failures are hard failures.
Hard failures are thrown.

=head2 input_length()

=for Marpa::R3::Display
name: recognizer input_length() synopsis
normalize-whitespace: 1

    my $input_length = $recce->input_length();

=for Marpa::R3::Display::End

=for Marpa::R3::Display
name: recognizer input_length() 1-arg synopsis
normalize-whitespace: 1

    $input_length = $recce->input_length(1);

=for Marpa::R3::Display::End

The C<input_length()> method accepts one, optional, argument,
call it C<$block_id>.
If C<$block_id> is defined, the specified input block is C<$block_id>.
If C<$block_id> is not defined,
the specified input block is
the current input block of the recognizer.

B<Return values>:
C<input_length()> returns the length of the specified input block.
All failures are hard failures.
Hard failures are thrown.

=head2 last_completed()

=for Marpa::R3::Display
name: Scanless recognizer diagnostics
partial: 1
normalize-whitespace: 1

    sub show_last_expression {
        my ($self) = @_;
        my $recce = $self->{recce};
        my ( $g1_start, $g1_length ) = $recce->last_completed('Expression');
        return 'No expression was successfully parsed' if not defined $g1_start;
        my $last_expression = $recce->g1_literal( $g1_start, $g1_length );
        return "Last expression successfully parsed was: $last_expression";
    } ## end sub show_last_expression

=for Marpa::R3::Display::End

=for Marpa::R3::Display
name: Scanless recognizer diagnostics
partial: 1
normalize-whitespace: 1

    my ( $g1_start, $g1_length ) = $recce->last_completed('Expression');

=for Marpa::R3::Display::End

C<last_completed()> takes one, required, argument,
call it C<$symbol>.
C<$symbol> must be the name of a symbol.

B<Return values>:
On success,
C<last_completed()> returns two values.
These are, in order, the G1 location and G1 count,
that is, the G1
L<span|/"Spans">, of the most recent match
for C<$symbol>.
All failures are hard failures.
Hard failures are thrown.

=head2 line_column()

=for Marpa::R3::Display
name: trace example
partial: 1
normalize-whitespace: 1

    my $block_id = $pause_location[0];
    my $start = $pause_location[1];
    my $span_length = $pause_location[2];
    my ( $line, $column ) = $recce->line_column($block_id, $start);

=for Marpa::R3::Display::End

The C<line_column()> method accepts two optional arguments.
Call them, in order, 
C<$id> and C<$offset>.
If C<$id> is defined, the specified block is
the block whose id is C<$id>.
If C<$id> is not defined, the specified block is
the current block of the recognizer.
If C<$offset> is defined, the specified offset is C<$offset>.
If C<$offset> is not defined,
the specified offset is the current offset of the specified block.

Numbering of lines and columns is 1-based,
following UNIX editor tradition.
Except at the end of an input block
(eoblock)
the line and column will be that of an
actual character.
At eoblock the line number
will be that of the last line,
and the column number will be that of the last column
plus one.
Applications which want to treat eoblock as a special case
can test for it using the L<C<pos()> method|/"pos()">
and the L<C<input_length()> method|/"input_length()">.

For line numbering purposes,
a line is considered to end with any newline sequence
as defined in the
Unicode Specification 4.0.0, Section 5.8.
Specifically, a line ends with one of the following:

=over 4

=item *

a LF (line feed U+000A);

=item *

a CR (carriage return, U+000D), when it is not followed by a LF;

=item *

a CRLF sequence (U+000D,U+000A);

=item *

a NEL (next line, U+0085);

=item *

a VT (vertical tab, U+000B);

=item *

a FF (form feed, U+000C);

=item *

a LS (line separator, U+2028) or

=item *

a PS (paragraph separator, U+2029).

=back

C<line_column()> never changes the current block data.
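
Perl's C<\R> regex escape matches the same set of line endings
as the list above, treating CRLF as a single sequence,
so the rules can be illustrated in a few lines of plain Perl.
This is a sketch only, not the Marpa::R3 implementation.
C<line_column_sketch()> is a hypothetical name,
it works on a string rather than on a block,
and it makes no attempt to reproduce Marpa's behavior at eoblock
or at an offset that falls between a CR and a LF.

    # Hypothetical helper: 1-based line and column of $offset in $text.
    sub line_column_sketch {
        my ( $text, $offset ) = @_;
        my $prefix     = substr $text, 0, $offset;
        my $line       = 1;
        my $line_start = 0;

        # Each generic newline before $offset starts a new line.
        while ( $prefix =~ /\R/g ) {
            $line++;
            $line_start = pos $prefix;
        }
        return ( $line, $offset - $line_start + 1 );
    }

For example, in the string C<"ab\ncd">,
offset 4 (the C<d>) is at line 2, column 2.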

B<Return values>:
C<line_column()> returns two values.
These are, in order,
the line and column position
of the specified offset in the specified block.
All failures are hard failures.
Hard failures are thrown.

=head2 literal()

=for Marpa::R3::Display
name: recognizer lexeme_alternative() synopsis
partial: 1
normalize-whitespace: 1

    my $literal = $recce->literal( $block_id, $offset, $length );

=for Marpa::R3::Display::End

C<literal()> accepts three arguments,
which are treated
as a L<block span|/"Block spans">.

B<Return values>:
C<literal()>
returns the substring of the input stream
corresponding to the specified block span.
All failures are hard failures.
Hard failures are thrown.

=head2 progress()

=for Marpa::R3::Display
name: Scanless progress() synopsis

    my $progress_output = $recce->progress();

=for Marpa::R3::Display::End

C<progress()> accepts one, optional, argument.
Call this argument C<$g1_loc>.
If C<$g1_loc> is defined, the specified G1 location
is 
C<$g1_loc>.
Negative G1 locations are interpreted as
L<described above|/"Ranges">.
If C<$g1_loc> is not defined, the specified G1 location
is the current location.

C<progress()>
returns a reference to an array
that describes the progress
of a parse
at a location.
The progress reports returned by
the C<progress()> method
identify rules by their G1 rule ID.
G1 rule IDs can be converted to a list of the rule's
symbols using the L<C<rule()> method
of the grammar|Marpa::R3::Grammar/"rule()">.

B<Return values>:
On success,
C<progress()>
returns a reference to an array that contains
a progress report.
Details about this array can be found in
L<the document on progress reports|Marpa::R3::Progress>.
All failures are hard failures.
Hard failures are thrown.

=head2 progress_show()

=for Marpa::R3::Display
name: Scanless progress_show() synopsis
partial: 1
normalize-whitespace: 1

    my $progress_show_output = $recce->progress_show();

=for Marpa::R3::Display::End

C<progress_show()> takes two optional arguments,
call them C<$g1_start> and C<$g1_count>.
The two arguments are interpreted as a
L<G1 span|/"Spans">.
If C<$g1_start> is omitted or undefined,
it defaults to the current G1 location.
If C<$g1_count> is omitted or undefined,
it defaults to 1.


C<progress_show()>
returns a string showing
the progress of the parse.
For a description of its output,
see L<Marpa::R3::Progress>.
The output
of C<progress_show()>
is intended only for reading by humans.
The exact format is subject to change
and should not be relied on by applications.

As the above rules for the arguments
of C<progress_show()>
imply,
the method call C<< $recce->progress_show(0, -1) >>
will return progress reports for the entire parse.
With no arguments,
the string contains reports for
the current location.

B<Return values>:
On success,
C<progress_show()>
returns a string containing a human-readable
progress report of the parse.
Details about this report can be found in
L<the document on progress reports|Marpa::R3::Progress>.
All failures are hard failures.
Hard failures are thrown.

=head2 terminals_expected()

=for Marpa::R3::Display
name: Scanless terminals_expected() synopsis
normalize-whitespace: 1

    my @terminals_expected = @{$recce->terminals_expected()};

=for Marpa::R3::Display::End

C<terminals_expected()> accepts no arguments.
C<terminals_expected()> 
returns a reference to a list of strings,
where the strings are names of symbols.
A symbol name is in this list if and only if
the symbol is a lexeme acceptable at the current G1 location.

The presence of a lexeme in this list means, for example,
that the lexeme will be acceptable in the next call of the L<C<resume()>|/"resume()"> method.
This is highly useful for Ruby Slippers parsing.
A more fine-tuned approach is to identify the lexemes of interest
and create "predicted symbol" events for them.

Some lexemes are specified in the G1 rules
of the DSL as quoted strings
or as character classes.
This is convenient,
but the lexemes created in this way
do not have real names.
Instead, internal names, like
C<[Lex-1]>, are created for them,
and these are what appear in the
list of strings
returned by C<terminals_expected()>.
If an application wants a quoted string
or a character class to have a
human-friendly name,
the application must provide that name explicitly,
by specifying the character class
or quoted string in an L0 rule.

B<Return values>:
On success,
C<terminals_expected()>
returns a reference to a list of strings.
The strings will be the names of lexemes acceptable to Marpa::R3
at the current G1 location.
All failures are hard failures.
Hard failures are thrown.
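
An application that reports expected lexemes to a user
may prefer to omit the internal names.
The following sketch shows one way to do so.
C<named_terminals()> is a hypothetical name,
and the filter rests on an assumption that is not guaranteed by this document:
that internal names always begin with an opening square bracket,
as C<[Lex-1]> does.

    # Hypothetical helper: the expected lexemes that have explicit names.
    # Assumption: internal names always start with "[", as in "[Lex-1]".
    sub named_terminals {
        my ($recce) = @_;
        return [ grep { not /\A\[/ } @{ $recce->terminals_expected() } ];
    }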

=head1 Details

This section contains additional explanations, not essential to understanding
the rest of this document.
Often they are formal or mathematical.
While some people find these helpful, others find them distracting,
which is why
they are segregated here.

=head2 Block span details

A B<block span arg> is a triple
C<< <$block_id_arg>, <$offset_arg>, <$length_arg> >>.
C<$block_id_arg> is undefined or a block ID.
C<$offset_arg> is undefined or a block offset.
C<$length_arg> is undefined or a length in characters (codepoints).

The specified block span is a triple C<< <block_id>, <offset>, <length> >>,
all of whose elements are defined.
Let C<eoread> be the eoread of C<block_id>.
The specified block span is determined
as follows:

=over 4

=item *

C<block_id> is C<$block_id_arg> if
C<$block_id_arg> is defined.
Otherwise C<block_id> is the current block.
It is a hard failure if C<$block_id_arg> is undefined
and there is no current block.

=item *

If
C<$offset_arg> is defined,
C<offset> is C<$offset_arg>,
converted to a non-negative offset.
It is a hard failure if C<$offset_arg> cannot
be converted to a non-negative offset.
If C<$offset_arg> is not defined,
then C<offset> is the current offset
of C<block_id>.

=item *

If C<$length_arg> is not defined,
then C<length> is
C<eoread> - C<offset>.
If C<$length_arg> is defined,
C<length> is C<$length_arg>,
converted to a non-negative length.
It is a hard failure if C<$length_arg> cannot
be converted to a non-negative length.
If 
C<offset> + C<length> would be greater than C<eoread>,
C<length> is truncated to C<eoread> - C<offset>.

=back
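
The defaulting and truncation rules above can be restated
as arithmetic on plain numbers.
This is a minimal sketch, not Marpa::R3 code:
C<block_span_sketch()> and its arguments are hypothetical names,
the choice of block is left to the caller,
and the conversion of negative arguments is not modeled.

    # Hypothetical helper: resolves the offset and length of a block span.
    # $cur_offset and $eoread are the settings of the chosen block.
    # Negative arguments are outside the scope of this sketch.
    sub block_span_sketch {
        my ( $cur_offset, $eoread, $offset_arg, $length_arg ) = @_;
        my $offset = defined $offset_arg ? $offset_arg : $cur_offset;
        my $length = defined $length_arg ? $length_arg : $eoread - $offset;
        die "Negative arguments are not handled by this sketch\n"
            if $offset < 0 or $length < 0;

        # Truncate a span that would run past eoread.
        $length = $eoread - $offset if $offset + $length > $eoread;
        return ( $offset, $length );
    }

For example, with a current offset of 2 and an eoread of 10,
omitting both arguments yields offset 2 and length 8,
while the arguments C<(4, 100)> yield offset 4 and length 6.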

=head1 COPYRIGHT AND LICENSE

=for Marpa::R3::Display
ignore: 1

  Marpa::R3 is Copyright (C) 2018, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.

=for Marpa::R3::Display::End

=cut

# vim: expandtab shiftwidth=4: