Name

Marpa::R3::Recognizer - Recognizer objects

Synopsis

    my $recce = Marpa::R3::Recognizer->new( { grammar => $grammar } );
    my $self = bless { grammar => $grammar }, 'My_Actions';
    $self->{recce} = $recce;

    if ( not defined eval { $recce->read($p_input_string); 1 }
        )
    {
        ## Add last expression found, and rethrow
        my $eval_error = $EVAL_ERROR;
        chomp $eval_error;
        die $self->show_last_expression(), "\n", $eval_error, "\n";
    } ## end if ( not defined eval { $event_count = $recce->read...})

    my $value_ref = $recce->value( $self );
    if ( not defined $value_ref ) {
        die $self->show_last_expression(), "\n",
            "No parse was found, after reading the entire input\n";
    }
    package My_Actions;
    sub do_parens    { return $_[1]->[1] }
    sub do_add       { return $_[1]->[0] + $_[1]->[2] }
    sub do_subtract  { return $_[1]->[0] - $_[1]->[2] }
    sub do_multiply  { return $_[1]->[0] * $_[1]->[2] }
    sub do_divide    { return $_[1]->[0] / $_[1]->[2] }
    sub do_pow       { return $_[1]->[0]**$_[1]->[2] }
    sub do_first_arg { return $_[1]->[0] }
    sub do_script    { return join q{ }, @{$_[1]} }

About this document

This page is the reference document for Marpa::R3's recognizer objects. The Marpa::R3 DSL contains its own internal scanner, integrated into its syntax. In this document, use of the internal scanner is called internal scanning.

Many applications find it useful or necessary to do their own scanning, and Marpa::R3 allows this. When an application bypasses the internal scanner and does its own scanning, this document calls it external scanning. External scanning can be used instead of internal scanning, but Marpa::R3 also allows applications to switch back and forth between internal and external scanning. External scanning is described in its own document.

Block position

If an app uses the basic internal scanning method, read(), and accepts its defaults, the Marpa::R3 input is very simple. The input is provided as a pointer to a string. Scanning starts at the first position and continues to the last character of the string.

Many applications fit this model. But many common practical applications want to do one or more of the following:

  • Stop reading the current file, read from another one, then resume reading the current file. Preprocessors, including the one for the C language, commonly need to do this.

  • Use "Include" files as just described, but also nest them. The C preprocessor does this.

  • Allow "here-docs". Here-docs are similar to "include" files, except they are not separate files, but distant portions of the current file. Perl does this.

  • Implement "if/else" preprocessor logic, skipping sections of the current input. The C preprocessor does this.

  • Allow externally scanned symbols that do not refer to any text in the main input string. For reasonable error messages to be generated, some sort of literal equivalent of the symbol needs to be available.

To implement the above features, an app must be able to switch among multiple input strings, and be able to jump forward and backward within those input strings. Marpa::R3's input model has these abilities.

Marpa::R3 allows multiple input strings. In these documents, these input strings are called input text blocks or, more often, blocks. Marpa::R3 will convert all blocks into Unicode. Characters in Marpa::R3 are Unicode codepoints. Numbering of characters is 0-based.

Reading is always done from the current input text block, or current block. When the recognizer is created, there is no current input text block. The read() method sets the current block implicitly. Other methods allow the app to create blocks and to set the current block explicitly.

Location in the input is a duple: <block_id, block_offset>, where block_id is a block ID, and block_offset is a block offset. Each such duple uniquely identifies a block location.

A block ID uniquely identifies a block. Block ID happens to be an integer, but it should be treated as opaque.

The block offset is a 0-based integer. By default, the next character read is the character at the current block offset of the current block. When context makes it clear, the block offset is called simply the offset. All blocks, whether current or not, maintain a current offset.

An offset of L, where L is the length of the block under discussion, is called the end of the physical block, end of block, or eoblock. Eoblock represents the offset immediately after the last character of the physical block.

In any block, its eoblock is the highest valid block offset. When the current offset of a block is equal to its eoblock, that block has been read all the way to the end, and there are no more characters to be read. For an app to read more characters when the current offset of the current block is at eoblock, either the app must switch blocks, or the app must move the block offset of the current block.

Every block maintains a current end of read. The current end of read is also referred to as the block's end of read or eoread. Reading of a block ends when the next character read would be the one at eoread. In effect, eoread is a temporary eoblock. It is often used to tell the recognizer to read only a portion of a block.

When a block is initialized, its current offset is 0 and its eoread is equal to its eoblock.

In these documents, unless stated otherwise, the bare terms location and position will mean block position as described in this section. When these documents refer to the current block data, they will mean these 3 data items:

  • the recognizer's current block setting;

  • the current offset of that current block; and

  • the current eoread of that current block.

G1 location

In addition to block position, Marpa::R3 apps sometimes have to deal with position in terms of lexemes. (Marpa::R3's lexemes are similar to what in other parsers are called "tokens".) Position in terms of lexemes is called G1 Earley set location or, more often, G1 location.

G1 location is 0-based -- The first lexeme is at G1 location 0. Every G1 location corresponds to at least one lexeme. But Marpa::R3 allows ambiguous lexemes -- more than one lexeme can be read at a single G1 location.

G1 location can be ignored most of the time, but it is useful for tracing the G1 grammar and for advanced techniques.

G1 fenceposts

Restating the definition of G1 location above more pedantically, the G1 locations are an ordered set of sets of tokens. G1 locations contain sets of tokens.

It is sometimes useful to refer to the locations between G1 locations. This is the classic "fencepost" issue: sometimes you want to count sections of fence, and in other cases it is more convenient to count fenceposts.

Let the G1 locations be from 0 to N. If I is greater than 0 and less than or equal to N, G1 fencepost I is before G1 location I and after G1 location I-1. G1 fencepost 0 is before G1 location 0. G1 fencepost N+1 is after G1 location N.

The above implies that

  • If there are N G1 locations, there are N+1 G1 fenceposts.

  • G1 location I is always between G1 fencepost I and G1 fencepost I+1.
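
The fencepost arithmetic can be sketched in a few lines of plain Perl. The helper names g1_fenceposts_for and g1_fencepost_count are hypothetical, chosen for this illustration only. They are not part of the Marpa::R3 API.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa::R3:
# returns the fenceposts immediately before and after G1 location $i.
sub g1_fenceposts_for {
    my ($i) = @_;
    die "G1 location must be a non-negative integer\n"
        if not defined $i or $i !~ /\A\d+\z/;
    return ( $i, $i + 1 );
}

# Hypothetical helper: the number of fenceposts around $n G1 locations.
sub g1_fencepost_count {
    my ($n) = @_;
    return $n + 1;
}

my ( $before, $after ) = g1_fenceposts_for(3);
print "G1 location 3 lies between fenceposts $before and $after\n";
```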

Universes

In order to have a common language for locations, spans, ranges and offsets, the concept of a universe of locations is useful. Each type of location has its own "universe".

For blocks, the universe is the ordered set of characters in the input text block under discussion, and end of universe is the last physical character of the block. This will be the character at offset eoblock - 1, that is, the character immediately before eoblock. Eoread has no effect on end-of-universe.

For G1 locations, the universe is the ordered set of G1 locations, and the end of universe is the last G1 location of that ordered set.

Negative offsets and positions

The negative offset -N is the same as the offset EOU - N + 1, where EOU is the positive offset of the "end of universe" element. Therefore, suppose that block 7 has a length of 42. Since block offsets are 0-based, the last character in block 7 will be at offset 41, so that offset 41 is the "end of universe". Therefore,

  • Block position 7, -1 is the same as block position 7, 41.

  • Block position 7, -2 is the same as block position 7, 40.

  • Block position 7, -42 is the same as block position 7, 0.
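
As a minimal sketch of this arithmetic, the following plain Perl converts a negative offset into its non-negative equivalent. The helper name absolute_offset is an assumption made for this example and does not exist in Marpa::R3.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa::R3.
# $universe_length is the number of elements in the universe,
# so the end of universe (EOU) is at offset $universe_length - 1.
sub absolute_offset {
    my ( $offset, $universe_length ) = @_;
    return $offset if $offset >= 0;
    my $eou = $universe_length - 1;

    # For $offset == -N, this is EOU - N + 1.
    return $eou + $offset + 1;
}

# The three examples from the text, for a block of length 42.
for my $offset ( -1, -2, -42 ) {
    printf "offset %d is the same as offset %d\n",
        $offset, absolute_offset( $offset, 42 );
}
```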

Ranges

In this document, we will refer to ordered subsets of contiguous locations as either ranges or spans. A range is an ordered set of contiguous locations specified by start location and end location: [S ... E]. Alternatively, a range can be seen as the string of locations from offset S to offset E. A range is inclusive, so that it includes the location S as well as the location E. In a range of length 1, S will be equal to E.

Spans

A span is an ordered set of contiguous locations specified by start location and length: [S, L]. A span is a subset of a universe of locations, as was described above for ranges.

The range corresponding to the span [S, L] is [S ... (S+L)-1]. The span corresponding to the range [S ... E] is [S, (E-S)+1]. A span with a negative length is interpreted as if it were the range with that same pair of values.

In general, spans are more convenient for programming. But when fencepost issues are important, spans require a lot of mental arithmetic, and a discussion that uses ranges is easier to follow.

For the sake of some examples, consider a 0-based input text block of length 100.

  • The entire block is the range [0 ... -1] and the span [0, -1].

  • The first 42 characters of the input stream are the range [0 ... 41] and the span [0, 42].

  • The entire block, except for the last character, is the range [0 ... -2] and the span [0, -2].

  • The substring consisting only of the last character is the range [-1 ... -1] and the span [-1, 1].

  • The substring which consists of the last 3 characters is the range [-3 ... -1] and the span [-3, 3].

  • The substring which consists of only the third-to-last character is the range [-3 ... -3] and the span [-3, 1].
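
The conversions between spans and ranges can be sketched in plain Perl, using the examples in the list above. The helper names to_absolute, span_to_range and range_to_span are hypothetical and are not part of Marpa::R3. Zero-length spans are not treated specially in this sketch.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Marpa::R3.

# Convert a possibly negative offset to a non-negative one.
sub to_absolute {
    my ( $offset, $universe_length ) = @_;
    return $offset >= 0 ? $offset : $universe_length + $offset;
}

# Convert the span [$start, $length] to an absolute, inclusive range.
# A negative $length is read as the end of a range, as the text describes.
sub span_to_range {
    my ( $start, $length, $universe_length ) = @_;
    my $first = to_absolute( $start, $universe_length );
    my $last =
        $length < 0
        ? to_absolute( $length, $universe_length )
        : $first + $length - 1;
    return ( $first, $last );
}

# Convert the range [$first ... $last] to an absolute span.
sub range_to_span {
    my ( $first, $last, $universe_length ) = @_;
    my $s = to_absolute( $first, $universe_length );
    my $e = to_absolute( $last,  $universe_length );
    return ( $s, $e - $s + 1 );
}

# The spans from the examples, for a block of length 100.
for my $span ( [ 0, -1 ], [ 0, 42 ], [ 0, -2 ], [ -1, 1 ], [ -3, 3 ], [ -3, 1 ] ) {
    my ( $first, $last ) = span_to_range( @{$span}, 100 );
    print "span [$span->[0], $span->[1]] covers offsets $first to $last\n";
}
```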

Block spans

A block span is a span in terms of block location. It is specified as a triple <$block_id>, <$offset>, <$length>. Sometimes one or more elements of a block span are undefined. (For example, this can happen when a block span is part of a set of arguments.) If $block_id is missing or undefined, the specified block is the current block. If $offset is missing or undefined, the specified offset is the current offset of the specified block. If $length is missing or undefined, the specified length is to the eoread of the specified block. If $length is defined and positive, but would result in a span that goes beyond eoread, the specified length is truncated. Block span is defined more formally below.
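
The defaulting rules for block spans can be illustrated with a toy model in plain Perl. Everything here is an assumption made for the sake of the example: the %model hash and the helper resolve_block_span are hypothetical, and they do not reflect how Marpa::R3 stores or resolves blocks internally. Only non-negative offsets and lengths are handled.

```perl
use strict;
use warnings;

# A toy model of the recognizer's block data (an assumption made for
# illustration, not Marpa::R3's internal representation).
my %model = (
    current_block => 1,
    blocks        => {
        1 => { offset => 10, eoread => 50 },
        2 => { offset => 0,  eoread => 20 },
    },
);

# Hypothetical helper: fill in the defaults of a block span.
sub resolve_block_span {
    my ( $model, $block_id, $offset, $length ) = @_;
    $block_id //= $model->{current_block};
    my $block = $model->{blocks}{$block_id}
        or die "No such block: $block_id\n";
    $offset //= $block->{offset};
    my $to_eoread = $block->{eoread} - $offset;

    # An undefined length means "up to eoread"; a length that would
    # run past eoread is truncated.
    if ( not defined $length or $length > $to_eoread ) {
        $length = $to_eoread;
    }
    return ( $block_id, $offset, $length );
}

# All three elements omitted: current block, current offset, to eoread.
print join( q{, }, resolve_block_span( \%model ) ), "\n";

# A length that runs past eoread is truncated.
print join( q{, }, resolve_block_span( \%model, 2, 5, 100 ) ), "\n";
```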

Internal scanning

Several Marpa::R3 method calls do internal scanning. The most basic of these is the read() method.

Internal scanning requires there to be a current block. The current block will have a current offset and an eoread. Internal scanning reads and parses the text of the current block, starting from the current offset, until one of the following occurs:

  • The next character read would be at eoread. In this case, current offset is equal to eoread. The character at eoread is not read.

  • A parse event is triggered. In this case, the current offset is set to the event location.

  • A soft failure occurs. The current offset is set to the location of the soft failure.

  • A hard failure occurs. The current block data is undefined, but Marpa::R3 attempts to set it to the location of the hard failure.

A special case of internal scanning is when, at the start of internal scanning, eoread is equal to the current offset. This happens, for example, if the $length argument to the read() method is set to 0. In this case, as the above rules imply, internal scanning stops immediately, leaving current offset unchanged. This can be used as a technique to set a new input text block without reading it, perhaps in preparation for external scanning.

Evaluation

After scanning, an app almost always wants to evaluate the parse. This can be done using the recognizer's value() method. For advanced techniques, such as retrieving multiple values from an ambiguous parse, Marpa::R3's valuer objects can be used.

Recognizer settings

The recognizer settings are the named arguments accepted by the recognizer's new() method, its set() method, or both.

event_is_active

    my $slr = Marpa::R3::Recognizer->new(
        {
            grammar         => $grammar,
            event_is_active => { 'before c' => 1, 'after b' => 0 },
            event_handlers  => {
                'after a'  => $after_handler,
                'after b'  => $after_handler,
                'after c'  => $after_handler,
                'after d'  => $after_handler,
                'before a' => $before_handler,
                'before b' => $before_handler,
                'before c' => $before_handler,
                'before d' => $before_handler,
            }
        }
    );

The event_is_active recognizer setting changes the activation setting of events. Its value should be a reference to a hash, in which the key of every entry is an event name, and its value is either 0 or 1. If the value is 1, the event named in the hash key will be activated when the recognizer starts. If the value is 0, the event named in the hash key will be inactive when the recognizer starts. The event_is_active setting is only allowed with the recognizer's new() method.

The setting in the event_is_active hash overrides the activation setting in the grammar. The setting will be in effect before events at earleme 0 are triggered, and before any of the input stream is read.

The activate() method can also be used to change an event's activation, and that is usually preferable. But events at earleme 0 trigger during the recognizer's new() method -- they can not be affected by calls of the activate() method. If an event is initialized to inactive in the grammar, the event_is_active recognizer setting is the only way for a recognizer to allow that event to be active at earleme 0. Similarly, if an event is initialized to active in the grammar, the event_is_active recognizer setting is the only way for a recognizer to set that event to be inactive at earleme 0.

event_handlers

    @results = ();
    $recce   = Marpa::R3::Recognizer->new(
        {
            grammar        => $grammar1,
            event_handlers => {
                A => sub () { push @results, 'A'; 'ok' },
                B => sub () { push @results, 'B'; 'ok' },
                C => sub () { push @results, 'C'; 'ok' },
            }
        }
    );

The event_handlers recognizer setting sets the callbacks which handle events. Its value should be a reference to a hash, in which the key of every entry is an event name, and each value of the hash must be an anonymous function. For an individual entry of this hash, call the key ev and call its value func. Event ev will be handled by calling function func.

If any event occurs which does not have a handler, it is a hard failure. The event_handlers named argument is only allowed with the recognizer's new() method. Full details of the use of event handlers, with many examples, are in a separate document.

grammar

The value of the grammar setting must be a grammar object. The new() method is required to have a grammar setting. The grammar setting is only allowed with the new() method. Once the recognizer is created, the grammar cannot be changed.

too_many_earley_items

The too_many_earley_items setting is optional, and very few applications will need it. If specified, it sets the Earley item warning threshold to a value other than its default. If an Earley set becomes larger than the Earley item warning threshold, a recognizer event is generated, and a warning is printed to the trace file handle.

Marpa parses from any BNF, even when the BNF describes a highly ambiguous grammar. But highly ambiguous grammars can produce Earley sets large enough to exceed the memory capacity of even the largest machines. And parsing for less ambiguous grammars can be impractically slow.

Ideally too_many_earley_items is set at the "point of no return" -- an Earley set size which, when reached, indicates that the Earley sets are destined to grow out of control. By default, Marpa calculates an Earley item warning threshold for the G1 recognizer based on the size of the G1 grammar, and for each L0 recognizer based on the size of the L0 grammar. These default thresholds will never be less than 100. Marpa::R3's default is the result of considerable experience and almost all users will be happy with it.

If the Earley item warning threshold is changed from its default, the change applies to both L0 and G1 -- currently there is no way to set them separately. If the Earley item warning threshold is set to 0, no recognizer event is generated, and warnings about large Earley sets are turned off. In fact, Marpa::R3's default values are highly successful in identifying the true "point of no return", beyond which it is pointless to continue, so that turning these warnings off will rarely be something that an application wants to do.

The too_many_earley_items setting is allowed by both the recognizer's new() and its set() method.

trace_terminals

If non-zero, traces the lexemes -- those tokens passed from the L0 parser to the G1 parser. This recognizer setting is the best way to follow what the L0 parser is doing, and it is also very helpful for tracing the G1 parser. The trace_terminals setting is allowed by both the recognizer's new() and its set() method.

trace_values

The value of the trace_values setting is a numeric trace level. If the numeric trace level is 1, Marpa prints tracing information as values are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle. The trace_values setting is allowed by both the recognizer's new() and its set() method.

trace_file_handle

The value is a file handle. Trace output and warning messages go to the trace file handle. By default, the trace file handle is inherited from the grammar. The trace_file_handle setting is allowed by both the recognizer's new() and its set() method.

Constructor

    my $recce = Marpa::R3::Recognizer->new( { grammar => $grammar } );

The new() method is the constructor for recognizers. The arguments to the new() constructor must be one or more hashes of named arguments, where each hash key is a recognizer setting. The grammar recognizer setting is required. All other recognizer settings are optional. For more on recognizer settings, see the section describing them.

Parse events may occur during the recognizer's new() constructor. Parse events are described in detail in a separate document.

Mutators

activate()

        $recce->activate($_, 0) for @events;

The activate() method allows the recognizer to deactivate and reactivate parse events. Parse events are described in a separate document.

The activate() method takes two arguments. The first is the name of an event, and the second (optional) argument is 0 or 1. If the second argument is 0, the event is deactivated. If the second argument is 1, the event is activated. A second argument of 1 is the default. Since a recognizer always starts with all defined events activated, 0 will probably be more common as the second argument to activate().

Although they are not reported until the call of the read() method, location 0 events are triggered in the recognizer's constructor, before the activate() method can be called. To deactivate location zero events, use the event_is_active recognizer setting in that recognizer's new().

The overhead imposed by events can be reduced by using the activate() method. But making many calls to the activate() method purely for efficiency purposes will be counter-productive. Also, deactivated events continue to impose a small overhead, so if an event is never used, it should be commented out in the DSL.

Return values: The return value is reserved for future use. Currently all failures are hard failures. Hard failures are thrown.

lexeme_priority_set()

        $recce->lexeme_priority_set( 'prefix lexeme', -1 );

Takes as its first argument the name of a lexeme and changes the priority of that lexeme to the value of its second argument. Both arguments are required.

Changing the lexeme priority is a very flexible technique. It can, in effect, allow an application to switch lexers.

Return values: On success, returns the old priority value. Failure is thrown.

read()

    $recce->read($p_input_string);
    $recce->read( \$string, 0, 0 );

read() is the basic method for internal scanning. It creates a new block and scans it according to its arguments. read() takes three arguments, only the first of which is required. Call them, in order, $p_string, $start, $length.

$p_string is required and must be a pointer to a string. A new input text block is created with the string pointed to by $p_string as its text. The current block is set to this new block.

The $start argument is optional. If $start is defined and non-negative, the current offset of the newly created block is set to $start. If $start is negative, the current offset of the newly created block is set to eoblock + $start + 1. If $start is not defined, the current offset of the newly created block is set to 0.

The $length argument is optional. If $length is defined and non-negative, eoread is set to $start + $length. If $length is defined and negative, eoread is set to eoblock + $length + 1. If $length is not defined, eoread is set to the eoblock of the newly created, current block.

It is a hard failure if $start and $length would produce an eoread which is after eoblock. It is also a hard failure if $start and $length would produce an eoread which is before the current offset.
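
The way $start and $length determine the current offset and eoread can be sketched in plain Perl. The helper read_window is hypothetical and only models the arithmetic described above. It is not how read() is implemented, and it deliberately leaves out negative values of $start.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa::R3. Given the length of a
# new block (its eoblock) and the optional $start and $length arguments
# of read(), compute the initial current offset and eoread.
# Negative values of $start are not modelled in this sketch.
sub read_window {
    my ( $eoblock, $start, $length ) = @_;
    $start //= 0;
    die "This sketch only handles non-negative \$start\n" if $start < 0;

    my $eoread;
    if ( not defined $length ) {
        $eoread = $eoblock;
    }
    elsif ( $length >= 0 ) {
        $eoread = $start + $length;
    }
    else {
        $eoread = $eoblock + $length + 1;
    }

    # The two conditions which the text describes as hard failures.
    die "eoread ($eoread) is after eoblock ($eoblock)\n"
        if $eoread > $eoblock;
    die "eoread ($eoread) is before the current offset ($start)\n"
        if $eoread < $start;
    return ( $start, $eoread );
}

my ( $offset, $eoread ) = read_window( 42, 10, 5 );
print "current offset $offset, eoread $eoread\n";
```

With a $start and $length of 0, this sketch returns an offset and an eoread of 0, which corresponds to the special case where internal scanning stops immediately.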

Once read() has created the new block, it does internal scanning as described above.

read() may be called multiple times during a parse. Each read() creates a new input text block. On each call to read(), the new input text block becomes the current text block.

Return values: If read() terminates because the current offset is equal to eoread, read succeeds and returns eoread. If read() terminates because of a parse event, it returns the new current offset, which is the event location of the parse event. Parse events are described in detail in a separate document. A return value of Perl undef is reserved for soft failures. Hard failures are thrown. Currently, all failures are hard failures.

The read() method is the approximate equivalent of

    sub block_level_read {
        my ($recce, $p_string, $offset, $length) = @_;
        my $block_id = $recce->block_new($p_string);
        $recce->block_set($block_id);
        $recce->block_move($offset, $length);
        return $recce->block_read();
    }

resume()

    $pos = $recce->resume();

The resume() method resumes internal scanning. The recognizer must already have a current block set. resume() does not change the current block setting of the recognizer.

The resume() method takes two optional arguments, call them, in order, $start and $length. If $start is defined and non-negative, the current offset of the current block is set to $start. If $start is defined and negative, the current offset of the current block is set to eoblock + $start + 1. If $start is not defined, the current offset of the current block is left unchanged.

The $length argument is optional. If $length is defined and non-negative, eoread of the current block is set to $start + $length. If $length is defined and negative, eoread of the current block is set to eoblock + $length + 1. If $length is not defined, the eoread of the current block is left unchanged.

Once the current offset and eoread are set, internal scanning takes place. Internal scanning is described in detail above.

Parse events may occur during the resume() method. Parse events are described in detail in a separate document.

The resume() method is often used, in conjunction with external scanning, to resume internal scanning. External scanning is described in its own document.

Return values: On success, resume() returns the new current location. A return value of Perl undef is reserved for soft failures. Hard failures are thrown. Currently, all failures are hard failures.

The resume() method is the approximate equivalent of

    sub block_level_resume {
        my ($recce, $offset, $length) = @_;
        $recce->block_move( $offset, $length );
        return $recce->block_read();
    }

set()

    my $trace_fh;
    $recce->set( { trace_file_handle => $trace_fh } ); 

This method allows recognizer settings to be changed after a recognizer is created. The arguments to set() must be one or more hashes whose key-value pairs are recognizer settings and their values. The allowed recognizer settings are described above.

Return values: The return value is reserved for future use. Hard failures are thrown. Currently, all failures are hard failures.

recognizer value()

    my $value_ref = $recce->value( $self );

The value() method call is the basic method for evaluating a parse. It assumes the user wants a single parse value, and wants to treat parse ambiguity as an error. Apps which want to parse ambiguous grammars, or which want to ignore ambiguity, can do this by using a valuer object.

The value() method allows one optional argument. Call this argument $self. If specified, $self explicitly specifies the per-parse argument for the parse tree. This per-parse argument can be a Perl scalar of any type, but the most useful type for a per-parse argument is a reference (blessed or unblessed) to a hash or to an array. The per-parse argument is the first argument of all Perl semantics closures. When data does not conveniently fit into the bottom-up flow of parse tree evaluation, the per-parse argument is useful for sharing data within the tree. Symbol tables are one example of the kind of data which parses often require, but which it is not convenient to accumulate bottom-up.

The parse is evaluated with end-of-parse at the current G1 location. Setting end-of-parse is one of the advanced techniques allowed when an app uses a valuer object instead of the recognizer's value() call.

Return values: If value() successfully produced a parse value, value() returns a reference to that value. Successful parses can have a Perl undef as a parse value, in which case value() returns a reference to Perl undef. If there was no valid parse according to the grammar, the value() method returns undef. If the parse was ambiguous, an error is thrown. The error message will describe the ambiguity in detail.
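
Because a successful parse can have undef as its value, callers need to distinguish an undefined return value from a reference to undef. A minimal sketch in plain Perl follows. The helper describe_value_ref is hypothetical, and its argument stands in for whatever $recce->value() returned.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa::R3. $value_ref stands for
# the return value of the recognizer's value() method.
sub describe_value_ref {
    my ($value_ref) = @_;

    # An undefined return value means that no parse was found.
    return 'no parse' if not defined $value_ref;

    # Otherwise the return value is a reference to the parse value,
    # and the parse value itself may be undef.
    my $value = ${$value_ref};
    return 'parse succeeded, value is undef' if not defined $value;
    return "parse succeeded, value is $value";
}

print describe_value_ref(undef), "\n";
print describe_value_ref( \undef ), "\n";
print describe_value_ref( \42 ), "\n";
```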

The value() method is the approximate equivalent of

    sub recce_value_equivalent {
        my ($recce, $per_parse_arg) = @_;
        my $valuer = Marpa::R3::Valuer->new( { recognizer => $recce } );
        my $ambiguity_level = $valuer->ambiguity_level();
        return if $ambiguity_level == 0;
        if ( $ambiguity_level != 1 ) {
            my $ambiguous_status = $valuer->ambiguous();
            die "Parse of the input is ambiguous\n", $ambiguous_status;
        }
        my $value_ref = $valuer->value($per_parse_arg);
        die '$valuer->value(): No parse', "\n" if not $value_ref;
        return $value_ref;
    }

Block methods

The block methods allow control of the input on the block level.

block_move()

    $recce->block_move( 0, -1 );

block_move() takes 2 optional arguments: Call them, in order, $offset and $length.

If $offset is defined and non-negative, block_move() sets the current offset of the current block to $offset. If $offset is defined and negative, block_move() sets the current offset of the current block to eoblock + $offset + 1. If $offset is omitted, the current offset of the current block is not changed.

If $length is defined and non-negative, block_move() sets eoread in the current block to $offset + $length. If $length is defined and negative, eoread is set to eoblock + $length + 1. If $length is not defined, the eoread of the current block is not changed.

Return values: The return value is reserved for future use. Currently all failures are hard failures. Hard failures are thrown.

block_new()

    my $main_block_id = $recce->block_new(\"abc");

Takes one, required, argument, call it $p_string. Creates a new block. The text of the newly created block is the string pointed to by $p_string. Call the ID of the newly created block, $new_block.

block_new() does not change the current block settings. In particular, note that block_new() does not set the current block of the recognizer to block $new_block. This is convenient for those applications which need to create, for later use, a series of input blocks based on the contents of the current block.

Return values: Returns the block ID of the newly created block. All failures are hard failures. Hard failures are thrown.

block_read()

    $recce->block_read();

block_read() is the low-level method for internal scanning. block_read() does not accept any arguments.

block_read() starts internal scanning of the current block at the current offset. Internal scanning is described above.

Parse events may occur during block_read(). Parse events are described in detail in a separate document.

Return values: If block_read() terminates because it reached eoread, it returns eoread, which will also be the current offset. If block_read() terminates because of a parse event, it returns the event location. All failures are hard failures. Hard failures are thrown.

block_set()

    $recce->block_set($main_block_id);

block_set() takes one, required, argument, call it $block_id. block_set() sets the current block to $block_id. block_set() does not change the current offset or eoread of any block.

Return values: The return value is reserved for future use. Currently all failures are hard failures. Hard failures are thrown.

block_progress()

    my ($block_id, $offset, $eoread) = $recce->block_progress( );
    ($block_id, $offset, $eoread) = $recce->block_progress( $main_block_id );

block_progress() takes one, optional, argument. Call this argument, $block_id. If $block_id is defined, the specified block is $block_id. If $block_id is undefined or missing, the specified block is the current block of the recognizer. It is a hard failure if $block_id is undefined or missing, and there is no current block of the recognizer.

Return values: block_progress() returns, in order, the ID of the specified block; the current offset of the specified block; and the current eoread of the specified block. All failures are hard failures. Hard failures are thrown.

Accessors

g1_to_block_first()

        my ( $first_block, $first_offset ) =
            $recce->g1_to_block_first($g1_location);

g1_to_block_first() takes a G1 position as its only argument.

Return values: g1_to_block_first() returns two values. These are, in order, the block index and the block offset of the first block position that corresponds to the G1 position. All failures are hard failures. Hard failures are thrown.

g1_to_block_last()

        my ( $last_block, $last_offset ) =
            $recce->g1_to_block_last($g1_location);

g1_to_block_last() takes a G1 position as its only argument. g1_to_block_last() returns the block position corresponding to the G1 position.

The block position returned by g1_to_block_last() will be an "actual" position, in the sense that there actually is a codepoint there. For most apps, the return values of g1_to_block_first() and g1_to_block_last() will be the elements of a block range which corresponds exactly to the G1 position. But note that apps are allowed to use inputs which involve multiple blocks, which are not continuous, which do not move forward monotonically, and which may contain overlaps. For these apps, the return values of g1_to_block_first() and g1_to_block_last() will not be sufficient to describe the block locations that correspond to a G1 location.

Return values: g1_to_block_last() returns two values: the block index and block offset of the last block position that corresponds to the G1 position. All failures are hard failures. Hard failures are thrown.

g1_literal()

    my $last_expression = $recce->g1_literal( $g1_start, $g1_length );

g1_literal() takes two arguments, call them $g1_start and $g1_count. The two arguments are interpreted as a G1 span. $g1_count may be zero.

Return values: g1_literal() returns the substring of the input stream corresponding to the G1 span. If $g1_count was zero, the zero-length string will be returned. All failures are hard failures. Hard failures are thrown.
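
The relationship between a G1 span and its literal can be illustrated with a toy model. The following is a minimal sketch and not Marpa::R3 code: g1_literal_model() and the @lexemes table are hypothetical, and the sketch assumes the simplest case, a single block read continuously from start to finish, where each G1 location is advanced by exactly one lexeme. Under that assumption, the literal runs from the start of the first lexeme in the span to the end of the last one, so any discarded text between lexemes is included.

```perl
use strict;
use warnings;

# Hypothetical model, not part of Marpa::R3.
# $lexemes is an assumed table of [ offset, length ] pairs, one per lexeme,
# in the order in which the lexemes were read.
sub g1_literal_model {
    my ( $input, $lexemes, $g1_start, $g1_count ) = @_;
    die "Bad G1 span\n" if $g1_start < 0 or $g1_count < 0;

    # A zero-count span yields the zero-length string.
    return q{} if $g1_count == 0;

    my $first = $lexemes->[$g1_start];
    my $last  = $lexemes->[ $g1_start + $g1_count - 1 ];
    die "G1 span is out of range\n"
        if not defined $first or not defined $last;

    # From the start of the first lexeme to the end of the last lexeme.
    my $start = $first->[0];
    my $end   = $last->[0] + $last->[1];
    return substr $input, $start, $end - $start;
}

my $input   = '1 + 22 * 3';
my @lexemes = ( [ 0, 1 ], [ 2, 1 ], [ 4, 2 ], [ 7, 1 ], [ 9, 1 ] );
print g1_literal_model( $input, \@lexemes, 2, 3 ), "\n";    # 22 * 3
```

Inputs that use multiple blocks, or that jump around within a block, do not fit this model.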

g1_pos()

    my $g1_pos = $recce->g1_pos();

g1_pos() does not take any arguments.

Return values: g1_pos() returns the current G1 location. All failures are hard failures. Hard failures are thrown.

exhausted()

    my $exhausted_status = $recce->exhausted();

The exhausted() method does not take any arguments. Parsing is said to be "exhausted" when the recognizer will not accept any further input. Marpa usually "does what you mean" in case of parse exhaustion, but this method allows the recognizer's exhaustion status to be discovered directly. Parse exhaustion is discussed in detail in a separate document.

Return values: On success, exhausted() returns a Perl true if parsing in a recognizer is exhausted, and a Perl false otherwise. All failures are hard failures. Hard failures are thrown.

input_length()

    my $input_length = $recce->input_length();
    $input_length = $recce->input_length(1);

The input_length() method accepts one optional argument, call it $block_id. If $block_id is defined, the specified input block is $block_id. If $block_id is not defined, the specified input block is the current input block of the recognizer.

Return values: input_length() returns the length of the specified input block. All failures are hard failures. Hard failures are thrown.

last_completed()

    sub show_last_expression {
        my ($self) = @_;
        my $recce = $self->{recce};
        my ( $g1_start, $g1_length ) = $recce->last_completed('Expression');
        return 'No expression was successfully parsed' if not defined $g1_start;
        my $last_expression = $recce->g1_literal( $g1_start, $g1_length );
        return "Last expression successfully parsed was: $last_expression";
    } ## end sub show_last_expression
    my ( $g1_start, $g1_length ) = $recce->last_completed('Expression');

last_completed() takes one, required, argument, call it $symbol. $symbol must be the name of a symbol.

Return values: On success, last_completed() returns two values. These are, in order, the G1 location and G1 count of the G1 span of the most recent match for $symbol. All failures are hard failures. Hard failures are thrown.

line_column()

    my $block_id = $pause_location[0];
    my $start = $pause_location[1];
    my $span_length = $pause_location[2];
    my ( $line, $column ) = $recce->line_column($block_id, $start);

The line_column() method accepts two optional arguments. Call them, in order, $id and $offset. If $id is defined, the specified block is the block whose id is $id. If $id is not defined, the specified block is the current block of the recognizer. If $offset is defined, the specified offset is $offset. If $offset is not defined, the specified offset is the current offset of the specified block.

Numbering of lines and columns is 1-based, following UNIX editor tradition. Except at the end of an input block (eoblock) the line and column will be that of an actual character. At eoblock the line number will be that of the last line, and the column number will be that of the last column plus one. Applications which want to treat eoblock as a special case can test for it using the pos() method and the input_length() method.

For line numbering purposes, a line is considered to end with any newline sequence as defined in the Unicode Specification 4.0.0, Section 5.8. Specifically, a line ends with one of the following:

  • a LF (line feed, U+000A);

  • a CR (carriage return, U+000D), when it is not followed by a LF;

  • a CRLF sequence (U+000D,U+000A);

  • a NEL (next line, U+0085);

  • a VT (vertical tab, U+000B);

  • a FF (form feed, U+000C);

  • a LS (line separator, U+2028); or

  • a PS (paragraph separator, U+2029).

line_column() never changes the current block data.

Return values: line_column() returns two values. These are, in order, the line and column position of the specified offset in the specified block. All failures are hard failures. Hard failures are thrown.
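
The line-ending rules listed above can be sketched in plain Perl. This is an illustration only and not the Marpa::R3 implementation: line_column_of() is a hypothetical helper that works on an ordinary string and a 0-based character offset. Its result for an offset at the very end of a string that ends with a newline is an assumption of the sketch and may differ from what line_column() reports at eoblock.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Marpa::R3.
# Returns the 1-based line and column of the 0-based $offset in $text.
sub line_column_of {
    my ( $text, $offset ) = @_;
    die "Offset out of range\n"
        if $offset < 0 or $offset > length $text;

    my ( $line, $line_start ) = ( 1, 0 );

    # CRLF is tried first, so that it counts as a single line ending.
    # The class covers LF, VT, FF, CR, NEL, LS and PS.
    while ( $text =~ m/\x{0D}\x{0A}|[\x{0A}-\x{0D}\x{85}\x{2028}\x{2029}]/g ) {

        # Only line endings that finish at or before $offset count.
        last if pos($text) > $offset;
        $line++;
        $line_start = pos $text;
    }
    return ( $line, $offset - $line_start + 1 );
}

my $text = "ab\ncd";
printf "line %d, column %d\n", line_column_of( $text, 3 );    # line 2, column 1
printf "line %d, column %d\n", line_column_of( $text, 5 );    # line 2, column 3
```

In the second call the offset is one past the final character, so the column is that of the last column plus one.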

literal()

    my $literal = $recce->literal( $block_id, $offset, $length );

literal() accepts three arguments, which are treated as a block span.

Return values: literal() returns the substring of the input stream corresponding to the specified block span. All failures are hard failures. Hard failures are thrown.

progress()

    my $progress_output = $recce->progress();

progress() accepts one optional argument, call it $g1_loc. If $g1_loc is defined, the specified G1 location is $g1_loc. Negative G1 locations are interpreted as described above. If $g1_loc is not defined, the specified G1 location is the current location.

progress() returns a reference to an array that describes the progress of a parse at a location. The progress reports returned by the progress() method identify rules by their G1 rule ID. G1 rule IDs can be converted to a list of the rule's symbols using the rule() method of the grammar.

Return values: On success, progress() returns a reference to an array that contains a progress report. Details about this array can be found in the document on progress reports. All failures are hard failures. Hard failures are thrown.

progress_show()

    my $progress_show_output = $recce->progress_show();

progress_show() takes two optional arguments, call them $g1_start and $g1_count. The two arguments are interpreted as a G1 span. If $g1_start is omitted or undefined, it defaults to the current G1 location. If $g1_count is omitted or undefined, it defaults to 1.

progress_show() returns a string showing the progress of the parse. For a description of its output, see Marpa::R3::Progress. The output of progress_show() is intended only for reading by humans. The exact format is subject to change and should not be relied on by applications.

As the above rules for the arguments of progress_show() imply, the method call $recce->progress_show(0, -1) will return a string containing progress reports for the entire parse. With no arguments, the string contains reports for the current location.

Return values: On success, progress_show() returns a string containing a human-readable progress report of the parse. Details about this report can be found in the document on progress reports. All failures are hard failures. Hard failures are thrown.

terminals_expected()

    my @terminals_expected = @{$recce->terminals_expected()};

terminals_expected() accepts no arguments. terminals_expected() returns a reference to a list of strings, where the strings are names of symbols. A symbol name is in this list if and only if the symbol is a lexeme acceptable at the current G1 location.

The presence of a lexeme in this list means, for example, that the lexeme will be acceptable in the next call of the resume() method. This is highly useful for Ruby Slippers parsing. A more fine-tuned approach is to identify the lexemes of interest and create "predicted symbol" events for them.

Some lexemes are specified in the G1 rules of the DSL as quoted strings or as character classes. This is convenient, but the lexemes created in this way do not have real names. Instead, internal names, like [Lex-1], are created for them, and these are what appear in the list of strings returned by terminals_expected(). If an application wants a quoted string or a character class to have a human-friendly name, the application must provide that name explicitly, by specifying the character class or quoted string in an L0 rule.

Return values: On success, terminals_expected() returns a reference to a list of strings. The strings will be the names of lexemes acceptable to Marpa::R3 at the current G1 location. All failures are hard failures. Hard failures are thrown.

Details

This section contains additional explanations, not essential to understanding the rest of this document. Often they are formal or mathematical. While some people find these helpful, others find them distracting, which is why they are segregated here.

Block span details

A block span arg is a triple <$block_id_arg>, <$offset_arg>, <$length_arg>. $block_id_arg is undefined or a block ID. $offset_arg is undefined or a block offset. $length_arg is undefined or a length in characters (codepoints).

The specified block span is a triple <block_id>, <offset>, <length>, all of whose elements are defined. Let eoread be the eoread of block_id. The specified block span is determined as follows:

  • block_id is $block_id_arg if $block_id_arg is defined. Otherwise block_id is the current block. It is a hard failure if $block_id_arg is undefined and there is no current block.

  • If $offset_arg is defined, offset is $offset_arg, converted to a non-negative offset. It is a hard failure if $offset_arg cannot be converted to a non-negative offset. If $offset_arg is not defined, then offset is the current offset of block_id.

  • If $length_arg is not defined, then length is eoread - offset. If $length_arg is defined, length is $length_arg, converted to a non-negative length. It is a hard failure if $length_arg cannot be converted to a non-negative length. If offset + length would be greater than eoread, length is truncated to eoread - offset.
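
The rules above can be sketched as a small function. This is a minimal illustration under stated assumptions, not Marpa::R3 code: resolve_block_span() and the $state hash, which stands in for the recognizer's block data, are hypothetical. The sketch does not attempt the conversion of negative arguments and treats them, along with an offset beyond eoread, as failures.

```perl
use 5.010;
use strict;
use warnings;

# Hypothetical helper, not part of Marpa::R3.  $state is an assumed stand-in:
#   { current => $id_or_undef,
#     blocks  => { $id => { offset => $n, eoread => $m }, ... } }
sub resolve_block_span {
    my ( $state, $block_id_arg, $offset_arg, $length_arg ) = @_;

    # The block defaults to the current block, if there is one.
    my $block_id = $block_id_arg // $state->{current};
    die "No block specified and no current block\n"
        if not defined $block_id;
    my $block = $state->{blocks}{$block_id}
        or die "No such block: $block_id\n";
    my $eoread = $block->{eoread};

    # The offset defaults to the current offset of the block.
    my $offset = $offset_arg // $block->{offset};
    die "Offset not handled by this sketch: $offset\n"
        if $offset < 0 or $offset > $eoread;

    # The length defaults to the rest of the block, up to eoread.
    my $length = $length_arg // ( $eoread - $offset );
    die "Length not handled by this sketch: $length\n" if $length < 0;

    # A span that would run past eoread is truncated.
    $length = $eoread - $offset if $offset + $length > $eoread;

    return ( $block_id, $offset, $length );
}

my $state = { current => 1, blocks => { 1 => { offset => 4, eoread => 10 } } };
my @span = resolve_block_span( $state, undef, 2, 100 );    # (1, 2, 8)
say join q{, }, @span;
```

With all three arguments omitted, the same $state yields the span (1, 4, 6): the current block, from its current offset to eoread.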

COPYRIGHT AND LICENSE

  Marpa::R3 is Copyright (C) 2018, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.