The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
package yagg::Tutorial;

# This is only a container for the yagg tutorial to allow viewing
# tutorial-style documentation by way of the perldoc utility.

1;
__END__

=head1 NAME

yagg::Tutorial - Tutorial for yagg

=head1 SYNOPSIS

  # To use the generator
  ./yagg -m nonterminals.yg terminals.lg
  ./output/progs/generate 5

=head1 OVERVIEW

This tutorial will show you how to use B<yagg>, by way of two examples. In the
first example, we create a simple logical expression generator from scratch.
In the second example, we create a more sophisticated logical expression
generator from existing parser/lexer input files, such as those used by
YACC/Bison and LEX/FLEX.  These examples, plus another more sophisticated
fault tree generator are included with the distribution in the F<examples/>
directory.

It is assumed that the reader knows a little about formal grammars. Ideally,
the reader would have some experience writing grammars for input to parser
generators like YACC and Bison.

=head1 EXAMPLE: GENERATOR IMPLEMENTED FROM SCRATCH

Let's say that we want to create a logical expression generator that generates
expressions containing binary operators, parentheses, negation, and atomic
propositions. Examples might be I<-(p or q)>, I<p =E<gt> -q>, and
I<p and (q or r)>.

Here is a simple grammar that would create such expressions:

  wfe : wfe binary_operator wfe |
        "(" wfe ")" |
        unary_operator wfe |
        atomic ;
  binary_operator : "<=>" | "and" | "or" | "=>" ;
  unary_operator : "-" ;
  atomic : "p" | "q" | "r" ;

You can place the above grammar verbatim into a "language grammar" input file,
adding a line containing C<%%> at the top of the file. This is enough for the
generator to do its job. Running the generator as:

  $ yagg -r2 foo.yg
  
will generate the generator, compile it, and run it to create expressions of
length 2.

=head2 Input Files

Another way to write the input files is to put the productions for terminals
such as C<"("> and C<atomic> into a "terminal generator" file. This approach
is more similar to the approach taken by tools such as Lex and YACC, which
separate the lexical analysis of tokens from the grammar-based parsing. Here's
the resulting grammar file

Here's the grammar file, F<logical_expression.yg>:

  %%

  wfe : wfe BINARY_OPERATOR wfe |
        LEFT_PAREN wfe RIGHT_PAREN |
        UNARY_OPERATOR wfe |
        ATOMIC ;

Here we're using upper-case as a convention to indicate the terminals. Notice
that the strings associated with C<BINARY_OPERATOR>, C<UNARY_OPERATOR>, and
C<ATOMIC> have been removed. They are located in the terminal grammar file,
F<logical_expression.lg>:

  %%

  ( "<=>" | "and" | "or" | "=>" ) return BINARY_OPERATOR;

  "-" return UNARY_OPERATOR;

  "(" return LEFT_PAREN;

  ")" return RIGHT_PAREN;

  ( "p" | "q" | "r" ) return ATOMIC;

This file is similar to a LEX input file, but instead of regular expressions
you must provide one of several terminal formats. (We'll describe these
shortly.) For example, the last rules says that an atomic proposition is
either I<p>, I<q>, or I<r>.

=head2 Generating and Running the Generator

Next, use these input files to generate the generator:

  $ yagg -m logical_expression.yg logical_expression.lg

And then run the generator:

  $ ./output/progs/generator 3

You can use the B<-r> flag to run the generator for a given length. You can
also use the B<-o> flag to place the output in a different directory. B<-o>
can specify a directory on a remote machine using any of the B<rsync> formats.
In this case, B<-m> and B<-r> will execute on the remote machine instead of
the local one.

=head2 Terminal Generation Formats

You'll notice that the generator prints out a number of very similar
expressions, such as I<p and q> versus I<q and p>. You can control how
terminals are generated by using one of several terminal specification
formats:

=over

=item Simple String (C<"p">)

A simple string is used wherever the terminal appears.

=item Simple Alternation (C<( "p" | "q" )>)

When such a terminal appears, multiple versions will be generated, one for
each alternative.

=item Equivalence Alternation (C<[ "p" | "q" ]>)

Terminals of this type are treated as belonging to an equivalence class. That
is, a "q" is only generated if the current string already has a "p". This is
useful for terminals that represent things like variable names.

It's easier to explain equivalence with an example. Suppose you use the simple
alternation: C<"p" | "q"> for a terminal C<ATOMIC>. The output for
C<ATOMIC ATOMIC> will be:

  p p
  p q
  q p
  q q

In this case, the first and last generated string are basically the same
because they both have identical elements. Similarly, the second and third
generated string are the same since they both have different elements.
As far as our logical expressions example, we only really care about the first
two cases. We can use the equivalence alternation form to get what we want:
C<[ "p" | "q" ]>. In this case, the output will be:

  p p
  p q

=item Equivalence Generator (C<[ "p#" ]>)

This form is the same as the equivalence alternation, except that the names
are generated as they are needed by substituting "C<#>" in the string with a
number starting at 1, and the number of generated strings is unbounded. For
example, if you use C<[ "p#" ]> for C<ATOMIC>, the output for
C<ATOMIC ATOMIC> will be:

  p1 p1
  p1 p2

=back

If we want to remove the generation of strings with semantically redundant
atomic propositions, we would need to modify the C<ATOMIC> terminal
specification like so:

  [ "p" | "q" | "r" ] return ATOMIC;

If you now rerun the B<yagg> and string generator:

  $ yagg -r 3 logical_expression.yg logical_expression.lg

you will see that the strings have been constrained. Also note that the
generation is faster this time---the comilation process only needs to
recompile the implementation of the C<ATOMIC> generator component.

=head2 More Information

This example is in the F<examples/logical_expression_simple/> directory of the
distribution. See the C<README> file.

=head1 EXAMPLE: GENERATOR IMPLEMENTED FROM AN EXISTING PARSER

In this second example, let's say that you've already created a parser for
your format using LEX and YACC, or FLEX and Bison. If you have, you probably
already have the following:

=over

=item *

A F<lex.l> input file for LEX/FLEX

=item *

A F<yacc.y> input file for YACC/Bison

=item *

An include file for headers needed by the parser and lexer

=item *

Extra files for custom data types and whatnot needed for parsing

=back

The following instructions tell you how to convert them into input files for
B<yagg>. In this example, I will use the example parser located in the
F<examples/logical_expression_constrained/logical_expression_parser/>
directory of the distribution. (The F<.y> and F<.l> files are in the
F<src/logical_expression_parser> subdirectory.) The resulting files are in the
F<examples/logical_expression_constrained> directory, along with a F<README>
file.

(By the way, if you have problems or have to alter the steps while working
with your own files, please let me know so that I can update this tutorial.)

=head2 Prepare your terminal F<lex.lg> file

=over

=item 1.

Copy F<lex.l> to F<lex.lg>

=item 2.

Delete everything in the definitions section, except for an C<%{ ... %}> and
C<%option prefix=...>, if they are there.

=item 3.

In the rules section:

=over

=item 1.

Delete everything in the action blocks except any C<return TOKEN> statements

=item 2.

Delete any start conditions from the rules

=item 3.

Delete any rules with empty action blocks

=item 4.

Replace any regular expressions in the rules with appropriate generators. (See
above.)

=back

=item 4.

In the code header block, comment out the #include for the parser-generated
header file. (This would normally by F<y.tab.h>, but you may have renamed it.)

=item 5.

If you used the B<-Pfoo> flag to change the symbol prefix from I<yy>, you'll
need to add an C<%option prefix=foo>.

=back

=head2 Prepare your nonterminal F<yacc.yg> file

=over

=item 6.

Copy F<yacc.y> to F<yacc.yg>

=item 7.

For any rule that has a side-effect, add another "unaction" block immediately
following that reverses the effect. You don't need to worry about any changes
to C<$$>---they will be handled automatically. If you have an undo block, and
if any of the positional parameters such as C<$1> are pointers, you only need
to delete them if (1) they would normally be deleted in the action block, and
(2) if you actually use them in the unaction block.

If you compare the F<.yg> file to the original F<.y> file, you'll see that I
added only one undo block, because all the other action blocks only
manipulated C<$$>.

=item 8.

Any functions that you call should not have side effects. If they do, you'll
need to reverse them in the unaction block(s).

=back

=head2 Prepare your include file and additional files

=over

=item 9.

Create a directory like F<input_user_code_myproject/src/model>.

=item 10.

Put all your source code in that directory, being careful to create proper
subdirectories if you need them. For example, if you have
C<#include "module/header.h">, you'd want to put F<header.h> in
F<input_user_code_myproject/src/model/module>. (You don't however, need to
change the include to C<#include> at all. The default C++ extension is
.cc--you may need to rename your files. (Alternatively, you can copy the
GNUmakefile and edit the value of CCFILEEXT.)

=back

=head2 Generate the generator

=over

=item 11.

Run the following command to create the C++ generator code.

  ./yagg -f -u input_user_code_myproject yacc.yg lex.lg

=back

=head2 Build the generator

=over

=item 12.

By default, the output is placed in the F<output/> directory. Change to that
directory and run GNU make to compile it.

  cd output
  make

=back

You can do this automatically by giving B<yagg> the -m flag

=head2 Run the generator

=over

=item 13.

Generate the strings by running the F<generate> program in the F<progs/>
directory, giving it a positive integer for the length of the strings to
generate.

  ./progs/generate 9

=back

Specify any length for the generated strings you want. An error message will
be displayed if you choose too small a number.

=head2 More Information

This example is in the F<examples/logical_expression_constrained/> directory
of the distribution. See the F<README> file.

=head1 CONTROLLING GENERATED STRINGS

The most common problem you will encounter is exponential explosion of
strings.  If B<yagg> produces strings that you think it should
not, first check that the grammar actually says what you think it says. 

For example, consider this grammar, which generates strings of "a" and "b"
in any order (see F<ab.yg> in F<examples/ab>):

  node_list : node_list node | node ;
  node : "a" | "b" ;

Obviously we could move C<node> to the terminal file and use equivalence
classes to limit the generated strings. But for now, let's assume explore ways
to limit the strings using the grammar only.

One way is to force all "a"'s to precede all "b"'s. We can do this by modifying
the grammar (see F<ab_grammar_constrained.yg> in F<examples/ab>):

  node_list : a_list b_list | a_list | b_list ;
  a_list : a_list "a" | "a" ;
  b_list : b_list "b" | "b" ;

to change an unordered list of "a"'s and "b"'s so that "b"'s will follow
"a"'s.

Another approach might be to keep the same grammar but add context-sensitive
checks (see F<ab_check_constrained.yg> in F<examples/ab>). In order to do
this, we need to pass the names of the nodes back up the grammar productions,
which means that we need to define some return types and add action blocks:

  // Include some declarations that we need
  %{
  #include <list>
  #include <string>

  using namespace std;
  %}

  // Define the return types
  %union {
    list<string> text_list;
    string       text;
  }

  // Define the grammar production return types
  %type <text_list> node_list
  %type <text>      node

  %%

  node_list :
    node_list node
    {
      if ($$.size() > 0 && $$.back() == "b" && $2 == "a")
        yyerror("\"a\" can't follow \"b\"!");
      else
      {
        $$.push_back($2);
      }
    } |
    node
    {
      $$.push_back($1);
    };

  node :
    "a" { $$ = $1; } |
    "b" { $$ = $1; } ;

In this case, we check the node_list as we build it to make sure that we never
add an "a" after a "b". The call to C<yyerror> is how the generator knows that
the string is invalid. Note that I didn't need to write any unaction blocks
because the actions did not change any state other than C<$$>.

Which method is better? I would use the latter only for constraints which
truly are context sensitive. Restructuring the grammar also results in faster
string generation because there is no time wasted generating strings that will
only be invalidated later.

NOTE: this grammar takes advantage of a feature of the generator--that the
C<%union> isn't really used as a union, so you can define types of different
(and dynamic) sizes. Bad things would happen if you tried to use this grammar
with YACC or Bison.  To fix it, you would need to change the C<%union> types
to pointers and add C<new> and C<delete> to the action blocks, and change "a"
and "b" to tokens.

See F<ab_check_constrained_pointers.yg> in F<examples/ab> for these changes.
You can also compare this to the F<ab_parser.y> file for the parser in the
F<ab_parser> subdirectory.

For a more realistic example of limiting the generated strings, compare the
F<fault_trees> and F<fault_trees_constrained> examples in the F<examples>
directory of the distribution.

=head1 CONTROLLING OUTPUT FORMATTING

To control output, first create a F<user_code/src/progs/> directory. Run yagg
once so that it creates the F<output/> directory. Copy
F<output/src/progs/generate.cc> to F<user_code/src/progs/>. Modify the
C<Print_Strings> function, which inserts a space by default between strings
and adds a newline at the end.

Now, when you run yagg, add the B<-u user_code> option so that your modified
F<generate.cc> will override the normal generated one.  If you want more
control, you can remove the whitespace insertion in C<Print_Strings>, then
make whitespace an explicit part of your grammar.

=head1 AUTHOR

David Coppit <david@coppit.org>

=cut