The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

random_generator - randomly generate strings from a grammar

SYNOPSIS

  random_generator -u user_files_dir grammar.yg

DESCRIPTION

Given YACC-like and LEX-like input files, random_generator generates random strings of a user-specified length. If the grammar file has any action blocks, they are ignored. This means that some strings may be generated that would have been invalidated by context-sensitive checks implemented in the action blocks.

OPTIONS AND ARGUMENTS

.yg grammar file

This is the main grammar file, which is very similar (if not identical) to the YACC file that a programmer might write to parse an input file. There are one or two differences--see the section "INPUT FILE SYNTAX" below for details.

.lg terminal generator file

A terminal specification file that defines productions for nonterminals which represent tokens in the language grammar. This is analogous to, and replaces, the LEX file that a programmer might write to parse an input file. See the section "INPUT FILE SYNTAX" below for details.

This input file is not necessary if you use constant strings exclusively in your grammar (i.e. if the grammar has no terminals).

-d

Output debugging information to STDERR.

-l <length specification>

Generate strings of the specified length. The length specification is an operator followed by a number. For example, "=10", ">10", "<=10". Be sure to quote the specification to protect it from your shell.

-n <number>

Generate the specified number of strings and then stop. The value "-1" is interpreted as generating an infinite number of strings. This is the behavior of the program if -n is not specified.

--help

Print the usage message and exit.

--version

Print the version and exit.

INPUT FILE SYNTAX

This section provides a brief overview of the input file syntax. See the yagg::Tutorial for more discussion.

Language Grammar File

The language grammar file syntax is based on that of YACC. yagg should be able to process your .y file unchanged. (Please report a bug if it can not.) Then for any actions that have side-effects, you will need to add "unaction" blocks to reverse those side effects (described in more detail shortly). Otherwise, you should not have to make any other changes to the file.

A couple of things must be kept in mind. First, make very sure that your code does not contain memory leaks. While you may not notice them for a single execution of a parser, the generator will run your action code many times, in which case memory leaks will accrue. (Valgrind on Linux is a good leak checker.)

If your grammar actions have side effects, you must provide an unaction block that will reverse the side effects. This is because the generator needs to be able to backtrack in its search, and can't figure out how to undo your changes automatically. An example rule with an unaction block follows:

  // nonterminal and rule_1 have a return type of string*
  nonterminal : rule_1 "constant string"
  {
    // Normal action block
    global_count += $2;
    $$ = new string(*$1);
    delete $1;
  }
  {
    // New unaction block
    global_count -= $2; // <--- To restore global state
  };

First, notice that I am careful to delete the $1 positional parameter in the action block. Failing to do so would cause a memory leak. In the unaction block, I decrement the global_count variable. Note that you do not have to worry about deallocating or otherwise restoring the state of $$--that is handled automatically. Any positional parameters such as $2 that are used in the unaction block are automatically initialized with copies. (This means that any pointers to user-defined types must have a copy constructor defined.) Copies will not be made in the unaction block if you do not use a positional parameter, so you only need to delete $1 if you use it.

In an action block, any call to yyerror() is interpreted by the generator as a string that is invalid, and should not be generated. In the unaction block, m_error_occurred will be true if the action block resulted in an invalid string. Here's how you might use this to add a constraint on the generated strings:

  // nonterminal and rule_1 have a return type of string*
  nonterminal : rule_1 "constant string"
  {
    if (*$1 == "foo")
      yyerror("foo is not allowed!");
    else
    {
      global_count += $2;  // <--- Only increment for valid strings
      $$ = new string(*$1);
    }

    delete $1;
  }
  {
    if (!m_error_occurred) // <--- Only decrement for valid strings
      global_count -= $2;
  };

Terminal Generator File

The terminal generator file specifies the productions for terminals (which would be tokens in LEX). The generator supports a number of features for limiting the generation, as described below. The format is loosely based on that of LEX. The major change is that the only code that can be in the {...} blocks is a return statement for the token.

For obvious reasons, generating an unbounded number of possible strings for a regular expression is infeasible. Therefore, the programmer must provide one of several specifications for each terminal that tell the generator how to generate its strings.

Simple

The most simple specification is a constant string, which will replace the terminal wherever it appears in the generated string. For example:

  "=" return EQUAL;

You may also use constant strings in the language grammar file. They will be automatically replaced with "virtual terminals" having a simple string specification.

Alternation

If there are several possibilities for a terminal, you can use the syntax "(alt1|alt2)" to specify them. for example:

  ( "+" | "-" | "*" | "/" ) {
    return OPERATOR;
  }

This example also demonstrates an alternative form for the return statement. During generation, the OPERATOR terminal will be replaced with each of the alternatives, creating four times as many output strings.

Equivalence Classes

If an alternation is enclosed in square brackets, then the alternatives are considered to be interchangeable. This means that strings which differ only in terms of which alternative was chosen will not be printed. However, strings which utilize multiple alternatives will still be generated.

This is useful when generating terminals such as variable names:

  [ "x" | "y" ] return VARIABLE;

Consider a language grammar containing the following:

  SUM : VARIABLE "+" VARIABLE ;

Without the equivalence class, the following strings would be generated:

  x+x
  x+y
  y+x
  y+y

With the equivalence class, the following strings will be generated:

  x+x
  x+y

Since x and y are part of the same equivalence class, x+x is the same as y+y. Similarly, x+y is the same as y+x.

Equivalence Generators

If the terminal specification is an equivalence containing one literal string containing one "#" character, then the generator will create strings as needed, replacing the "#" character with 1, 2, 3, etc. Use "\#" if you want a literal "#" character in the produced string.

This is useful when generating an unlimited number of terminals such as variable names:

  [ "var_#" ] return VARIABLE;

Consider a language grammar containing the following:

  SUM : VARIABLE "+" VARIABLE ;

With the equivalence generator, the following strings will be generated:

  var_1+var_1
  var_1+var_2

You can think of this feature as an "infinite equivalence class".

AUTHOR

David Coppit, <david@coppit.org>, http://coppit.org/

SEE ALSO

Parse::Yapp, YACC, Bison, LEX, FLEX