NAME

Data::Rlist - A lightweight data language for Perl and C++

SYNOPSIS

    use Data::Rlist;

File and string I/O for any Perl data $thing:

    ### Compile data as text.

                  WriteData $thing, $filename;  # compile data into file
                  WriteData $thing, \$string;   # compile data into buffer
    $string_ref = WriteData $thing;             # ditto

    $string     = OutlineData $thing;           # compile printable text
    $string     = StringizeData $thing;         # compile text in a compact form (no newlines)
    $string     = SqueezeData $thing;           # compile text in a super-compact form (no whitespace)

    ### Parse data from text.

    $thing      = ReadData $filename;           # parse data from file
    $thing      = ReadData \$string;            # parse data from string buffer

"ReadData", "WriteData" etc. are auto-exported functions. Alternately we use:

    ### Qualified functions to parse text.

    $thing      = Data::Rlist::read($filename);
    $thing      = Data::Rlist::read($string_ref);
    $thing      = Data::Rlist::read_string($string_or_string_ref);

    ### Qualified functions to compile data into text.

                  Data::Rlist::write($thing, $filename);
    $string_ref = Data::Rlist::write_string($thing);
    $string     = Data::Rlist::write_string_value($thing);

    ### Print data to STDOUT.

    PrintData $thing;

The object-oriented interface:

    ### For objects the '-output' attribute refers to a string buffer or is a filename.
    ### The '-data' attribute defines the value or reference to be compiled into text.

    $object     = new Data::Rlist(-data => $thing, -output => \$target);

    $string_ref = $object->write;           # compile into $target, return \$target
    $string_ref = $object->write_string;    # compile into new string ($target not touched)
    $string     = $object->write_string_value; # ditto, but return string value

    ### Print data to STDOUT.

    print $object->write_string_value;
    print ${$object->write};                # returns \$target

    ### Set output file and write $thing to disk.

    $object->set(-output => ".foorc");

    $object->write;                         # write "./.foorc", return 1
    $object->write(".barrc");               # write "./.barrc" (the filename overrides -output)

    ### The '-input' attribute defines the text to be parsed, either as
    ### string reference or filename.

    $object->set(-input => \$input_string); # assign some text

    $thing      = $object->read;            # parse $input_string into Perl data
    $thing      = $object->read(\$other_string); # parse $other_string (the argument overrides -input)

    $object->set(-input => ".foorc");       # assign some input file

    $foorc      = $object->read;            # parse ".foorc"
    $barrc      = $object->read(".barrc");  # parse some other file
    $thing      = $object->read(\$string);  # parse some string buffer
    $thing      = $object->read_string($string_or_ref); # ditto

Create deep-copies of any Perl data. The metaphor "keelhaul" vividly connotes that $thing is stringified, then parsed back:

    ### Compile a value or ref $thing into text, then parse back into data.

    $reloaded   = KeelhaulData $thing;
    $reloaded   = Data::Rlist::keelhaul($thing);

    $object     = new Data::Rlist(-data => $thing);
    $reloaded   = $object->keelhaul;

Do deep-comparisons of any Perl data:

    ### Deep-compare $a and $b and get a description of all type/value differences.

    @diffs      = CompareData($a, $b);

For more information see "compile", "keelhaul", and "deep_compare".

DESCRIPTION

Venue

Random-Lists (Rlist) is a tag/value text format, which can "stringify" any data structure in 7-bit ASCII text. The basic types are lists and scalars. The syntax is similar, but not equal to Perl's. For example,

    ( "hello", "world" )
    { "hello" = "world"; }

designates two lists, the first of which is sequential, the second associative. The format...

- allows the definition of hierarchical and constant data,

- has no user-defined types, no keywords, no variables,

- has no arithmetic expressions,

- uses 7-bit-ASCII character encoding and escape sequences,

- uses C-style numbers and strings,

- has an extremely minimal syntax implementable in any programming language and system.

You can write any Perl data structure into files as legible text. As with CSV, the lexical overhead of Rlist is minimal: files are merely data.

You can read compiled texts back in Perl and C++ programs. No information is lost between programming languages, and floating-point numbers keep their precision.

You can also compile structured CSV text from Perl data, using special functions from this package that will keep numbers precise and properly quote strings.

Since Rlist has no user-defined types, the data is built from simple scalars and lists. It is conceivable, however, to develop a simple type system and store type information along with the actual data. Otherwise the data structures are tacit agreements between the users of the data. See also the implementation notes for "Perl" and "C++".

Character Encoding

Rlist text uses the 7-bit-ASCII character set. The 95 printable character codes 32 to 126 occupy one character. Codes 0 to 31 and 127 to 255 require four characters each: the \ escape character followed by the three-digit octal code number. For example, the German umlaut character ü (252) is translated into \374. The following codes are exceptions:

    ASCII               ESCAPED AS
    -----               ----------
      9 tab               \t
     10 linefeed          \n
     13 return            \r
     34 quote     "       \"
     39 quote     '       \'
     92 backslash \       \\
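
The escaping rules and the exception table above can be sketched in a few lines of Perl. This is only an illustration, assuming the input is a string of bytes (codes 0 to 255); escape7_sketch is a hypothetical helper and not the package's own "escape7".

```perl
use strict;
use warnings;

# Short escapes from the exception table above.
my %short_escape = (
    "\t" => '\t',  "\n" => '\n',   "\r" => '\r',
    '"'  => '\"',  "'"  => "\\'",  "\\" => '\\\\',
);

# Hypothetical helper (not the module's escape7): escape one byte string.
sub escape7_sketch {
    my ($bytes) = @_;
    # Everything outside codes 32..126, plus the quote characters and the
    # backslash, is replaced by a short escape or by \ and three octal digits.
    $bytes =~ s{([^\x20-\x7E]|["'\\])}{
        exists $short_escape{$1} ? $short_escape{$1}
                                 : sprintf('\\%03o', ord $1)
    }ge;
    return $bytes;
}

print escape7_sketch("Gr" . chr(252) . "n\tTee"), "\n";   # prints Gr\374n\tTee
```

The single substitution pass matters here: escaping the backslash in a separate, later step would escape the backslashes that earlier steps had just produced.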

Values and Default Values

Values are either scalars, array elements or the value of a pair. Each value is constant.

The default scalar value is the empty string "". So in Perl undef is compiled into "".

Numbers, Strings and Here-Documents

Number constants adhere to the IEEE 754 syntax for integer and floating-point numbers (i.e., the same lexical conventions as in C and C++ apply).

String constants consisting only of [a-zA-Z_0-9-/~:.@] characters "look like identifiers" (aka symbols) and need not be quoted. Otherwise string constants follow the C language lexicography: they must be placed in double-quotes (single-quotes are not allowed). Quoted strings are also escaped (i.e., characters are converted to the input character set of 7-bit ASCII).

You can define a string using a line-oriented form of quoting based on the UNIX shell here-document syntax and RFC 111. Multiline quoted strings can be expressed with

    <<DELIMITER

Following the sigil << an identifier specifies how to terminate the string scalar. The value of the scalar will be all lines following the current line down to the line starting with the delimiter (i.e., the delimiter must be at column 1). There must be no space between the sigil and the identifier.

EXAMPLES

Quoted strings:

    "Hello, World!"

Unquoted strings (symbols, identifiers):

    foobar   cogito.ergo.sum   Memento::mori

Here-document strings:

    <<hamlet
    "This above all: to thine own self be true". - (Act I, Scene III).
    hamlet

Integers and floats:

    38   10e-6   -.7   3.141592653589793

For more information see "is_symbol", "is_number" and "escape7".
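
Whether a string may stay unquoted can be tested with the character class given above. The following is a minimal sketch under that assumption only; looks_like_symbol_sketch is a hypothetical name, and the package's own "is_symbol" may apply further rules that this sketch does not know about.

```perl
use strict;
use warnings;

# Hypothetical check built from the character class [a-zA-Z_0-9-/~:.@].
# It is not the module's is_symbol.
sub looks_like_symbol_sketch {
    my ($text) = @_;
    return $text =~ m{\A[a-zA-Z_0-9\-/~:.\@]+\z} ? 1 : 0;
}

for my $candidate ('foobar', 'cogito.ergo.sum', 'Memento::mori', 'Hello, World!') {
    printf "%-16s %s\n", $candidate,
        looks_like_symbol_sketch($candidate) ? 'may stay unquoted'
                                             : 'needs double quotes';
}
```

The first three candidates are the unquoted examples from this section; the last one contains a comma, a space and an exclamation mark and therefore has to be quoted.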

List Values

We have two types of lists: sequential (aka array) and associative (aka map, hash, dictionary).

EXAMPLES

Arrays:

    ( 1, 2, ( 3, "Audiatur et altera pars!" ) )

Maps:

    {
        key = value;
        standalone-key;
        Pi = 3.14159;

        "meta-syntactic names" = (foo, bar, "lorem ipsum", Acme, ___);

        var = {
            log = {
                messages = <<LOG;
    Nov 27 21:55:04 localhost kernel: TSC appears to be running slowly. Marking it as unstable
    Nov 27 22:34:27 localhost kernel: Uniform CD-ROM driver Revision: 3.20
    Nov 27 22:34:27 localhost kernel: Loading iSCSI transport class v2.0-724.<6>PNP: No PS/2 controller found. Probing ports directly.
    Nov 27 22:34:27 localhost kernel: wifi0: Atheros 5212: mem=0x26000000, irq=11
    LOG
            };
        };
    }

Binary Data

Binary data can be represented as base64-encoded string, or here-document string. For example,

    use MIME::Base64;

    $str = encode_base64($binary_buf);

The result $str will be a string broken into lines of no more than 76 characters each, and every line is terminated by a newline "\n". Here is a complete Perl program that creates a file random.rls:

    use MIME::Base64;
    use Data::Rlist;

    our $binary_data = join('', map { chr(int rand 256) } 1..300);
    our $sample = { random_string => encode_base64($binary_data) };

    WriteData $sample, 'random.rls';

These few lines create a file random.rls containing text like the following:

    {
        random_string = <<___
    w5BFJIB3UxX/NVQkpKkCxEulDJ0ZR3ku1dBw9iPu2UVNIr71Y0qsL4WxvR/rN8VgswNDygI0xelb
    aK3FytOrFg6c1EgaOtEudmUdCfGamjsRNHE2s5RiY0ZiaC5E5XCm9H087dAjUHPtOiZEpZVt3wAc
    KfoV97kETH3BU8/bFGOqscCIVLUwD9NIIBWtAw6m4evm42kNhDdQKA3dNXvhbI260pUzwXiLYg8q
    MDO8rSdcpL4Lm+tYikKrgCih9UxpWbfus+yHWIoKo/6tW4KFoufGFf3zcgnurYSSG2KRLKkmyEa+
    s19vvUNmjOH0j1Ph0ZTi2pFucIhok4krJi0B5yNbQStQaq23v7sTqNom/xdRgAITROUIoel5sQIn
    CqxenNM/M4uiUBV9OhyP
    ___
    ;
    }

Note that "WriteData" uses the predefined "default" configuration, which enables here-doc strings. See also MIME::Base64.
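
The line layout of the encoded text, and the fact that decoding restores the original bytes, can be checked without writing any file. The sketch below uses only MIME::Base64; the deterministic 300-byte buffer is an assumption that stands in for the random data of the program above.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64 decode_base64);

# Deterministic stand-in for the random buffer used above (300 bytes).
my $binary_data = join '', map { chr($_ % 256) } 0 .. 299;

my $encoded = encode_base64($binary_data);
my @lines   = split /\n/, $encoded;

# Find the longest encoded line (the newline itself is not counted).
my $longest = 0;
for my $line (@lines) {
    $longest = length($line) if length($line) > $longest;
}

printf "%d lines, longest has %d characters\n", scalar(@lines), $longest;

# Decoding restores the original bytes exactly.
my $restored = decode_base64($encoded);
print "round trip ok\n" if $restored eq $binary_data;
```

300 bytes encode to 400 base64 characters, which is why the sample file above shows five full lines and one short line.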

Embedded Perl Code (Nanoscripts)

Rlist text can define embedded Perl programs, called nanoscripts. The embedded program text has the form of a here-document with the special delimiter "perl". After the Rlist text has been parsed you call "evaluate_nanoscripts" to eval all embedded Perl in the order of definition. The function arranges that within the eval...

  • the $root variable refers to the root of the input, as unblessed array- or hash-reference;

  • the $this variable refers to the array or hash that stores the currently eval'd nanoscript;

  • the $where variable stores the name of the key, or the index, within $this.

The nanoscript can use this information to orient itself within the parsed data, or even to modify the data in-place. The result of eval'ing will replace the nanoscript text. You can also eval the embedded Perl codes programmatically, using the "nanoscripts" and "result" functions.

EXAMPLES

Simple example of an Rlist text that hosts Perl code:

    (<<perl)
    print "Hello, World!";
    perl

Here is a more complex example that defines a list of nanoscripts, and evaluates them:

    use Data::Rlist;

    $data = join('', <DATA>);
    $data = EvaluateData \$data;

    __END__
    ( <<perl, <<perl, <<perl, <<perl )
    print "Hello World!\n"          # english
    perl
    print "Hallo Welt!\n"           # german
    perl
    print "Bonjour le monde!\n"     # french
    perl
    print "Olá mundo!\n"            # portuguese
    perl

When we execute the above script the following output is printed before the script exits:

    Hello World!
    Hallo Welt!
    Bonjour le monde!
    Olá mundo!

Note that when the Rlist text after __END__ is placed in some_file, we can call "EvaluateData("some_file")" for the same effect. The next example modifies the parsed data in place. Imagine a file this_file_modifies_itself with the following content:

    ( <<perl )
    ReadData(\\'{ foo = bar; }');
    perl

When we parse this file using

    $data = ReadData("this_file_modifies_itself");

$data is assigned the following Perl value

    [ "ReadData(\\'{ foo = bar; }');\n" ]

Next we call Data::Rlist::evaluate_nanoscripts() to "morph" this value into

    [ { 'foo' => 'bar' } ]

The same effect can be achieved in just one call

    $data = EvaluateData("this_file_modifies_itself");

Comments

Rlist supports multiple forms of comments: // or # single-line-comments, and /* */ multi-line-comments. You may use all three forms at will.

PACKAGE INTERFACE

The core functions to cultivate package objects are "new", "dock", "set" and "get". When a regular package function is called in object context some omitted arguments are read from object attributes. This is true for the following functions: "read", "write", "read_string", "write_string", "read_csv", "write_csv", "read_conf", "write_conf" and "keelhaul".

Unless called in object context, the first argument is not a Data::Rlist reference and has its ordinary meaning: "read" expects an input file or string, "write" the data to compile, etc.

Construction

new([ATTRIBUTES])

Create a Data::Rlist object from the hash ATTRIBUTES. For example,

    $self = Data::Rlist->new(-input => 'this.dat',
                             -data => $thing,
                             -output => 'that.dat');

For this object the call $self->read() reads from this.dat, and $self->write() writes any Perl data $thing to that.dat.

REGULAR OBJECT ATTRIBUTES

-input => INPUT
-filter => FILTER
-filter_args => FILTER-ARGS

Defines what Rlist text to parse and how to preprocess an input file. INPUT is a filename or string reference. FILTER can be 1 to select the standard C preprocessor cpp. These attributes are applied by "read", "read_string", "read_conf" and "read_csv".

-data => DATA
-options => OPTIONS
-output => OUTPUT

Defines the Perl data to be compiled into text (DATA), how it shall be compiled (OPTIONS) and where to store the compiled text (OUTPUT). When OUTPUT is a string reference the compiled text will be stored in that string. When OUTPUT is undef a new string is created. When OUTPUT is a string value it is a filename. These attributes are applied by "write", "write_string", "write_conf", "write_csv" and "keelhaul".

-header => HEADER

Defines an array of text lines, each of which will be prefixed by a # and then written at the top of the output file.

-delimiter => DELIMITER

Defines the field delimiter for .csv-files. Applied by "read_csv" and "read_conf".

-columns => STRINGS

Defines the column names for .csv-files to be written into the first line.

ATTRIBUTES THAT MASQUERADE PACKAGE GLOBALS

The attributes listed below assign new values to package globals for the time an object method runs.

-InputRecordSeparator => FLAG

Masquerades $/, which affects how lines are read and written to and from Rlist- and CSV-files. You may also set $/ by yourself. See perlport and perlvar.

-MaxDepth => INTEGER
-SafeCppMode => FLAG
-RoundScientific => FLAG

Masquerade $Data::Rlist::MaxDepth, $Data::Rlist::SafeCppMode and $Data::Rlist::RoundScientific.

-EchoStderr => FLAG

Print read errors and warning messages to STDERR (default: off).

-DefaultCsvDelimiter => REGEX
-DefaultConfDelimiter => REGEX

Masquerades $Data::Rlist::DefaultCsvDelimiter and $Data::Rlist::DefaultConfDelimiter. These globals define the default regexes to use when the -options attribute does not specify the "delimiter" regex. Applied by "read_csv" and "read_conf".

-DefaultConfSeparator => STRING

Masquerades $Data::Rlist::DefaultConfSeparator, the default string to use when the -options attribute does not specify the "separator" string. Applied by "write_conf".

dock(SELF, SUB)

Localize object SELF within the package and run SUB. This means that some of SELF's attributes masquerade a few package globals for the time SUB runs. SELF then locks the package, and $Data::Rlist::Locked is greater than 0.

Attribute Access

set(SELF[, ATTRIBUTE]...)

Reset or initialize object attributes, then return SELF. Each ATTRIBUTE is a name/value-pair. See "new" for a list of valid names. For example,

    $obj->set(-input => \$str, -output => 'temp.rls', -options => 'squeezed');

get(SELF, NAME[, DEFAULT])
require(SELF[, NAME])
has(SELF[, NAME])

Get some attribute NAME from object SELF. Unless NAME exists returns DEFAULT. The require method has no default value, hence it dies unless NAME exists. has returns true when NAME exists, false otherwise. For NAME the leading hyphen is optional. For example,

    $self->get('foo');          # returns $self->{-foo} or undef
    $self->get(-foo=>);         # ditto
    $self->get('foo', 42);      # returns $self->{-foo} or 42
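
The lookup rules just described (optional leading hyphen, DEFAULT as fallback, dying when a required attribute is missing) can be sketched with a small stand-alone class. Everything here is hypothetical: AttrSketch is not part of the package, and demand merely stands in for the documented require method.

```perl
use strict;
use warnings;

package AttrSketch;    # hypothetical class, for illustration only

sub new {
    my ($class, %attributes) = @_;
    return bless {%attributes}, $class;
}

# Normalize NAME so that 'foo' and '-foo' address the same attribute.
sub _key {
    my ($name) = @_;
    return $name =~ /^-/ ? $name : "-$name";
}

sub has {
    my ($self, $name) = @_;
    return exists $self->{ _key($name) } ? 1 : 0;
}

sub get {
    my ($self, $name, $default) = @_;
    my $key = _key($name);
    return exists $self->{$key} ? $self->{$key} : $default;
}

# Stands in for the documented "require": no default, dies when NAME is missing.
sub demand {
    my ($self, $name) = @_;
    die "missing attribute $name\n" unless $self->has($name);
    return $self->get($name);
}

package main;

my $sketch = AttrSketch->new(-foo => 'bar');
print $sketch->get('foo'), "\n";        # prints bar
print $sketch->get(-baz => 42), "\n";   # prints 42
```

Storing every attribute under its hyphenated name keeps the object hash compatible with the $self->{-foo} notation used in the examples above.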

Public Functions

read(INPUT[, FILTER, FILTER-ARGS])

Parse data from INPUT, which specifies some Rlist-text. See also "errors", "write".

PARAMETERS

INPUT shall be either

- some Rlist object created by "new",

- a string reference, in which case read and "read_string" parse Rlist text from it,

- a string scalar, in which case read assumes a file to parse.

See "open_input" for the FILTER and FILTER-ARGS parameters, which are used to preprocess an input file. When an input file cannot be opened and flock'd this function dies. When INPUT is an object, arguments for FILTER and FILTER-ARGS override the -filter and -filter_args attributes.

RESULT

The parsed data as array- or hash-reference, or undef if there was no data. The latter may also be the case when the file consists only of comments/whitespace.

NOTES

This function may die. Dying is Perl's mechanism to raise exceptions, which can be caught with eval. For example,

    my $host = eval { use Sys::Hostname; hostname; } || 'some unknown machine';

This code fragment traps the die exception, so that eval returns undef or the result of calling hostname. The following example uses eval to trap exceptions thrown by read:

    $object = new Data::Rlist(-input => $thingfile);
    $thing = eval { $object->read };

    unless (defined $thing) {
        if ($object->errors) {
            print STDERR "$thingfile has syntax errors"
        } else {
            print STDERR "$thingfile not found, is locked or empty"
        }
    } else {
        # Can use $thing
            .
            .
    }

read_csv(INPUT[, OPTIONS, FILTER, FILTER-ARGS])
read_conf(INPUT[, OPTIONS, FILTER, FILTER-ARGS])

Parse data from INPUT, which specifies some comma-separated-values (CSV) text. Both functions

- read data from strings or files,

- use an optional delimiter,

- ignore delimiters in quoted strings,

- ignore empty lines,

- ignore lines beginning with #.

read_conf is a variant of read_csv dedicated to configuration files. Such files consist of lines of the form

    key = value

PARAMETERS

For INPUT see "read". For FILTER, FILTER-ARGS see "open_input".

OPTIONS can be used to override the "delimiter" regex. For example, a delimiter of '\s+' splits the line at horizontal whitespace into multiple values (while respecting quoted strings). For read_csv the delimiter defaults to '\s*,\s*', and for read_conf to '\s*=\s*'. See also "write_csv" and "write_conf".

RESULT

Both functions return a list of lists. Each embedded array defines the fields in a line.
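
The line handling listed above (skip empty and #-lines, split at a delimiter regex, leave delimiters inside quotes alone) can be approximated with the core module Text::ParseWords. This is a sketch under that assumption, not the package's own code; read_lines_sketch is a hypothetical name.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical stand-in for the described line handling; not read_conf itself.
# DELIMITER is a regex string such as '\s*=\s*' or '\s*,\s*'.
sub read_lines_sketch {
    my ($text, $delimiter) = @_;
    my @rows;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;    # ignore empty lines
        next if $line =~ /^#/;       # ignore lines beginning with #
        # Second argument 0: strip the quotes, but do not split inside them.
        push @rows, [ parse_line($delimiter, 0, $line) ];
    }
    return \@rows;
}

my $conf = <<'CONF';
# Comment
SERVER      = hostname
DATABASE    = database_name

LOGIN       = "user,password"
CONF

my $rows = read_lines_sketch($conf, '\s*=\s*');
print join(' | ', @$_), "\n" for @$rows;
```

The result is a reference to an array of arrays with one inner array per data line, which is the shape described under RESULT.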

EXAMPLES

Un/quoting of values happens implicitly. Given a file db.conf

    # Comment
    SERVER      = hostname
    DATABASE    = database_name
    LOGIN       = "user,password"

the call $opts=ReadConf("db.conf") assigns

    [ [ 'SERVER', 'hostname' ],
      [ 'DATABASE', 'database_name' ],
      [ 'LOGIN', 'user,password' ]
    ]

The "WriteConf" function can be used to create or update the configuration:

    push @$opts, [ 'MAGIC VALUE' => 3.14_15 ];

    WriteConf($opts, 'db.conf', { precision => 2 });

This writes to db.conf:

    SERVER = hostname
    DATABASE = database_name
    LOGIN = "user,password"
    "MAGIC VALUE" = 3.14

read_string(INPUT)

Calls "read" to parse Rlist language productions from the string or string-reference INPUT. When INPUT is an object do this for its -input attribute.

result([SELF])

Return the last result of calling "read", which is either undef or some array- or hash-reference. When SELF is passed as object reference, returns the result that occurred the last time SELF had called "read".

nanoscripts([SELF])

In list context return an array of nanoscripts defined by the last call to "read". When SELF is passed return this information for the last time SELF had called "read". The result has the form:

    ( [ $hash_or_array_ref, $key_or_index ], # 1st nanoscript
      [ $hash_or_array_ref, $key_or_index ], # 2nd nanoscript
        .
        .
        .
    )

In scalar context return a reference to the above. This information defines the location of all embedded Perl scripts within the result, and can be used to eval them programmatically. See also "result", "evaluate_nanoscripts".
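
How such a location list can drive the evaluation is sketched below. The code is self-contained and does not call the package: evaluate_sketch is a hypothetical re-creation of the described behaviour, and the data and location list are built by hand instead of coming from the parser.

```perl
use strict;
use warnings;

# Hypothetical re-creation of the evaluation step; not the package's code.
# LOCATIONS has the documented form: a list of [ container, key-or-index ].
sub evaluate_sketch {
    my ($root, @locations) = @_;
    my $count = 0;
    for my $location (@locations) {
        my ($this, $where) = @$location;
        my $is_hash = ref($this) eq 'HASH';
        my $code    = $is_hash ? $this->{$where} : $this->[$where];

        # The eval'd text can see $root, $this and $where.
        my $result = eval $code;
        die "nanoscript at '$where' failed: $@" if $@;

        # The result of eval'ing replaces the nanoscript text.
        if ($is_hash) { $this->{$where} = $result }
        else          { $this->[$where] = $result }
        $count++;
    }
    return $count;
}

# Hand-built stand-in for parsed data that holds two nanoscripts.
my $data  = [ '6 * 7', { greeting => '"stored under key $where"' } ];
my $count = evaluate_sketch($data, [ $data, 0 ], [ $data->[1], 'greeting' ]);

print "$count scripts evaluated: $data->[0], $data->[1]{greeting}\n";
```

This prints "2 scripts evaluated: 42, stored under key greeting". As with any string eval, such code should only be run on text from a trusted source.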

evaluate_nanoscripts([SELF])

Evaluates all nanoscripts defined by the last call to "read". When called as method evaluates the nanoscripts defined by the last time SELF had called "read". Returns the number of scripts or 0 if none were available. Each script is replaced by the result of eval'ing it. (For details and examples see "Embedded Perl Code (Nanoscripts)".)

messages([SELF])

In list context returns a list of compile-time messages that occurred in the last call to "read". In scalar context returns an array reference. When a package object SELF is passed returns the information for the last time SELF had called "read".

errors([SELF])
warnings([SELF])

Returns the number of syntax errors and warnings that occurred in the last call to "read". When called as method returns the number that occurred the last time SELF had called "read".

Example:

    use Data::Rlist;

    our $data = ReadData 'things.rls';

    if (Data::Rlist::errors() || Data::Rlist::warnings()) {
        print join("\n", Data::Rlist::messages())
    } else {
        # Ok, $data is an array- or hash-reference.
        die unless $data;

    }

broken([SELF])

Returns the number of times the last "compile" violated $Data::Rlist::MaxDepth. When called as method returns the information for the last time SELF had called "compile".

missing_input([SELF])

Returns true when the last call to "parse" yielded undef, because there was nothing to parse. When called as method returns the information for the last time SELF had called "parse".

write(DATA[, OUTPUT, OPTIONS, HEADER])

Translates Perl data into Rlist text and writes the text to a file or string buffer. write is auto-exported as "WriteData".

PARAMETERS

DATA is either an object generated by "new", or any Perl data including undef. In case of an object the actual DATA value is defined by its -data attribute. (When -data refers to another Rlist object, this other object is invoked.)

OUTPUT defines the output location, as filename, string-reference or undef. When undef the function allocates a string and returns a reference to it. OUTPUT defaults to the -output attribute when DATA defines an object.

OPTIONS define how to compile DATA: when undef or "fast" uses "compile_fast", when "perl" uses "compile_Perl", otherwise "compile". Defaults to the -options attribute when DATA is an object.

HEADER is a reference to an array of strings that shall be printed literally at the top of an output file. Defaults to the -header attribute when DATA is an object.

RESULT

When write creates a file it returns 0 for failure or 1 for success. Otherwise it returns a string reference.

EXAMPLES

    $self = new Data::Rlist(-data => $thing, -output => $output);

    $self->write;   # Compile $thing into a file ($output is a filename)
                    # or string ($output is a string reference).

    Data::Rlist::write($thing, $output);    # ditto, but using the functional interface.

write_csv(DATA[, OUTPUT, OPTIONS, COLUMNS])
write_conf(DATA[, OUTPUT, OPTIONS, HEADER])

Write DATA as comma-separated-values (CSV) to file or string OUTPUT. write_conf writes configuration files where each line contains a tagname, a separator and a value.

PARAMETERS

DATA is either an object, or defines the data to be compiled as reference to an array of arrays. write_conf uses only the first and second fields. For example,

    [ [ a, b, c ],      # fields of line 1
      [ d, e, f, g ],   # fields of line 2
        .
        .
    ]

OPTIONS specifies the comma-separator ("separator"), how to quote ("auto_quote"), the linefeed ("eol_space") and the numeric precision ("precision"). COLUMNS specifies the column names to be written to the first line. Likewise the text from the HEADER array is written in form of #-comments at the top of an output file.

RESULT

When a file was created both functions return 0 for failure, or 1 for success. Otherwise they return a reference to the compiled text.

EXAMPLES

Functional interface:

    use Data::Rlist;            # imports WriteCSV

    WriteCSV($thing, "foo.dat");

    WriteCSV($thing, "foo.dat", { separator => '; ' }, [qw/GBKNR VBKNR EL LaD/]);

    WriteCSV($thing, \$target_string);

    $string_ref = WriteCSV($thing);

Object-oriented interface:

    $object = new Data::Rlist(-data => $thing, -output => "foo.dat",
                              -options => { separator => '; ' },
                              -columns => [qw/GBKNR VBKNR EL LaD LaD_V/]);

    $object->write_csv;         # write $thing as CSV to foo.dat
    $object->write;             # write $thing as Rlist to foo.dat

    $object->set(-output => \$target_string);

    $object->write_csv;         # write $thing as CSV to $target_string

See also "write" and "read_csv".

write_string(DATA[, OPTIONS])

Stringify any Perl data and return a reference to the string. Works like "write" but always compiles to a new string to which it returns a reference. The default for OPTIONS will be "string".

write_string_value(DATA[, OPTIONS])

Stringify any Perl data and return the compiled text as a string value. OPTIONS default to "default". For example,

    print "\n\$thing dumped: ", Data::Rlist::write_string_value($thing);

    $self = new Data::Rlist(-data => $thing);

    print "\nsame \$thing dumped: ", $self->write_string_value;

keelhaul(DATA[, OPTIONS])

Do a deep copy of DATA according to OPTIONS. First the function compiles DATA to Rlist text, then restores the data from exactly this text. This process is called "keelhauling data", and allows us to

- adjust the accuracy of numbers,

- break circular-references,

- drop \*foo{THING}s,

- bring multiple data sets to the same, common basis.

It is useful (e.g.) when DATA had been hatched by some other code, and you don't know whether it is hierarchical, or if typeglob-refs nest inside. Then keelhaul it to clean it from its past. For example, to bring all numbers in

    $thing = { foo => [ [ .00057260 ], -1.6804e-4 ] };

to a certain accuracy, use

    $deep_copy_of_thing = Data::Rlist::keelhaul($thing, { precision => 4 });

All number scalars in $thing are rounded to 4 decimal places, so they're finally comparable as floating-point numbers. $deep_copy_of_thing is assigned the hash-reference

    { foo => [ [ 0.0006 ], -0.0002 ] }
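
The effect of the precision option on this data can be reproduced without the package. The sketch below copies the structure recursively and rounds with sprintf; that is an assumption about an equivalent result, not a description of how keelhaul works internally (keelhaul goes through Rlist text). round_copy_sketch is a hypothetical name.

```perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);

# Hypothetical stand-in: deep-copy DATA and round every number to PRECISION
# decimal places. Only arrays, hashes and plain scalars are handled.
sub round_copy_sketch {
    my ($data, $precision) = @_;
    my $type = ref $data;
    if ($type eq 'ARRAY') {
        return [ map { round_copy_sketch($_, $precision) } @$data ];
    }
    if ($type eq 'HASH') {
        my %copy;
        $copy{$_} = round_copy_sketch($data->{$_}, $precision) for keys %$data;
        return \%copy;
    }
    return $data unless defined $data && looks_like_number($data);
    return 0 + sprintf("%.${precision}f", $data);
}

my $thing = { foo => [ [ .00057260 ], -1.6804e-4 ] };
my $copy  = round_copy_sketch($thing, 4);

print "$copy->{foo}[0][0] $copy->{foo}[1]\n";    # prints 0.0006 -0.0002
```

The original $thing keeps its unrounded values, because every array and hash is rebuilt rather than modified in place.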

Likewise one can convert all floats to integers:

    $make_integers = new Data::Rlist(-data => $thing, -options => { precision => 0 });

    $thing_without_floats = $make_integers->keelhaul;

When "keelhaul" is called in an array context it also returns the text from which the copy had been built. For example,

    $deep_copy = Data::Rlist::keelhaul($thing);

    ($deep_copy, $rlist_text) = Data::Rlist::keelhaul($thing);

    $deep_copy = new Data::Rlist(-data => $thing)->keelhaul;

DETAILS

"keelhaul" won't die nor return an error, but be prepared for the following effects:

  • ARRAY, HASH, SCALAR and REF references were compiled, whether blessed or not. (Since compiling does not store type information, keelhaul will turn blessed references into plain, unblessed references again.)

  • IO, GLOB and FORMAT references have been converted into strings.

  • Depending on the compile options, CODE references are invoked, deparsed back into their function bodies, or dropped.

  • Depending on the compile options floats are rounded, or are converted to integers.

  • undef'd array elements are converted into the default scalar value "".

  • Unless $Data::Rlist::MaxDepth is 0, anything deeper than $Data::Rlist::MaxDepth will be thrown away.

  • When the data contains objects, no special methods are triggered to "freeze" and "thaw" the objects.

See also "compile" and "deep_compare".

Static Functions

predefined_options([PREDEF-NAME])

Return a predefined hash-reference of compile options. PREDEF-NAME defaults to "default".

complete_options([OPTIONS[, BASICS]])

Completes OPTIONS with BASICS, so that all pairs not already in OPTIONS are copied from BASICS. Always returns a new hash-reference, i.e., neither OPTIONS nor BASICS are modified. Both arguments define hashes or some predefined options name. BASICS defaults to "default". For example,

    $options = complete_options({ precision => 0 }, 'squeezed')

merges the predefined options for "squeezed" text with a numeric precision of 0 (converts all floats to integers).
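
The merge rule itself (pairs missing from OPTIONS are taken from BASICS, and a new hash is returned) is small enough to sketch directly. The code below is a hypothetical stand-in that accepts hash references only, not predefined option names; the option values are illustrative and need not match the package's predefined sets.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the merge step; not the package's function.
# Pairs missing from OPTIONS are taken from BASICS; both inputs stay untouched.
sub complete_sketch {
    my ($options, $basics) = @_;
    my %merged = (%$basics, %$options);    # later pairs (OPTIONS) win
    return \%merged;
}

# Illustrative values only; the real predefined sets may differ.
my $basics  = { precision => 6, separator => ',' };
my $options = { precision => 0 };

my $merged = complete_sketch($options, $basics);
print join(', ', map { $_ . '=' . $merged->{$_} } sort keys %$merged), "\n";
```

Building the result in a fresh hash is what guarantees that neither argument is modified.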

Implementation Functions

open_input(INPUT[, FILTER, FILTER-ARGS])
close_input

Open/close Rlist text file or string INPUT for parsing. Used internally by "read" and "read_csv".

PREPROCESSING

The function can preprocess the INPUT file using FILTER. Use the special value 1 to select the default C preprocessor (gcc -E -Wp,-C). FILTER-ARGS is an optional string of additional command-line arguments to be appended to FILTER. For example,

    my $foo = Data::Rlist::read("foo", 1, "-DEXTRA")

does not parse foo itself, but the output of the command

    gcc -E -Wp,-C -DEXTRA foo

Hence C preprocessor statements are now allowed within foo. For example,

    {
    #ifdef EXTRA
    #include "extra.rlist"
    #endif

        123 = (1, 2, 3);
        foobar = {
            .
            .

SAFE CPP MODE

This mode uses sed and a temporary file. It is enabled by setting $Data::Rlist::SafeCppMode to 1 (the default is 0). It protects single-line #-comments when FILTER begins with either gcc, g++ or cpp. "open_input" then additionally runs sed to convert all input lines beginning with whitespace plus the # character. Only the following cpp-commands are excluded, and only when they appear in column 1:

- #include and #pragma

- #define and #undef

- #if, #ifdef, #else and #endif.

For all other lines sed converts # into ##. This prevents the C preprocessor from evaluating them. Because of Perl's limited open function, which isn't able to dissolve long pipes, the invocation of sed requires a temporary file. The temporary file is created in the same directory as the input file. When you only use // and /* */ comments, however, this read mode is not required.
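
The conversion that this mode performs can be pictured with a few lines of Perl. This is only a rendering of the rules as stated above, under the assumption that exactly the listed directives are exempt; the package itself runs sed, and protect_comments_sketch is a hypothetical name.

```perl
use strict;
use warnings;

# Directives from the list above are kept, but only when # is in column 1.
my $directive = qr/^#(?:include|pragma|define|undef|ifdef|if|else|endif)\b/;

# Hypothetical Perl rendering of the described sed step.
sub protect_comments_sketch {
    my (@lines) = @_;
    for my $line (@lines) {
        next if $line =~ $directive;    # real cpp command: leave as is
        $line =~ s/^(\s*)#/${1}##/;     # comment line: # becomes ##
    }
    return @lines;
}

print "$_\n" for protect_comments_sketch(
    '#ifdef EXTRA',
    '#include "extra.rlist"',
    '#endif',
    '# a single-line comment',
    '    # an indented comment',
    '  #define NOT_IN_COLUMN_1',
    'key = value;',
);
```

The first three lines pass through unchanged, the two comment lines and the indented #define receive a second #, and the data line is not touched.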

lex()

Lexical scanner. Called by "parse" to split the current line into tokens. lex reads # or // single-line-comment and /* */ multi-line-comment as regular white-spaces. Otherwise it returns tokens according to the following table:

    RESULT      MEANING
    ------      -------
    '{' '}'     Punctuation
    '(' ')'     Punctuation
    ','         Operator
    ';'         Punctuation
    '='         Operator
    'v'         Constant value as number, string, list or hash
    '??'        Error
    undef       EOF

lex terminates each here-doc line with a newline character. For example,

        <<test1
        a
        b
        test1

is effectively read as "a\nb\n", which is the same value as the equivalent here-doc in Perl has. So, not all strings can be encoded as a here-doc. For example, it might not be quite obvious to many programmers that "foo\nbar" cannot be expressed as here-doc.
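
The comparison with Perl's own here-documents can be checked directly. The sketch below uses plain Perl only and makes no call into the package.

```perl
use strict;
use warnings;

# A Perl here-doc: every line, including the last one, ends with a newline.
my $here_doc = <<"test1";
a
b
test1

print "here-doc has the expected value\n" if $here_doc eq "a\nb\n";

# A value without a trailing newline has no here-doc form and needs quotes.
my $no_trailing_newline = "foo\nbar";
print "last character is not a newline\n" if $no_trailing_newline !~ /\n\z/;
```

Both messages are printed, which illustrates why a string such as "foo\nbar" has to be written as a quoted string.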

lexln()

Read the next line of text from the current input. Return 0 if "at_eof", otherwise return 1.

at_eof()

Return true if current input file/string is exhausted, false otherwise.

parse()

Read Rlist language productions from current input. This is a fast, non-recursive parser driven by the parser map %Data::Rlist::Rules, and fed by "lex". It is called internally by "read". parse returns an array- or hash-reference, or undef in case of parsing "errors".

compile(DATA[, OPTIONS, FH])

Build Rlist text from DATA:

  • Reference-types SCALAR, HASH, ARRAY and REF are compiled into text, whether blessed or not.

  • Reference-types CODE are compiled depending on the "code_refs" setting in OPTIONS.

  • Reference-types GLOB (typeglob-refs), IO and FORMAT (file- and directory handles) cannot be dissolved, and are compiled into the strings "?GLOB?", "?IO?" and "?FORMAT?".

  • undef'd values in arrays are compiled into the default Rlist "".

When FH is defined, compile directly to this file and return 1. Otherwise build a string and return a reference to it. This is the compilation function called when the OPTIONS argument passed to "write" is not omitted, and is neither "fast" nor "perl".

compile_fast(DATA)

Build Rlist text from DATA, as fast as actually possible with pure Perl:

  • Reference-types SCALAR, HASH, ARRAY and REF are compiled into text, whether blessed or not.

  • CODE, GLOB, IO and FORMAT are compiled into the strings "?CODE?", "?GLOB?", "?IO?" and "?FORMAT?".

  • undef'd values in arrays are compiled into the default Rlist "".

"compile_fast" is the default compilation function. It is called when you pass undef or "fast" in place of the OPTIONS parameter (see "write", "write_string"). Since "compile_fast" considers no compile options it will not call code, round numbers, detect self-referential data etc. Also "compile_fast" always compiles into a unique package variable to which it returns a reference.

compile_Perl(DATA)

Like "compile_fast", but do not compile Rlist text - compile DATA into Perl syntax. It can then be eval'd. This renders more compact and more exact output than Data::Dumper. For example, only strings are quoted. To enable this compilation function pass "perl" as the OPTIONS argument, or set the -options attribute of package objects to this string.

Auxiliary Functions

The utility functions in this section are generally useful when handling stringified data. Internally "quote7", "escape7", "is_integer" etc. apply precompiled regexes and precomputed ASCII tables. "split_quoted" and "parse_quoted" simplify "Text::ParseWords". "round" and "equal" are working solutions for floating-point numbers. "deep_compare" is a smart function to "diff" two Perl variables. All these functions are very fast and mature.

is_integer(SCALAR-REF)

Returns true when a scalar looks like a positive or negative integer constant. The function applies the compiled regex $Data::Rlist::REInteger.

is_number(SCALAR-REF)

Test for strings that look like numbers. is_number can be used to test whether a scalar looks like an integer/float constant (numeric literal). The function applies the compiled regex $Data::Rlist::REFloat. Note that it doesn't match

- leading or trailing whitespace,

- lexical conventions such as the "0b" (binary), "0" (octal), "0x" (hex) prefix to denote a number-base other than decimal,

- Perl's legible numbers, e.g. 3.14_15_92, and

- the IEEE 754 notations of Infinity and NaN.

See also

    $ perldoc -q "whether a scalar is a number"

is_symbol(SCALAR-REF)

Test for symbolic names. is_symbol can be used to test whether a scalar looks like a symbolic name. Such strings need not be quoted. Rlist defines symbolic names as a superset of C identifier names:

    [a-zA-Z_0-9]                    # C/C++ character set for identifiers
    [a-zA-Z_0-9\-/\~:\.@]           # Rlist character set for symbolic names

    [a-zA-Z_][a-zA-Z_0-9]*                  # match C/C++ identifier
    [a-zA-Z_\-/\~:@][a-zA-Z_0-9\-/\~:\.@]*  # match Rlist symbolic name

For example, names such as std::foo, msg.warnings, --verbose, calculation-info need not be quoted.
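The symbolic-name pattern listed above can be tried out in isolation. The sketch below copies that pattern and anchors it at both ends; looks_like_symbol is a hypothetical stand-in that takes a plain scalar, and it is not claimed to be identical to the regex that is_symbol applies internally.

```perl
use strict;
use warnings;

# Pattern copied from the character sets listed above, anchored at both ends.
my $symbol_re = qr{\A[a-zA-Z_\-/~:@][a-zA-Z_0-9\-/~:.@]*\z};

# Hypothetical stand-in for is_symbol; takes a plain scalar, not a reference.
sub looks_like_symbol { return $_[0] =~ $symbol_re ? 1 : 0 }

for my $name ('std::foo', 'msg.warnings', '--verbose', 'calculation-info',
              'two words', '9lives') {
    printf "%-18s %s\n", $name,
        looks_like_symbol($name) ? 'symbolic name' : 'not a symbolic name';
}
```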

is_value(SCALAR-REF)

Returns true when a scalar is an integer, a number, a symbolic name or some quoted string.

is_random_text(SCALAR-REF)

The opposite of "is_value". Such scalars will be turned into quoted strings by "compile" and "compile_fast".

quote7(TEXT)
escape7(TEXT)

Converts TEXT into 7-bit-ASCII. All characters not in the set of the 95 printable ASCII characters are escaped. The following codes will be converted to escaped octal numbers, i.e. 3 digits prefixed by a backslash:

    0x00 to 0x1F
    0x80 to 0xFF
    " ' \

The difference between the two functions is that quote7 additionally places TEXT into double-quotes. For example, quote7(qq'"Früher Mittag\n"') returns "\"Fr\374her Mittag\n\"", while escape7 returns \"Fr\374her Mittag\n\"
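The escaping rule can be approximated with a single substitution. The sketch below is a hypothetical re-implementation for illustration only: the names escape7_sketch and quote7_sketch are made up, and the table of short escapes (\n, \t, \r, \" and \\) is an assumption chosen so that the result agrees with the example above. All other characters from the listed ranges become three octal digits.

```perl
use strict;
use warnings;

# Short escapes assumed for this sketch; everything else becomes octal.
my %short = ("\n" => '\n', "\t" => '\t', "\r" => '\r',
             '"'  => '\"', '\\' => '\\\\');

# Hypothetical re-implementation of the escaping rule, for illustration only.
sub escape7_sketch {
    my ($text) = @_;
    $text =~ s{([\x00-\x1F\x80-\xFF"'\\])}
              { exists $short{$1} ? $short{$1} : sprintf('\\%03o', ord $1) }ge;
    return $text;
}

# Same as above, but wrapped in double quotes.
sub quote7_sketch { return '"' . escape7_sketch($_[0]) . '"' }

# "\xFC" is the Latin-1 code of the u-umlaut in the example string.
print quote7_sketch(qq{"Fr\xFCher Mittag\n"}), "\n";
```

For the example string the sketch prints the same text that is shown for quote7 above.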

maybe_quote7(TEXT)

Return quote7(TEXT) if "is_random_text"(TEXT); otherwise (TEXT defines a symbolic name or number) return TEXT.

maybe_unquote7(TEXT)

Return unquote7(TEXT) when TEXT is enclosed by double-quotes; otherwise returns TEXT.

unquote7(TEXT)
unescape7(TEXT)

Reverses what "quote7" and "escape7" did with TEXT.

unhere(HERE-DOC-STRING[, COLUMNS, FIRSTTAB, DEFAULTTAB])

Combines recipes 1.11 and 1.12 from the Perl Cookbook. HERE-DOC-STRING shall be a here-document. The function checks whether each line begins with a common prefix, and if so, strips that off. If there is no common prefix, it takes the amount of leading whitespace found on the first line and removes that much from each subsequent line.

Unless COLUMNS is defined returns the new here-doc-string. Otherwise, takes the string and reformats it into a paragraph having no line more than COLUMNS characters long. FIRSTTAB will be the indent for the first line, DEFAULTTAB the indent for every subsequent line. Unless passed, FIRSTTAB and DEFAULTTAB default to the empty string "".

split_quoted(INPUT[, DELIMITER])
parse_quoted(INPUT[, DELIMITER])

Divide the string INPUT into a list of strings. DELIMITER is a regular expression specifying where to split (default: '\s+'). The functions won't split at DELIMITERs inside quotes, or which are backslashed.

parse_quoted works like split_quoted but additionally removes all quotes and backslashes from the split fields. Both functions effectively simplify the interface of Text::ParseWords. In an array context they return a list of substrings, otherwise the count of substrings. An empty array is returned in case of unbalanced double-quotes, e.g. split_quoted('foo,"bar').

EXAMPLES

    sub split_and_list($) {
        print ($i++, " '$_'\n") foreach split_quoted(shift)
    }

    split_and_list(q("fee foo" bar))

        0 '"fee foo"'
        1 'bar'

    split_and_list(q("fee foo"\ bar))

        0 '"fee foo"\ bar'

The default DELIMITER '\s+' handles newlines. split_quoted("foo\nbar\n") returns ('foo', 'bar', '') and hence can be used to split a large string of unchomp'd input lines into words:

    split_and_list("foo  \r\n bar\n")

        0 'foo'
        1 'bar'
        2 ''

The DELIMITER matches everywhere outside of quoted constructs, so in case of the default '\s+' you may want to remove leading/trailing whitespace. Consider

    split_and_list("\nfoo")
    split_and_list("\tfoo")

        0 ''
        1 'foo'

and

    split_and_list(" foo ")

        0 ''
        1 'foo'
        2 ''

parse_quoted additionally removes all quotes and backslashes from the split fields:

    sub parse_and_list($) {
        print ($i++, " '$_'\n") foreach parse_quoted(shift)
    }

    parse_and_list(q("fee foo" bar))

        0 'fee foo'
        1 'bar'

    parse_and_list(q("fee foo"\ bar))

        0 'fee foo bar'
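Since both functions are described as simplifying Text::ParseWords, the difference between them can be reproduced with that core module alone. The sketch below calls Text::ParseWords::parse_line directly; it is not the code of split_quoted or parse_quoted, only an illustration of the "keep" flag that distinguishes the two behaviors.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

my $input = q("fee foo"\ bar);

# keep = 1: quotes and backslashes survive, as shown for split_quoted.
my @kept  = parse_line('\s+', 1, $input);

# keep = 0: quotes and backslashes are removed, as shown for parse_quoted.
my @naked = parse_line('\s+', 0, $input);

print "kept:  '$_'\n" for @kept;
print "naked: '$_'\n" for @naked;
```

In both calls the backslashed blank does not split the input, so each list holds a single field.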

MORE EXAMPLES

String 'field\ one "field\ two"':

    ('field\ one', '"field\ two"')  # split_quoted
    ('field one', 'field two')      # parse_quoted

String 'field\,one, field", two"' with a DELIMITER of '\s*,\s*':

    ('field\,one', 'field", two"')  # split_quoted
    ('field,one', 'field, two')     # parse_quoted

Split a large string $soup (mnemonic: slurped from a file) into lines, at LF or CR+LF:

    @lines = split_quoted($soup, '\r*\n');

Then transform all @lines by correctly splitting each line into "naked" values:

    @table = map { [ parse_quoted($_, '\s*,\s*') ] } @lines;

Here is some more complete code to parse a .csv-file with quoted fields and escaped commas:

    open my $fh, '<', 'foo.csv' or die $!;
    local $/;                   # enable localized slurp mode
    my $content = <$fh>;        # slurp whole file at once
    close $fh;
    my @lines = split_quoted($content, '\r*\n');
    die q(unbalanced " in input) unless @lines;
    # one array-ref of naked field values per input line
    my @table = map { [ parse_quoted($_, '\s*,\s*') ] } @lines;

In essence this is what "read_csv" does. "deep_compare" allows you to test what "split_quoted" and "parse_quoted" return. For example, the following code shall never die:

    croak if deep_compare([split_quoted("fee fie foo")], ['fee', 'fie', 'foo']);
    croak if deep_compare( parse_quoted('"fee fie foo"'), 1);

equal(NUM1, NUM2[, PRECISION])

"equal" returns true if NUM1 and NUM2 are equal to PRECISION number of decimal places (default: 6). For details see "round".

round(NUM1[, PRECISION])

Round the floating-point number NUM1 (a string or number scalar) to PRECISION decimal places (default: 6). "equal" compares NUM1 and NUM2 after rounding them this way.

When the "precision" compile option is defined, "round" is called during compilation on all numbers.

Normally round will return a number in fixed-point notation. When the package-global $Data::Rlist::RoundScientific is true, however, round formats the number in either normal or exponential (scientific) notation, whichever is more appropriate for its magnitude. This differs slightly from fixed-point notation in that insignificant zeroes to the right of the decimal point are not included. Also, the decimal point is not included on whole numbers. For example, round(42) does not return 42.000000, and round(0.12) returns 0.12, not 0.120000.

MACHINE ACCURACY

One needs a function like equal to compare floats, because IEEE 754 single- and double-precision values are not exact, in contrast to the numbers they represent. On all machines non-integer numbers are only an approximation to the numeric truth. In particular, floating-point arithmetic is not associative. For example, given three floats a, b and c, the result of (a+b)+c might be different from that of a+(b+c). For another example, it is a mathematical truth that (a * b) * c = a * (b * c), but not necessarily in a computer.

Each machine has its own accuracy, called the machine epsilon, which is the difference between 1 and the smallest exactly representable number greater than one. Most of the time floats can only be compared after they have been rounded to a certain number of decimal places. In general this is the case when two floats that result from a numeric operation are compared - but not two constants. (Constants are accurate due to the lexical conventions of the language. The Perl and C syntaxes for numbers simply won't allow you to write down inaccurate numbers.)

See also recipes 2.2 and 2.3 in the Perl Cookbook.
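The following sketch makes the point concrete. It is an assumption about how such a comparison can be done, not the code of "equal": the hypothetical approx_equal formats both numbers with sprintf to the same number of decimal places and compares the resulting strings. The loop before it estimates the machine epsilon of the running perl.

```perl
use strict;
use warnings;

# Hypothetical stand-in for equal(): compare after rounding both numbers
# to $places decimal places (6 by default, as documented for equal).
sub approx_equal {
    my ($x, $y, $places) = @_;
    $places = 6 unless defined $places;
    return sprintf("%.${places}f", $x) eq sprintf("%.${places}f", $y);
}

# Machine epsilon: halve until 1 + eps/2 is no longer distinguishable from 1.
my $eps = 1.0;
$eps /= 2 while 1.0 + $eps / 2 > 1.0;

my $sum = 0.1 + 0.2;
printf "machine epsilon:  %g\n", $eps;
printf "0.1 + 0.2 == 0.3: %s\n", ($sum == 0.3 ? 'yes' : 'no');
printf "approx_equal:     %s\n", (approx_equal($sum, 0.3) ? 'yes' : 'no');
```

The printed epsilon and the result of the exact == test depend on the floating-point type your perl was built with; the rounded comparison does not.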

EXAMPLES

    CALL                    RETURNS NUMBER
    ----                    --------------
    round('0.9957', 3)       0.996
    round(42, 2)             42
    round(0.12)              0.120000
    round(0.99, 2)           0.99
    round(0.991, 2)          0.99
    round(0.99, 1)           1.0
    round(1.096, 2)          1.10
    round(+.99950678)        0.999510
    round(-.00057260)       -0.000573
    round(-1.6804e-6)       -0.000002

deep_compare(A, B[, PRECISION, TRACE_FLAG])

Compare and analyze two numbers, strings or references. Generates a list of messages describing exactly all unequal data. Hence, for any Perl data $a and $b one can assert:

    croak "$a differs from $b" if deep_compare($a, $b);

When PRECISION is defined, all numbers in A and B are "round"'d before actually comparing them. When TRACE_FLAG is true, progress is traced.

RESULT

Returns an array of messages, each describing unequal data, or data that cannot be compared because of type- or value-mismatching. The array is empty when deep comparison of A and B found no unequal numbers or strings, and no differing types.
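The idea of returning one message per mismatch can be illustrated with a much reduced comparison function. The sketch below is hypothetical (diff_data and its message texts are made up for this example) and supports neither PRECISION nor TRACE_FLAG; it merely shows how a recursive walk over arrays and hashes collects messages.

```perl
use strict;
use warnings;
use Scalar::Util qw(reftype);

# Hypothetical, much reduced cousin of deep_compare: returns one message
# per mismatch, or an empty list when both structures are equal.
sub diff_data {
    my ($x, $y, $path) = @_;
    $path = 'data' unless defined $path;

    my $type_x = reftype($x) || '';
    my $type_y = reftype($y) || '';
    return ("$path: type mismatch ('$type_x' vs. '$type_y')")
        if $type_x ne $type_y;

    if ($type_x eq 'ARRAY') {
        return ("$path: array sizes differ") if @$x != @$y;
        return map { diff_data($x->[$_], $y->[$_], "$path\[$_]") } 0 .. $#$x;
    }
    if ($type_x eq 'HASH') {
        my %seen = map { ($_ => 1) } keys(%$x), keys(%$y);
        return map {
            exists $x->{$_} && exists $y->{$_}
                ? diff_data($x->{$_}, $y->{$_}, "$path\{$_}")
                : ("$path\{$_}: key exists on one side only")
        } sort keys %seen;
    }

    # Plain scalars: undef only equals undef, everything else as strings.
    return () if !defined $x && !defined $y;
    return ("$path: one value is undefined") if !defined $x || !defined $y;
    return $x eq $y ? () : ("$path: '$x' differs from '$y'");
}

my @messages = diff_data({ a => [1, 2], b => 'x' }, { a => [1, 3], b => 'x' });
print "$_\n" for @messages;
```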

EXAMPLES

The result is line-oriented, and for each mismatch it returns a single message. For a simple example,

    Data::Rlist::deep_compare(undef, 1)

yields

    <<undef>> cmp <<1>>   stop! 1st undefined, 2nd defined (1)

fork_and_wait(PROGRAM[, ARGS...])

Forks a process and waits for completion. The function extracts the exit-code, tests whether the process died and prints status messages on STDERR. fork_and_wait hence is a handy wrapper around the built-in system and exec functions. Returns an array of three values:

    ($exit_code, $failed, $coredump)

$exit_code is -1 when the program failed to execute (e.g. it wasn't found or the current user has insufficient rights). Otherwise $exit_code is between 0 and 255. When the program died on receipt of a signal (like SIGINT or SIGQUIT) then $signal stores it. When $coredump is true the program died and a core-file was written.

synthesize_pathname(TEXT...)

Concatenates all TEXT strings and forms them into a symbolic name that can be used as a pathname. synthesize_pathname is useful for concatenating strings while converting all characters that do not qualify as filename-characters into "_" and "-". The result can be used not only as a file- or URL name, but at the same time as a hash key, database name etc.

Compile Options

The format of the compiled text and the behavior of "compile" can be controlled by the OPTIONS parameter of "write", "write_string" etc. The argument is a hash defining how the Rlist text shall be formatted. The following pairs are recognized:

'precision' => PLACES

Make "compile" round all numbers to PLACES decimal places, by calling "round" on each scalar that looks like a number. By default PLACES is undef, which means floats are not rounded.

'scientific' => FLAG

Causes "compile" to override $Data::Rlist::RoundScientific with FLAG. See "round".

'code_refs' => TOKEN

Defines how "compile" shall treat CODE references. Legal values for TOKEN are 0 (the default), "call" and "deparse".

- 0 compiles subroutine references into the string "?CODE?".

- "call" calls the code, then compiles the return value.

- "deparse" serializes the code using B::Deparse (reproducing the Perl source).
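The three treatments can be compared side by side with core Perl. The sketch below does not call "compile"; it only shows what each TOKEN value amounts to for a single subroutine reference, using the coderef2text method of the core module B::Deparse for the last case.

```perl
use strict;
use warnings;
use B::Deparse;

my $code = sub { return 'hello' };

# 0 (default): a fixed placeholder string.
my $placeholder = '?CODE?';

# "call": run the subroutine and use its return value.
my $called = $code->();

# "deparse": ask B::Deparse for Perl source text of the subroutine body.
my $source = B::Deparse->new->coderef2text($code);

print "placeholder: $placeholder\n";
print "call:        $called\n";
print "deparse:     $source\n";
```

The exact layout of the deparsed text depends on the perl version and on the pragmas in effect where the subroutine was defined.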

'threads' => COUNT

If enabled, "compile" internally uses multiple threads. Note that this can speed up compilation only on machines with at least COUNT CPUs.

'here_docs' => FLAG

If enabled, strings with at least two newlines in them are written as here-documents, when possible. To qualify as a here-document a string has to have at least two LFs ("\n"), one of which must terminate it.
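The qualification rule can be written as a small predicate. The function name qualifies_as_here_doc is hypothetical; the sketch only encodes the two conditions stated above and ignores whatever else "when possible" may involve.

```perl
use strict;
use warnings;

# Hypothetical predicate for the rule above: at least two LFs,
# one of which terminates the string.
sub qualifies_as_here_doc {
    my ($string) = @_;
    my $linefeeds = ($string =~ tr/\n//);    # count LFs without modifying
    return $linefeeds >= 2 && $string =~ /\n\z/ ? 1 : 0;
}

for my $candidate ("a\nb\n", "a\n", "a\nb\nc", "foo\nbar") {
    (my $visible = $candidate) =~ s/\n/\\n/g;
    printf "%-10s %s\n", $visible,
        qualifies_as_here_doc($candidate) ? 'here-doc' : 'quoted string';
}
```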

'auto_quote' => FLAG

When true (default) do not quote strings that look like identifiers (see "is_symbol"). When false quote all strings. Hash keys are not affected.

"write_csv" and "write_conf" interpret this flag differently: false means not to quote at all; true quotes only strings that don't look like numbers and that aren't yet quoted.

'outline_data' => NUMBER

When NUMBER is greater than 0 use "eol_space" (linefeed) to split data to many lines. It will insert a linefeed after every NUMBERth array value.

'outline_hashes' => FLAG

If enabled, and "outline_data" is also enabled, prints { and } on distinct lines when compiling Perl hashes with at least one pair.

'separator' => STRING

The comma-separator string to be used by "write_csv". The default is ','.

'delimiter' => REGEX

Field-delimiter for "read_csv". There is no default value. To read configuration files, for example, you may use '\s*=\s*' or '\s+'. To read CSV-files use e.g. '\s*[,;]\s*'.

The following options format the generated Rlist; normally you don't want to modify them:

'bol_tabs' => COUNT

Count of physical, horizontal TAB characters to use at the begin-of-line per indentation level. Defaults to 1. Note that we don't use blanks, because they blow up the size of generated text without measure.

'eol_space' => STRING

End-of-line string to use (the linefeed). For example, legal values are "", " ", "\n", "\r\n" etc. The default is undef, which means to use the current value of $/. Note that this is a compile-option that only affects "compile". When parsing files the builtin readline function is called, which uses $/.

'paren_space' => STRING

String to write after ( and {, and before } and ) when compiling arrays and hashes.

'comma_punct' => STRING
'semicolon_punct' => STRING

Comma and semicolon strings, which shall be at least "," and ";". No matter what, "compile" will always print the "eol_space" string after the "semicolon_punct" string.

'assign_punct' => STRING

String to make up key/value-pairs. Defaults to " = ".

Predefined Options

The OPTIONS parameter accepted by some package functions is either a hash-ref or the name of a predefined set:

'default'

Default if writing to a file.

'string'

Compact, no newlines/here-docs. Renders a "string of data".

'outlined'

Optimize the compiled Rlist for maximum readability.

'squeezed'

Very compact, no whitespace at all. For very large Rlists.

'perl'

Compile data in Perl syntax, using "compile_Perl", not "compile". The output then can be eval'd, but it cannot be "read" back.

'fast' or undef

Compile data as fast as possible, using "compile_fast", not "compile".

All functions that define an OPTIONS parameter do implicitly call "complete_options" to complete the argument from one of the predefined sets, and additionally from "default". Therefore you can always define nothing, or a "lazy subset of options". For example,

    my $obj = new Data::Rlist(-data => $thing);

    $obj->write('thing.rls', { scientific => 1, precision => 8 });
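The effect of passing a "lazy subset of options" can be pictured as a hash merge in which user-supplied pairs override defaults. The sketch below is not "complete_options": merge_options is a hypothetical helper, and %default holds only three illustrative entries whose values are the defaults documented above for 'bol_tabs', 'assign_punct' and 'precision'.

```perl
use strict;
use warnings;

# Illustrative defaults only; the values are the ones documented above
# for 'bol_tabs', 'assign_punct' and 'precision'.
my %default = (bol_tabs => 1, assign_punct => ' = ', precision => undef);

# Hypothetical merge: user-supplied pairs override the defaults.
sub merge_options {
    my ($user) = @_;
    return +{ %default, %{ $user || {} } };
}

my $options = merge_options({ scientific => 1, precision => 8 });

for my $name (sort keys %$options) {
    my $value = defined $options->{$name} ? $options->{$name} : 'undef';
    print "$name => '$value'\n";
}
```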

Exports

Example:

    use Data::Rlist qw/:floats :strings/;

Exporter Tags

:floats

Imports "equal", "round" and "is_number".

:strings

Imports "maybe_quote7", "quote7", "escape7", "unquote7", "unescape7", "unhere", "is_random_text", "is_number", "is_symbol", "split_quoted", and "parse_quoted".

:options

Imports "predefined_options" and "complete_options".

:aux

Imports "deep_compare", "fork_and_wait" and "synthesize_pathname".

Auto-Exported Functions

The following functions are implicitly imported into the caller's symbol table. (But you may say require Data::Rlist instead of use Data::Rlist to prohibit auto-import. See also perlmod.)

ReadData(INPUT[, FILTER, FILTER-ARGS])
ReadCSV(INPUT[, OPTIONS, FILTER, FILTER-ARGS])
ReadConf(INPUT[, OPTIONS, FILTER, FILTER-ARGS])

These are aliases for Data::Rlist::"read", Data::Rlist::"read_csv" and Data::Rlist::"read_conf".

EvaluateData(INPUT[, FILTER, FILTER-ARGS])

Like "ReadData" but implicitly call Data::Rlist::"evaluate_nanoscripts" in case parsing was successful.

WriteData(DATA[, OUTPUT, OPTIONS, HEADER])
WriteCSV(DATA[, OUTPUT, OPTIONS, COLUMNS, HEADER])
WriteConf(DATA[, OUTPUT, OPTIONS, HEADER])

These are aliases for Data::Rlist::"write", Data::Rlist::"write_string", Data::Rlist::"write_csv" and Data::Rlist::"write_conf". OPTIONS default to "default".

OutlineData(DATA[, OPTIONS])
StringizeData(DATA[, OPTIONS])
SqueezeData(DATA[, OPTIONS])

These are aliases for Data::Rlist::"write_string_value". OutlineData applies the predefined "outlined" options, while StringizeData applies "string" and SqueezeData "squeezed". When specified, OPTIONS are merged into the predefined set. For example,

    print "\n\$thing: ", OutlineData($thing, { precision => 12 });

rounds all numbers in $thing to 12 digits.

PrintData(DATA[, OPTIONS])

An alias for

    print OutlineData(DATA, OPTIONS);

KeelhaulData(DATA[, OPTIONS])
CompareData(A, B[, PRECISION, TRACE_FLAG])

These are aliases for "keelhaul" and "deep_compare". For example,

    use Data::Rlist;
        .
        .
    my($copy, $as_text) = KeelhaulData($thing);

EXAMPLES

String- and number values:

    "Hello, World!"
    foo                         # compiles to { 'foo' => undef }
    3.1415                      # compiles to { 3.1415 => undef }

Array values:

    (1, a, 4, "b u z")          # list of numbers/strings

    ((1, 2),
     (3, 4))                    # list of lists (2x2 matrix)

    ((1, a, 3, "foo bar"),
     (7, c, 0, ""))             # another list of lists

Here-document strings:

        $hello = ReadData(\<<HELLO)
        ( <<DEUTSCH, <<ENGLISH, <<FRANCAIS, <<CASTELLANO, <<KLINGON, <<BRAINF_CK )
    Hallo Welt!
    DEUTSCH
    Hello World!
    ENGLISH
    Bonjour le monde!
    FRANCAIS
    Ola mundo!
    CASTELLANO
    ~ nuqneH { ~ 'u' ~ nuqneH disp disp } name
    nuqneH
    KLINGON
    ++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++
    ..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.
    BRAINF_CK
    HELLO

Compiles $hello as

    [ "Hallo Welt!\n", "Hello World!\n", "Bonjour le monde!\n", "Ola mundo!\n",
      "~ nuqneH { ~ 'u' ~ nuqneH disp disp } name\n",
      "++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++\n..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.\n" ]

Configuration object as hash:

    {
        contribution_quantile = 0.99;
        default_only_mode = Y;
        number_of_runs = 10000;
        number_of_threads = 10;
        # etc.
    }

Altogether:

    Metaphysic-terms =
    {
        Numbers =
        {
            3.141592653589793 = "The ratio of a circle's circumference to its diameter.";
            2.718281828459045 = <<___;
The mathematical constant "e" is the unique real number such that the value of
the derivative (slope of the tangent line) of f(x) = e^x at the point x = 0 is
exactly 1.
___
            42 = "The Answer to Life, the Universe, and Everything.";
        };

        Words =
        {
            ACME = <<Value;
A fancy-free Company [that] Makes Everything: Wile E. Coyote's supplier of equipment and gadgets.
Value
            <<Key = <<Value;
foo bar foobar
Key
[JARGON] A widely used meta-syntactic variable; see foo for etymology.  Probably
originally propagated through DECsystem manuals [...] in 1960s and early 1970s;
confirmed sightings go back to 1972. [...]
Value
        };
    };

NOTES

The Random Lists (Rlist) syntax is inspired by NeXTSTEP's Property Lists. But Rlist is simpler, more readable and more portable. The Perl and C++ implementations are fast, stable and free. Markus Felten, with whom I worked for a few months on a project at Deutsche Bank, Frankfurt in summer 1998, drew my attention to Property Lists. He had implemented a Perl variant of it (http://search.cpan.org/search?dist=Data-PropertyList).

The term "Random" underlines the fact that

  • the language has four primitive/anonymous types;

  • the basic building block is a list, which is combined at random with other lists.

Hence the term Random does not mean aimless or accidental. Random Lists are arbitrary lists.

Data::Dumper

The main difference between Data::Dumper and Data::Rlist is that scalars will be properly encoded as number or string. Data::Dumper writes numbers always as quoted strings, for example

    $VAR1 = {
                'configuration' => {
                                    'verbose' => 'Y',
                                    'importance_sampling_loss_quantile' => '0.04',
                                    'distribution_loss_unit' => '100',
                                    'default_only' => 'Y',
                                    'num_threads' => '5',
                                            .
                                            .
                                   }
            };

where Data::Rlist writes

    {
        configuration = {
            verbose = Y;
            importance_sampling_loss_quantile = 0.04;
            distribution_loss_unit = 100;
            default_only = Y;
            num_threads = 5;
                .
                .
        };
    }

As one can see, Data::Dumper writes the data right in Perl syntax, which means the dumped text can be simply eval'd, and the data can be restored very fast. Rlists are not quite Perl-syntax: a dedicated parser is required. But in return Rlist text is portable and can be read from other programming languages such as C++.

With $Data::Dumper::Useqq enabled it was observed that Data::Dumper renders output significantly slower than "compile". This is actually surprising, since Data::Rlist tests for each scalar whether it is numeric, and truly quotes/escapes strings. Data::Dumper quotes all scalars (including numbers), and it does not escape strings. This may also result in some odd behaviors. For example,

    use Data::Dumper;
    print Dumper "foo\n";

yields

    $VAR1 = 'foo
    ';

while

    use Data::Rlist;
    PrintData "foo\n"

yields

    { "foo\n"; }

Finally, Data::Rlist generates smaller files. With the default $Data::Dumper::Indent of 2 Data::Dumper's output is 4-5 times that of Data::Rlist's. This is because Data::Dumper recklessly uses blanks, instead of horizontal tabulators, which blows up file sizes without measure.

Rlist vs. Perl Syntax

Rlists are not Perl syntax:

    RLIST    PERL
    -----    ----
     5;       { 5 => undef }
     "5";     { "5" => undef }
     5=1;     { 5 => 1 }
     {5=1;}   { 5 => 1 }
     (5)      [ 5 ]
     {}       { }
     ;        { }
     ()       [ ]

Debugging Data

To reduce recursive data structures (into true hierarchies) set $Data::Rlist::MaxDepth to an integer above 0. It then defines the depth below which "compile" shall not venture. The compilation of Perl data (into Rlist text) then continues, but on STDERR a message like the following is printed:

    ERROR: compile2() broken in deep ARRAY(0x101aaeec) (depth = 101, max-depth = 100)

This message will also be repeated as a comment when the compiled Rlist is written to a file. Furthermore $Data::Rlist::Broken is incremented by one. While the compilation continues, any attempt to venture deeper than $Data::Rlist::MaxDepth allows will effectively be blocked.

See "broken".

Speeding up Compilation (Explicit Quoting)

Much work has been spent to optimize Data::Rlist for speed. Still it is implemented in pure Perl (no XS). A rough estimation for Perl 5.8 is "each MB takes one second per GHz". For example, when the resulting Rlist file has a size of 13 MB, compiling it from a Perl script on a 3-GHz-PC requires about 5-7 seconds. Compiling the same data under Solaris, on a sparcv9 processor operating at 750 MHz, takes about 18-22 seconds.

The process of compiling can be sped up by calling "quote7" explicitly on scalars, that is, before calling "write" or "write_string". Big data sets may compile faster when "quote7" is called in advance for scalars that certainly do not qualify as symbolic names:

    use Data::Rlist qw/:strings/;

    $data{quote7($key)} = $value;
        .
        .
    Data::Rlist::write(\%data, "data.rlist");

instead of

    $data{$key} = $value;
        .
        .
    Data::Rlist::write(\%data, "data.rlist");

It depends on the case whether the first variant is faster: "compile" and "compile_fast" both have to call "is_random_text" on each scalar. When the scalar is already quoted, i.e., its first character is ", this test ought to run faster.

Internally "is_random_text" applies the precompiled regex $Data::Rlist::REValue. Note that the expression ($s!~$Data::Rlist::REValue) can be up to 20% faster than the equivalent is_random_text($s).

Quoting strings that look like numbers

Normally you don't have to care about strings, since un/quoting happens as required when reading/compiling Rlist or CSV text. A common problem, however, occurs when some string uses the same lexicography as numbers do.

Perl defines the string as the basic building block for all program data, then lets the program decide what strings mean. Analogously, in a printed book the reader has to decipher the glyphs and decide what evidence they hide. Printed text uses well-defined glyphs and typographic conventions, and finally the competence of the reader, to recognize numbers. But computers need to know the exact number type and format. Integer? Float? Hexadecimal? Scientific? Klingon? The Perl Cookbook recommends the use of a regular expression to distinguish number from string scalars (recipe 2.1).

In Rlist, string scalars that look like numbers need to be quoted explicitly. Otherwise, for example, the string scalar "-3.14" appears as -3.14 in the output, "007324" is compiled into 7324 etc. Such text is lost and read back as a number. Of course, in most cases this is just what you want. For hash keys, however, it might be a problem. One solution is to prefix the string with "_":

    my $s = '-9'; $s = "_$s";

Such strings do not qualify as a number anymore. In the C++ implementation it will then become some std::string, not a double. But the leading "_" has to be removed by the reading program. Perhaps a better solution is to explicitly call "quote7":

    use Data::Rlist qw/:strings/;

    $k = -9;
    $k = quote7($k);            # returns qq'"-9"'

    $k = 3.14_15_92;
    $k = quote7($k);            # returns qq'"3.141592"'

Again, the need to quote strings that look like numbers is a problem evident only in the Perl implementation of Rlist, since Perl is a language with weak types. With the C++ implementation of Rlist there's no need to quote strings that look like numbers.
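Why such text is "lost" can be seen in plain Perl, without Rlist. The sketch below shows how a number-like string changes its spelling once it is treated as a number, and how wrapping it in double quotes keeps the spelling. The quoting is done by simple concatenation here, which only mimics the result shown for quote7 above.

```perl
use strict;
use warnings;

# A string that looks like a number loses its spelling once it is
# treated as a number.
my $string    = '007324';
my $as_number = $string + 0;

# A numeric literal written with underscores is stored without them.
my $legible = 3.14_15_92;

# Wrapping the text in double quotes keeps it a string. This mimics the
# result shown for quote7 above; it is not the module's code.
my $quoted = '"' . $string . '"';

print "original:       $string\n";
print "read as number: $as_number\n";
print "legible number: $legible\n";
print "quoted:         $quoted\n";
```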

See also "write", "is_number", "is_symbol", "is_random_text" and http://en.wikipedia.org/wiki/American_Standard_Code_for_Information_Interchange.

Installing Rlist.pm locally

Installing CPAN packages usually requires administrator privileges. Another way is to copy the Rlist.pm file into a directory of your choice. Instead of use Data::Rlist;, however, you then use the following code. It will find Rlist.pm also in . and ~/bin, and it calls the Exporter explicitly:

    BEGIN {
        $0 =~ /[^\/]+$/;
        push @INC, $`||'.', "$ENV{HOME}/bin";
        require Rlist;
        Data::Rlist->import();
        Data::Rlist->import(qw/:floats :strings/);
    }

An Rlist-Mode for Emacs

    (define-generic-mode 'rlist-generic-mode
       (list "//" ?#)
       nil
       '(;; Punctuators
         ("\\([(){},;?=]\\)" 1 'cperl-array-face)
         ;; Numbers
         ("\\([-+]?[0-9]+\\(\\.[0-9]+\\)?[dDlL]?\\)" 1 'font-lock-constant-face)
         ;; Identifier names
         ("\\([-~A-Za-z_][-~A-Za-z0-9_]+\\)" 1 'font-lock-variable-name-face))
       (list "\\.[rR][lL][iI]?[sS]$")
       ;; Extra functions to setup mode.
       (list 'generic-bracket-support
             '(lambda()
               (require 'cperl-mode)
               ;;(hl-line-mode t)                      ; highlight cursor-line
               (local-set-key [?\t] (lambda()(interactive)(cperl-indent-command)))
               (local-set-key [?\M-q] 'fill-paragraph)
               (set-fill-column 100)))
       "Generic mode for Random Lists (Rlist) files.")

Implementation Details

Perl

Package Dependencies

Data::Rlist depends only on few other packages:

    Exporter
    Carp
    strict
    integer
    Sys::Hostname
    Scalar::Util        # deep_compare() only
    Text::Wrap          # unhere() only
    Text::ParseWords    # split_quoted(), parse_quoted() only

Data::Rlist is free of $&, $` or $'. Reason: once Perl sees that you need one of these meta-variables anywhere in the program, it has to provide them for every pattern match. This may substantially slow your program (see also perlre).

A Short Story of Typeglobs

This is supplementary information for "compile", the function internally called by "write" and "write_string". We will discuss why "compile", "compile_fast" and "compile_Perl" transliterate typeglobs and typeglob-refs into "?GLOB?". This is an attempted explanation.

TYPEGLOBS ARE A PERL IDIOSYNCRASY

Perl uses a symbol table per package to map symbolic names like x to Perl values. Typeglob (aka glob) objects are complete symbol table entries, as hash values. The symbol table hash (stash) is an ordinary hash, named like the package with two colons appended. In the package stash the symbol name is mapped to a memory address which holds the actual data of your program. In Perl we do not have real global values, only package globals. Any Perl code is always running in one package or another.

The main symbol table's name is %main::, or %::. In the C implementation of the Perl interpreter, the main symbol table is simply a global variable, called the defstash (default stash). The symbol Data:: in stash %:: addresses the stash of package Data, and the symbol Rlist:: in the stash %::Data:: addresses the stash of package Data::Rlist.

Typeglobs are an idiosyncrasy of Perl: different types need only one stash entry, so that one symbol can name all types of Perl data (scalars, arrays, hashes) and nondata (functions, formats, I/O handles). The symbol x is mapped to the typeglob *x. In the typeglob coexist the scalar $x, the array @x, the hash %x, the code &x and the I/O-handle or format specifier x.

Most of the time only one glob slot is used. Do typeglobs waste space then? Probably not. (Although some authors believe that.) Other scripting languages, such as Python, do not force decoration characters -- the interpreter already knows the type. In terms of C, symbol table entries are then struct/union-combinations with a type field, a double field, a char* field and so forth. Perl symbols follow a contrary design: globs are really pointer sets to low-level structs that hold numbers, strings etc. Naturally pointers to non-existing values are NULL, and so no type field is required. Perl interpreters can now implement fine-grained smart-pointers for reference-counting and copy-on-write, and need not necessarily handle abstract unions. In theory, the garbage-collector should have "increased recycling opportunities." We do know, for example, that perl is very greedy with RAM: it almost never returns any memory to the operating system.

Modifying $x in a Perl program won't change %x, because the typeglob *x is interposed between the stash and the program's actual values for $x, @x etc. The sigil * serves as wildcard for the other sigils %, @, $ and &. (Hint: a sigil is a symbol "created for a specific magical purpose"; the name derives from the Latin sigillum = seal.)
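The independence of the slots can be checked with a short script. This is a sketch; the symbol name item is chosen only for this example.

```perl
use strict;
use warnings;

our ($item, @item, %item);      # three slots of the one typeglob *item
$item = 'scalar';
@item = (1, 2, 3);
%item = (key => 'value');
sub item { 'code' }             # fills the CODE slot of the same glob

$item = 'changed';              # touches the SCALAR slot only

print "$item | @item | $item{key} | ", item(), "\n";
# prints "changed | 1 2 3 | value | code"

# Each slot reached through the glob is the very same value as the variable.
print "same scalar\n" if *item{SCALAR} == \$item;
print "same array\n"  if *item{ARRAY}  == \@item;
print "same hash\n"   if *item{HASH}   == \%item;
```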

Typeglobs cannot be dissolved by "compile", because when (e.g.) $x and %x are in use, the glob *x does not return some useful value like

    (SCALAR => \$x, HASH => \%x)

Typeglobs are also not interpolated in strings. perl always plays the ball back. A typeglob-value is simply a string:

    $ perl -e '$x=1; @x=(1); print *x'
    *main::x

    $ perl -e 'print "*x is not interpolated"'
    *x is not interpolated

    $ perl -e '$x = "this"; print "although ".*x." could be a string"'
    although *main::x could be a string

As one can see, even when only $x is defined the *x does not return its value. Typeglobs (stash entries) are arranged by perl on the fly, even with the use strict pragma in effect:

    $ perl -e 'package nirvana; use strict; print *x'
    *nirvana::x

Each typeglob is a full path into the perl stashes, down from the defstash:

    $ perl -e 'print "*x is \"*main::x\"" if *x eq "*main::x"'
    *x is "*main::x"

    $ perl -e 'package nirvana; sub f { local *g=shift; print *g."=$g" }; package main; $x=42; nirvana::f(*x)'
    *main::x=42

GLOB-REFS

In the C implementation of Perl, typeglobs have the struct-type GV for "Glob value". Each GV is merely a set of pointers to sub-objects for scalars, arrays, hashes etc. In Perl the special syntax *x{ARRAY} accesses the array-sub-object, and is another way to say \@x. The backslash operator applied to a typeglob, as in \*foo, returns a typeglob-ref, or globref. So the Perl backslash operator \ works like the address-of operator & in C.

    $ perl -e 'print *::'
    *main::main::               # ???

    $ perl -e '$x = 42; print $::{x}'
    *main::x                    # typeglob-value 'x' in the stash

    $ perl -e 'print \*::'
    GLOB(0x10010f08)            # some globref

Little do we know what happens inside perl, when we assign REFs to typeglobs:

    $ perl -e '$x = 42; *x = \$x; print $x'
    42
    $ perl -e '$y = 42; *x = \$y; print $x'
    42
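At least the visible effect of the second one-liner can be shown: assigning a reference to a typeglob replaces only the slot that matches the reference's type. In the sketch below $x becomes an alias for $y, while @x is left alone.

```perl
use strict;
use warnings;

our ($x, @x, $y);
@x = (1, 2);
$y = 42;

*x = \$y;                       # replace only the SCALAR slot of *x

print "$x\n";                   # prints 42
$x++;                           # writes through the alias
print "$y\n";                   # prints 43
print "@x\n";                   # prints "1 2": the ARRAY slot is unchanged
print "aliased\n" if \$x == \$y;
```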

In Perl4 you had to pass typeglob-refs to call functions by references (the backslash-operator was not yet "invented"). Since Perl5 saw the light of day, typeglob-refs can be considered as artefacts. Note, however, that these veterans are still faster than true references, because true references are themselves stored in a typeglob (as REF type) and so need to be dereferenced. Globrefs can be used directly (as GV*'s) by perl. For example,

    sub f1 { my $bar = shift; ++$$bar }     # takes a true reference
    sub f2 { local *bar = shift; ++$bar }   # takes a typeglob

    f1(\$x);                  # increments $x
    f2(*x);                   # dto., but faster

GLOB-ALIASES

Typeglob-aliases offer another interesting application for typeglobs. For example, *bar=*x aliases the symbol bar in the current stash, so that x and bar point to the same typeglob. This means that when you declare sub x {} after casting the alias, bar is x.
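A small sketch of such an alias follows. The names orig and alias are hypothetical; the extra our declaration exists only to keep use strict quiet.

```perl
use strict;
use warnings;

our ($orig, @orig);
our ($alias, @alias);           # declared only to satisfy 'use strict'

*alias = *orig;                 # both symbols now share one typeglob

$orig = 'scalar';
@orig = (1, 2, 3);
sub orig { 'called orig' }      # lives in the CODE slot of *orig

print "$alias\n";               # prints "scalar"
print "@alias\n";               # prints "1 2 3"
my $result = alias();           # resolved at runtime through the shared glob
print "$result\n";              # prints "called orig"
```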

This smells like a free lunch. The penalty, however, is that the bar symbol cannot be easily removed from the stash. One way is to say local *bar, which temporarily assigns a new typeglob to bar with all pointers zeroized:

    package nirvana;

    sub f { print $bar; }
    sub g { local *bar; $bar = 42; f(); }

    package main;

    nirvana::g();

Running this code as a Perl script prints the number assigned in g. f sees that value because local is dynamically scoped: the temporary typeglob stays in effect for everything called from g. The local statement temporarily puts a fresh bar symbol into the package stash %nirvana::, i.e., the same stash in which f and g exist, and restores the previous one when g returns.
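The restoring step can be made visible with a variant of the script that gives $bar a value before g is called. This is a sketch under use strict; the initial value 'outer' is an addition for illustration.

```perl
use strict;
use warnings;

package nirvana;

our $bar = 'outer';             # value before g() is called (illustration only)

sub f { return $bar }
sub g { local *bar; $bar = 42; return f() }

package main;

print nirvana::g(), "\n";       # prints 42
print $nirvana::bar, "\n";      # prints "outer": the old glob is back
```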

*foo{THING}s

The *x{NAME} expression family is fondly called "the *foo{THING} syntax":

    $scalarref = *x{SCALAR};
    $arrayref  = *ARGV{ARRAY};
    $hashref   = *ENV{HASH};
    $coderef   = *handlers{CODE};

    $ioref     = *STDIN{IO};
    $ioref     = *STDIN{FILEHANDLE};    # same as *STDIN{IO}

    $globref   = *x{GLOB};
    $globref   = \*x;                   # same as *x{GLOB}
    $undef     = *x{THIS_NAME_IS_NOT_SUPPORTED}; # yields undef

    die unless defined *x{SCALAR};      # ok -> will not die
    die unless defined *x{GLOB};        # ok
    die unless defined *x{HASH};        # error -> will die

When THINGs are accessed this way a few rules apply. First of all, *foo{THING}s are not hashes. The syntax is a stopgap:

    $ perl -e 'print \*x, *x{GLOB}, \*x{GLOB}'
    GLOB(0x100110b8)GLOB(0x100110b8)REF(0x1002e944)

    $ perl -e '$x=1; exists *x{GLOB}'
    exists argument is not a HASH or ARRAY element at -e line 1.

A *foo{THING} is undef if the requested THING hasn't been used yet. The exception is *foo{SCALAR}, which then returns a reference to an anonymous scalar:

    $ perl -e 'print "nope" unless defined *foo{HASH}'
    nope
    $ perl -e 'print *foo{SCALAR}'
    SCALAR(0x1002e94c)

In Perl5 it is still not possible to get a reference to an I/O-handle (file-, directory- or socket handle) using the backslash operator. When a function requires an I/O-handle you must therefore pass a globref. More precisely, it is possible to pass an IO::Handle-reference, a typeglob or a typeglob-ref as the filehandle. This is obscure, and not only for new Perl programmers.

    sub logprint($@) {
        my $fh = shift;
        print $fh map { "$_\n" } @_;
    }

    logprint(*STDOUT{IO}, 'foo');   # pass IO-handle -> IO::Handle=IO(0x10011b44)
    logprint(*STDOUT, 'bar');       # ok, pass typeglob-value -> '*main::STDOUT'
    logprint(\*STDOUT, 'bar');      # ok, pass typeglob-ref -> 'GLOB(0x10011b2c)'
    logprint(\*STDOUT{IO}, 'nope'); # ERROR -> won't accept 'REF(0x10010fe0)'

It is very amusing that Perl, although refactoring UNIX in form of a language, does not make clear what a file- or socket-handle is. The global symbol STDOUT is actually an IO::Handle object, which perl had silently instantiated. To functions like print, however, you may pass an IO::Handle, globname or globref.
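That the three forms are interchangeable can be checked by printing into an in-memory file. The sketch below assumes a perl built with PerlIO (the default since 5.8); the handle name LOG and the buffer variables are hypothetical.

```perl
use strict;
use warnings;

sub logprint {
    my $fh = shift;
    print {$fh} map("$_\n", @_);
}

# Write into a string so the result can be inspected afterwards.
my $buffer = '';
open(LOG, '>', \$buffer) or die "cannot open in-memory file: $!";

logprint(*LOG,     'typeglob');     # typeglob value
logprint(\*LOG,    'globref');      # typeglob reference
logprint(*LOG{IO}, 'IO object');    # the IO slot of the glob
close(LOG);

print $buffer;

# A lexical filehandle is itself a globref.
my $other = '';
open(my $lexical, '>', \$other) or die "cannot open in-memory file: $!";
print ref($lexical), "\n";          # prints "GLOB"
close($lexical);
```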

VIOLATING STASHES

As we saw we can access the Perl guts without using a scalpel. Surprisingly, it is also possible to touch the stashes themselves:

    $ perl -e '$x = 42; *x = $x; print *x'
    *main::42

    $ perl -e '$x = 42; *x = $x; print *42'
    *main::42

By assigning the scalar value $x to *x we have demolished the stash (at least, logically): neither $42 nor $main::42 is accessible. Symbols like 42 are invalid, because 42 is a numeric literal, not a string literal.

    $ perl -e '$x = 42; *x = $x; print $main::42'

Nevertheless it is easy to confuse perl this way:

    $ perl -e 'print *main::42'
    *main::42

    $ perl -e 'print 1*9'
    9

    $ perl -e 'print *9'
    *main::9

    $ perl -e 'print *42{GLOB}'
    GLOB(0x100110b8)

    $ perl -e '*x = 42; print $::{42}, *x'
    *main::42*main::42

    $ perl -v
    This is perl, v5.8.8 built for cygwin-thread-multi-64int
    (with 8 registered patches, see perl -V for more detail)

Of course these behaviors are not reliable, and may disappear in future versions of perl. In German you say "Schmutzeffekt" (dirt effect) for certain mechanical effects that occur unintentionally, because machines and electrical circuits are not perfect, and neither is software. However, "Schmutzeffekte" are neither bugs nor features; they are phenomena.

LEXICAL VARIABLES

Lexical variables (my variables) are not stored in stashes, and do not require typeglobs. These variables are stored in a special array, the scratchpad, assigned to each block, subroutine, and thread. These are really private variables, and they cannot be localized. Each lexical variable occupies a slot in the scratchpad; hence is addressed by an integer index, not a symbol. my variables are like auto variables in C. They're also faster than locals, because they can be allocated at compile time, not runtime. Therefore you cannot declare *x lexically:

    $ perl -e 'my(*x)'
    Can't declare ref-to-glob cast in "my" at -e line 1, near ");"
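The difference between the two kinds of variables shows up when looking into the stash. In this sketch the variable and sub names are hypothetical.

```perl
use strict;
use warnings;

our $package_var = 'in the stash';
my  $lexical_var = 'in the scratchpad';

# Only the package variable has a symbol table entry.
print "package_var has a stash entry\n"  if     exists $main::{package_var};
print "lexical_var has no stash entry\n" unless exists $main::{lexical_var};

# local works on package variables; a lexical cannot be localized.
sub show { return $package_var }
sub demo { local $package_var = 'localized'; return show() }

print demo(), "\n";             # prints "localized"
print show(), "\n";             # prints "in the stash"
```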

See also the Perl man-pages perlguts, perlref, perldsc and perllol.

C++

In C++ we use a flex/bison scanner/parser combination to read Rlist language productions. The C++ parser generates an Abstract Syntax Tree (AST) of double, std::string, std::vector and std::map values. Since each value is put into the AST as a separate object, we use a free store management that allows the allocation of huge amounts of tiny objects.

We also use reference-counted smart-pointers, which allocate themselves on our fast free store. So RAM will not be fragmented, and the allocation of RAM is significantly faster than with the default process heap. Like with Perl, Rlist files can have hundreds of megabytes of data (!), and are processable in constant time, with constant memory requirements. For example, a 300 MB Rlist-file can be read from a C++ process which will not peak over 400-500 MB of process RAM.

BUGS

There are no known bugs; this package is stable. Deficiencies and TODOs:

  • The "deparse" functionality for the "code_refs" compile option has not yet been implemented.

  • The "threads" compile option has not yet been implemented.

  • IEEE 754 notations of Infinity and NaN are not yet implemented.

  • "compile_Perl" is experimental.

COPYRIGHT/LICENSE

Copyright 1998-2008 Andreas Spindler

Maintained at CPAN (http://search.cpan.org/dist/Data-Rlist/) and the author's site (http://www.visualco.de). Please send mail to rlist@visualco.de.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.

Contact the author for the C++ library at rlist@visualco.de.

Thank you for your attention.
