Module Version: 0.04

NAME

IO::Tokenized - Extension of Perl for tokenized input

SYNOPSIS

  #Functional interface

  use IO::Tokenized qw/:parse/;
  
  open FOO,"<","some/input/file" or die "Can't open 'some/input/file': $!";
  setparser(\*FOO,[num => qr/\d+/],
                  [ident => qr/[a-z_][a-z0-9_]*/],
                  [op => qr![+*/-]!,\&opname]);
  
  while (my ($tok,$val) = gettoken(\*FOO)) {
    # ... do something smart ...
  }

  close(FOO);

ABSTRACT

Defines an extension to Perl filehandles that allows splitting the input stream according to regular expressions.

DESCRIPTION

IO::Tokenized defines a set of functions that allow tokenized input from Perl filehandles. In this alpha version, tokens are specified by passing a list of token specifications to the initialize_parsing function. Each token specification is (a reference to) an array containing the token name (a string), a regular expression defining the token and, optionally, an action function that calculates the value to be returned when a token matching the regexp is found.

Once the tokens have been specified, each invocation of the gettoken function returns a pair consisting of a token name and a token value, or undef at end of file.

IO::Tokenized can also be used as a base class to add tokenized input methods to the object modules in the IO::* namespace. As an example, see the IO::Tokenized::File module, which is included in this distribution.

RATIONALE

Lexical analysis, which is a fundamental step in all parsing, mainly consists of decomposing an input stream into small chunks called tokens. The tokens are in turn defined by regular expressions.

As Perl is good at handling regular expressions, one would expect writing a lexical analyser in Perl to be easy. In truth it is not, and tools like lex and flex have even been ported to Perl. There are also a whole lot of ad-hoc lexers for different parsing modules/programmes.

Now, approaches to lexical analysis like those underlying Parse::Lex and Parse::Flex are general but fairly complex to use, while ad-hoc solutions are obviously, well... ad-hoc.

What I had always sought was a way to tell a filehandle: "Well, this is what the chunks I'm interested in look like. Please find them in your input stream." It seems a simple enough thing, but I could not find a module that does it.

Obviously, impatience pushed me to implement such a module, but until not long ago I had no real need for it, so laziness spoke against it. Then I started to write a compiler for a scripting language and began using the Parse::RecDescent module. There, in the documentation, Damian Conway brings up regular expressions on streams.

Why, regular expressions on streams were exactly what I had in mind, so hubris kicked in and I wrote this module and its companion IO::Tokenized::File.

FUNCTIONS

The following functions are defined by the IO::Tokenized module.

EXPORTS

IO::Tokenized does not export any function by default, but all the functions mentioned above are exportable. Besides the usual :all, there are two more export tags: :parse, which exports initialize_parsing, gettoken and gettokens, and :buffer, which exports bufferspace, flushbuffer and resynch.

OBJECT ORIENTED

All the functions described above can be called in an object-oriented way. For constructing IO::Tokenized objects, a new method is provided, which is basically a wrapper around initialize_parsing.

SEE ALSO

IO::Tokenized::File.

TOKENS SPECIFICATION

Tokens are specified, either to the new constructor or to the setparser mutator, by a list of token definitions. Each token definition is (a reference to) an array with two or three elements. The first element is the token name, the second is the regexp defining the token itself, and the third, if present, is the action function.
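
The shape of a token definition can be illustrated without the module itself. The following is only a sketch: check_token_specs, %opname and the sample definitions are hypothetical names introduced for this example and are not part of IO::Tokenized. It simply checks that each definition is an array reference holding a name, a compiled regexp and an optional code reference.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of IO::Tokenized: checks that each
# definition has the [name, regexp, optional action] shape described above.
sub check_token_specs {
    my @specs = @_;
    for my $spec (@specs) {
        die "token definition must be an array reference\n"
            unless ref($spec) eq 'ARRAY';
        die "token definition needs two or three elements\n"
            unless @$spec == 2 || @$spec == 3;
        my ($name, $re, $action) = @$spec;
        die "token name must be a plain string\n"
            if !defined($name) || ref($name);
        die "token '$name': second element must be a qr// regexp\n"
            unless ref($re) eq 'Regexp';
        die "token '$name': action must be a code reference\n"
            if defined($action) && ref($action) ne 'CODE';
    }
    return scalar @specs;
}

# Example definitions: two without an action, one with a made-up action
# that maps an operator character to a name.
my %opname = ('+' => 'plus', '-' => 'minus', '*' => 'times', '/' => 'divide');
my @definitions = (
    [num   => qr/\d+/],
    [ident => qr/[a-z_][a-z0-9_]*/],
    [op    => qr![+*/-]!, sub { $opname{ $_[0] } }],
);

my $count = check_token_specs(@definitions);
print "$count token definitions look well formed\n";
```

Whether the module performs any such validation is not stated here, so treat the checks as one reasonable reading of the two-or-three-element rule.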

ACTION FUNCTIONS

As stated above, the user can associate a function with each token, called the action of the token. The action serves two purposes: it calculates the value of the token and completes the verification of the match. The action function specified in [token => $re, \&foo] will be called with the result of @item = $buffer =~ /($re)/s. The default action is simply to pop @_, thus giving the text that matched $re.
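
To make the two roles of an action concrete, here is a minimal sketch in plain Perl. It does not use or reproduce the module's code: apply_action, %reserved, $ident_action and $hex_action are hypothetical names, and the match is anchored with ^ as in the matching strategy described later. One action computes a value from the matched text; the other rejects a match by returning undef.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: match $re at the start of
# $buffer and hand the captures to the action.
sub apply_action {
    my ($buffer, $re, $action) = @_;
    my @captures = $buffer =~ /^($re)/s or return;   # no match at all
    # Default action as described in the text: pop the argument list.
    $action ||= sub { pop @_ };
    return $action->(@captures);
}

# An action that completes the verification of the match: reserved words
# match the identifier regexp but are rejected by returning undef.
my %reserved = map { $_ => 1 } qw(if else while);
my $ident_action = sub {
    my ($text) = @_;
    return $reserved{$text} ? undef : $text;
};

# An action that calculates the token value from the matched text.
my $hex_action = sub {
    my ($text) = @_;
    return hex($text);
};

for my $case (
    ['0x1F rest', qr/0x[0-9a-fA-F]+/,    $hex_action],
    ['count = 1', qr/[a-z_][a-z0-9_]*/,  $ident_action],
    ['while (1)', qr/[a-z_][a-z0-9_]*/,  $ident_action],
) {
    my ($text, $re, $action) = @$case;
    my $value = apply_action($text, $re, $action);
    print "'$text' gives ", (defined $value ? $value : 'no token'), "\n";
}
```

Note that with the default action, a regexp containing its own capture groups would yield the last captured group rather than the whole match, since pop takes the final element of the list.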

MATCHING STRATEGY

The gettoken function uses the following method to find the token to be returned.

1. Remove from the beginning of the internal buffer any strings matching the skip regular expression, as set by the token_separator function. In doing so, it can read more lines from the file into the buffer.
2. Consider the token definitions in the order they were passed to setparser. If a token is defined by regexp $re, check that the buffer matches /^($re)/. If it does not, move on to the next token if there is one, or to step 4 below if there is none.
3. If there is a user-defined action for the token, apply it. If it returns undef, move on to the next token if there is one, or to step 4 below if there is none. If the return value is defined, return a pair formed by the token name and the value itself. If there is no user-defined action, return a pair consisting of the token name and the matched string. Before returning, the buffer is updated by removing the matched string.
4. If no match could be found, try reading one more line into the buffer and go back to step 2. If, on entering step 4, the internal buffer holds more characters than the limit fixed by buffer_space, then gettoken croaks.
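
The four steps can be sketched in plain Perl as a closure over a filehandle. This is an illustration under stated assumptions, not the module's implementation: make_tokenizer and its arguments (skip regexp, buffer limit, token definitions) are hypothetical names, and croaking on unrecognized input at end of file is a choice made for this sketch, since the text does not say what happens in that case.

```perl
use strict;
use warnings;
use Carp qw(croak);

# A sketch of the four steps above; all names here are illustrative and
# none of this is IO::Tokenized's actual code.
sub make_tokenizer {
    my ($fh, $skip, $max_buffer, @defs) = @_;
    my $buffer = '';

    # Append one more line to the buffer; false at end of file.
    my $read_more = sub {
        my $line = readline($fh);
        return 0 unless defined $line;
        $buffer .= $line;
        return 1;
    };

    return sub {
        while (1) {
            # Step 1: strip separators from the front, reading more input
            # whenever the buffer runs empty.
            while (1) {
                $buffer =~ s/^(?:$skip)+//;
                last if length($buffer);
                $read_more->() or return;    # end of file: no more tokens
            }
            # Steps 2 and 3: try the definitions in the order given.
            for my $def (@defs) {
                my ($name, $re, $action) = @$def;
                my @captures = $buffer =~ /^($re)/s or next;
                my $matched = $captures[0];
                my $value = $action ? $action->(@captures) : $matched;
                next unless defined $value;  # the action rejected the match
                substr($buffer, 0, length($matched)) = '';
                return ($name, $value);
            }
            # Step 4: nothing matched. Give up if the buffer is too long,
            # otherwise read one more line and try again.
            croak "no token found within $max_buffer characters"
                if length($buffer) > $max_buffer;
            $read_more->() or croak "unrecognized input at end of file";
        }
    };
}

# Usage on an in-memory file, so the example needs no external input.
my $input = "x1 + 42\n* y\n";
open my $in, '<', \$input or die "Can't open in-memory file: $!";
my $next_token = make_tokenizer($in, qr/\s+/, 1024,
    [num   => qr/\d+/],
    [ident => qr/[a-z_][a-z0-9_]*/],
    [op    => qr![+*/-]!],
);
while (my ($tok, $val) = $next_token->()) {
    print "$tok: $val\n";
}
close $in;
```

Because the definitions are tried in order, the first one whose regexp matches at the start of the buffer wins, even when a later definition would match a longer string. A regexp that can match the empty string would return a token without consuming input, so such definitions are best avoided with a loop like this one.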

CAVEATS

BUGS

Please remember that this is an alpha version of the module, and will stay so until the version number gets to 1.00. This means that there surely are plenty of bugs which haven't been discovered yet, all the more so because testing is far from complete.

Bug reports are welcome and should be addressed directly to the author at the address below.

TODO

There is still a lot of work to do on this module, both at the programming level and at the conceptual level. Feature requests as well as insights are welcome.

AUTHOR

Leo "TheHobbit" Cacciari, <hobbit@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2003 by Leo Cacciari

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.