Text::Tokenizer - Perl extension for tokenizing text (config) files
    use Text::Tokenizer ':all';

    # open the file and add it to the tokenizer inputs
    open(F_CONFIG, "input.conf") || die("failed to open input.conf");
    $tok_id = tokenizer_new(F_CONFIG);
    tokenizer_options(TOK_OPT_NOUNESCAPE|TOK_OPT_PASSCOMMENT);

    while(1)
    {
        ($string, $tok_type, $line, $err, $errline) = tokenizer_scan();
        last if($tok_type == TOK_ERROR || $tok_type == TOK_EOF);

        # re-add the quote characters around quoted tokens
        if($tok_type == TOK_TEXT)       { }
        elsif($tok_type == TOK_BLANK)   { }
        elsif($tok_type == TOK_DQUOTE)  { $string = "\"$string\""; }
        elsif($tok_type == TOK_SQUOTE)  { $string = "'$string'"; }
        elsif($tok_type == TOK_SIQUOTE) { $string = "`$string'"; }
        elsif($tok_type == TOK_IQUOTE)  { $string = "`$string`"; }
        elsif($tok_type == TOK_EOL)     { $string = "\n"; }
        elsif($tok_type == TOK_COMMENT) { }
        elsif($tok_type == TOK_UNDEF)   { last; }
        else                            { last; };
        print $string;
    }
    tokenizer_delete($tok_id);

A much more complex example of using Text::Tokenizer can be found in passwd_exp, a tool for password expiration notification (http://devel.dob.sk/passwd_exp).
Text::Tokenizer is a very fast lexical analyzer that splits input text from a file or buffer into basic tokens:
NORMAL TEXT
DOUBLE QUOTED "TEXT"
SINGLE QUOTED 'TEXT'
INVERSE QUOTED `TEXT`
SINGLE-INVERSE QUOTED `TEXT'
WHITESPACE TEXT
#COMMENTS
END OF LINE
END OF FILE
Nothing is exported by default. You have to import functions or constants selectively, or use ':all' to import all constants and functions.
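TOK_UNDEF - Undefined token (tokenizer error)
TOK_TEXT - Normal_text
TOK_DQUOTE - "Double quoted text"
TOK_SQUOTE - 'Single quoted text'
TOK_IQUOTE - `Inverse quoted text`
TOK_SIQUOTE - `Single-inverse quoted text'
TOK_BLANK - Whitespace text
TOK_COMMENT - #Comment
TOK_EOL - End of Line
TOK_EOF - End of File
TOK_ERROR - Error Condition (see ERROR_TYPES)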
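When debugging a scanner loop it helps to print token types by name rather than by number. The following is a minimal sketch of such a lookup, using only the token constants that appear in the usage example at the top of this page. The helper name token_name and the %TOKEN_NAME table are hypothetical and not part of the module.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Map each token-type constant to a readable label.
# The trailing () forces a function call, so the constant's value
# (not its name) is used as the hash key.
my %TOKEN_NAME = (
    TOK_UNDEF()   => 'UNDEF',
    TOK_TEXT()    => 'TEXT',
    TOK_DQUOTE()  => 'DQUOTE',
    TOK_SQUOTE()  => 'SQUOTE',
    TOK_IQUOTE()  => 'IQUOTE',
    TOK_SIQUOTE() => 'SIQUOTE',
    TOK_BLANK()   => 'BLANK',
    TOK_COMMENT() => 'COMMENT',
    TOK_EOL()     => 'EOL',
    TOK_EOF()     => 'EOF',
    TOK_ERROR()   => 'ERROR',
);

# Hypothetical helper: return the label for a token type,
# or a placeholder for values missing from the table.
sub token_name {
    my ($type) = @_;
    return $TOKEN_NAME{$type} // "UNKNOWN($type)";
}
```

Inside a scan loop, a line such as `printf "%-8s %s\n", token_name($tok_type), $string;` then shows which kind of token each string came from.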
ERROR_TYPES
No error
Unclosed double quote found
Unclosed single quote found
Unclosed inverse quote found
Failed to allocate tokenizer context (FATAL ERROR)
Default option set; equal to TOK_OPT_NOUNESCAPE.
Set no options. The tokenizer keeps its default behaviour: it does not unescape anything and does not pass comments to you.
Disable unescaping of characters and lines.
Enable looking for `single-inverse quote' combination.
Unescape chars & lines.
Unescape chars (inside of quotes only)
Unescape lines (inside of quotes only)
Enable comment passing to user routines.
Unescape lines (outside of quotes). An escaped end of line will not terminate value processing, so escaped multiline text is returned as a single-line string.
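Options are combined with bitwise OR, as the usage example at the top of this page does with TOK_OPT_NOUNESCAPE and TOK_OPT_PASSCOMMENT. The sketch below uses that same combination to collect the comment tokens of a file. The function name list_comments is hypothetical, and the code assumes that tokenizer_new accepts a lexical file handle in the same way as the bareword handle in the usage example.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Hypothetical helper: return the string of every comment token in $path.
sub list_comments {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "failed to open $path: $!";
    # Assumption: a lexical handle works like the bareword handle
    # used in the usage example.
    my $tok_id = tokenizer_new($fh);

    # Without TOK_OPT_PASSCOMMENT, comments are not passed to the caller.
    tokenizer_options(TOK_OPT_NOUNESCAPE | TOK_OPT_PASSCOMMENT);

    my @comments;
    while (1) {
        my ($string, $type) = tokenizer_scan();
        last if $type == TOK_EOF || $type == TOK_ERROR || $type == TOK_UNDEF;
        push @comments, $string if $type == TOK_COMMENT;
    }

    # Release the tokenizer's reference to the file once scanning is done.
    tokenizer_delete($tok_id);
    close($fh);
    return @comments;
}
```

A call such as `print "$_\n" for list_comments("input.conf");` would then print one comment per line.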
Set tokenizer options.
Create a new tokenizer instance (context) from FILE_HANDLE. The instance is identified by the returned $tok_id.
Create a new tokenizer instance from the string BUFFER, which is LENGTH characters long. Returns the tokenizer instance id.
Scan the current tokenizer instance and return the first token found, as the list @tok = ($string, $type, $line, $error, $error_line).
Test if tokenizer instance exists.
Switch to another tokenizer instance (like when you perform include statement).
Delete the tokenizer instance. You have to do this exactly on EOF to release the tokenizer's reference to the file or buffer.
Flush the tokenizer instance. This function discards the contents of the instance's buffer, so the next time the scanner attempts to match a token from the buffer, it has to refill it first.
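The create, scan, and delete steps described above can be wrapped into one routine that also reports errors. This is only a sketch: the name scan_file and the hash layout of the returned tokens are hypothetical, a lexical file handle is assumed to be accepted by tokenizer_new, and the fourth and fifth values returned by tokenizer_scan are assumed to hold the error code and the line of the error, following the description of the returned list.

```perl
use strict;
use warnings;
use Text::Tokenizer ':all';

# Hypothetical helper: tokenize the file at $path and return an
# array reference of { string, type, line } hashes.
sub scan_file {
    my ($path) = @_;

    open(my $fh, '<', $path) or die "failed to open $path: $!";
    my $tok_id = tokenizer_new($fh);    # assumption: lexical handle accepted

    my @tokens;
    while (1) {
        my ($string, $type, $line, $err, $errline) = tokenizer_scan();
        last if $type == TOK_EOF;

        if ($type == TOK_ERROR || $type == TOK_UNDEF) {
            # Clean up before reporting, so the instance is not leaked.
            tokenizer_delete($tok_id);
            close($fh);
            $err     //= 'unknown';
            $errline //= $line // 'unknown';
            die "$path: tokenizer error (code $err) at line $errline\n";
        }

        push @tokens, { string => $string, type => $type, line => $line };
    }

    # Delete the instance on EOF to release its reference to the file.
    tokenizer_delete($tok_id);
    close($fh);
    return \@tokens;
}
```

Wrapping the call in `eval { ... }` lets the caller decide whether a tokenizer error is fatal for the whole program.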
This tokenizer is based on code generated by flex - fast lexical analyzer generator (http://lex.sourceforge.net).
Samuel Behan, (http://devel.dob.sk)
Copyright 2003-2011 by Samuel Behan
This library is free software; you can redistribute it and/or modify it under the terms of GNU/GPL v3.