Module Version: 1.000

NAME

Search::Tools::Tokenizer - split a string into meaningful tokens

SYNOPSIS

 use Search::Tools::Tokenizer;
 my $tokenizer = Search::Tools::Tokenizer->new();
 my $tokens = $tokenizer->tokenize('quick brown red dog');
 while ( my $token = $tokens->next ) {
     # token isa Search::Tools::Token
     print "token = $token\n";
     printf("str: %s, len = %d, u8len = %d, pos = %d, is_match = %d, is_hot = %d\n",
        $token->str,
        $token->len, 
        $token->u8len, 
        $token->pos, 
        $token->is_match, 
        $token->is_hot
     );
 }

DESCRIPTION

A Tokenizer object splits a string into Tokens based on a regex. Tokenizer is used primarily by the Snipper class.

METHODS

Most of Search::Tools::Tokenizer is written in C/XS, so if you view the source of this class you will not see much code. Look at the source for Tools.xs and search-tools.c if you are interested in the internals.

This class inherits from Search::Tools::Object. Only new or overridden methods are documented here.

BUILD

Called by new().

re([ regex ])

Get/set the regex used by tokenize() and tokenize_pp(). Typically you set this once in new(). The default value is:

 qr/\w+(?:'\w+)*/

which will match words and contractions (e.g., "do", "not" and "don't").
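
As a quick check of what the default pattern matches, the following sketch applies it with plain Perl, without loading Search::Tools at all. The helper name words_in() and the sample string are illustrative assumptions, not part of the module.

```perl
use strict;
use warnings;

# The default pattern documented for re().
my $re = qr/\w+(?:'\w+)*/;

# Hypothetical helper (not part of Search::Tools): return every
# substring of $str that matches the default pattern.
sub words_in {
    my ($str) = @_;
    # With no capture groups, a /g match in list context returns
    # the full text of each match.
    my @words = $str =~ /$re/g;
    return @words;
}

print join( '|', words_in("do not, don't!") ), "\n";    # prints: do|not|don't
```

The apostrophe is only accepted between word characters, so the contraction stays in one piece while the comma and exclamation mark are left out.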

tokenize( string [, heat_seeker, match_num] )

Returns a TokenList object representing the Tokens in string. string is "split" according to the regex in re().

heat_seeker can be either a CODE reference or a regex object (qr//) to use for testing is_hot per token. An example CODE reference:

 my $tokens = $tokenizer->tokenize('foo bar', sub { 
    my ($token) = @_;
    # do something with token during initial iteration
 },);

match_num is the number of the parenthesized capture group in the re() value whose match is treated as the token. The default is 0 (the entire matching pattern).
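
To illustrate only the capture-group numbering that match_num refers to, here is a minimal plain-Perl sketch. The helper matches_for_group() and the pattern qr/#(\w+)/ are assumptions made up for this example; how the XS code applies match_num internally is not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Search::Tools: collect the text of
# capture group $match_num for every match of $re in $str.
# Group 0 means the entire match.
sub matches_for_group {
    my ( $str, $re, $match_num ) = @_;
    my @found;
    while ( $str =~ /$re/g ) {
        # Skip matches in which the requested group did not participate.
        next unless defined $-[$match_num];
        my $len = $+[$match_num] - $-[$match_num];
        push @found, substr( $str, $-[$match_num], $len );
    }
    return @found;
}

my $tag_re = qr/#(\w+)/;    # illustrative pattern with one capture group
print join( ',', matches_for_group( 'see #foo and #bar', $tag_re, 0 ) ), "\n";    # prints: #foo,#bar
print join( ',', matches_for_group( 'see #foo and #bar', $tag_re, 1 ) ), "\n";    # prints: foo,bar
```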

tokenize_pp( string )

Returns a TokenListPP object.

A pure-Perl implementation of tokenize(). Mostly written so you can see what the XS algorithm does, if you are so inclined, and so the author could benchmark the two implementations and thereby feel some satisfaction at having spent the time writing the XS/C version (2-3x faster than Perl).
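
For readers who want the general idea without reading the source, the sketch below walks a string with a regex and records both matching and non-matching runs. It is not the module's tokenize_pp(): the function name sketch_tokenize(), the hash keys, and the assumption that text between matches becomes its own non-matching token are all illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the module's tokenize_pp(): split $str into
# a list of hashes, one per run of matching or non-matching text.
sub sketch_tokenize {
    my ( $str, $re ) = @_;
    my @tokens;
    my $last_end = 0;
    while ( $str =~ /$re/g ) {
        my ( $start, $end ) = ( $-[0], $+[0] );
        next if $end == $start;    # ignore zero-length matches
        # Text between the previous match and this one is a non-match.
        if ( $start > $last_end ) {
            push @tokens,
                { str => substr( $str, $last_end, $start - $last_end ), offset => $last_end, is_match => 0 };
        }
        push @tokens,
            { str => substr( $str, $start, $end - $start ), offset => $start, is_match => 1 };
        $last_end = $end;
    }
    # Trailing text after the final match.
    if ( $last_end < length $str ) {
        push @tokens, { str => substr( $str, $last_end ), offset => $last_end, is_match => 0 };
    }
    return \@tokens;
}

for my $t ( @{ sketch_tokenize( "don't panic!", qr/\w+(?:'\w+)*/ ) } ) {
    printf "%-8s offset=%d is_match=%d\n", "[$t->{str}]", $t->{offset}, $t->{is_match};
}
```

Concatenating the str values in order reproduces the original string, which is a convenient sanity check for any tokenizer built this way.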

get_offsets( string, regex )

Returns an array ref of pos() values for start offsets of regex within string.
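
The idea of collecting start offsets can be sketched in plain Perl as follows. This is an illustration under assumptions, not the module's own code: sketch_offsets() is a hypothetical name, and it reads the start of each match from $-[0] rather than reproducing whatever get_offsets() does internally.

```perl
use strict;
use warnings;

# Hypothetical stand-in for illustration only: return an array ref
# holding the start offset of each match of $re in $str.
sub sketch_offsets {
    my ( $str, $re ) = @_;
    my @offsets;
    while ( $str =~ /$re/g ) {
        push @offsets, $-[0];    # $-[0] is where the current match starts
    }
    return \@offsets;
}

my $found = sketch_offsets( 'foo bar foo', qr/foo/ );
print "@$found\n";    # prints: 0 8
```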

set_debug( n )

Turns on XS debugging. By default, setting debug(1) (which is inherited from Search::Tools::Object) is not sufficient to trigger the XS debugging. Use set_debug() if you want lots of info on stderr.

AUTHOR

Peter Karman <karman@cpan.org>

BUGS

Please report any bugs or feature requests to bug-search-tools at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Search-Tools. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Search::Tools

COPYRIGHT

Copyright 2009 by Peter Karman.

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
