Search::Tokenizer - Decompose a string into tokens (words)
  # generic usage
  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer->new(
     regex     => qr/.../,
     filter    => sub { ... },
     stopwords => {word1 => 1, word2 => 1, ... },
     lower     => 1,
  );
  my $iterator = $tokenizer->($string);
  while (my ($term, $len, $start, $end, $index) = $iterator->()) {
    ...
  }

  # usage for DBD::SQLite (with builtin tokenizers: word, word_locale,
  # word_unicode, unaccent)
  use Search::Tokenizer;
  $dbh->do("CREATE VIRTUAL TABLE t "
          ." USING fts3(tokenize=perl 'Search::Tokenizer::unaccent')");
This module builds an iterator function that will progressively extract terms from a given input string. Terms are defined by a regular expression (for example qr/\w+/). Extraction of terms relies on Perl's builtin "global match" operator (the /g flag), and is therefore quite efficient.
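The scalar-context /g match loop that this technique relies on can be sketched in plain Perl, without the module (offsets are obtained from the builtin @- and @+ arrays):

```perl
use strict;
use warnings;

# Minimal sketch of term extraction with the scalar-context /g match:
# each call to the match resumes where the previous one stopped (pos()).
my $string = "Once upon a time";
my $regex  = qr/\w+/;

my @terms;
while ($string =~ /$regex/g) {
    # $& is the matched term; $-[0] and $+[0] are its start/end offsets
    push @terms, [ $&, $-[0], $+[0] ];
}
# @terms now holds ["Once",0,4], ["upon",5,9], ["a",10,11], ["time",12,16]
```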
Before being returned to the caller, terms may be filtered by an auxiliary function, for performing tasks such as stemming or stopword elimination.
A tokenizer returned from the new method is a code reference, not a regular Perl object. To use the tokenizer, just call it with a string to parse: this returns another code reference, which works as an iterator. Each call to the iterator returns the next term from the string, until the string is exhausted.
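This "code reference returning a code reference" pattern can be sketched as follows (a simplified stand-in, not the module's actual implementation):

```perl
use strict;
use warnings;

# Simplified sketch of the tokenizer/iterator pattern: the outer sub
# is the "tokenizer", the inner closure is the per-string iterator.
my $tokenizer = sub {
    my ($string) = @_;
    return sub {            # iterator: one term per call, undef when done
        return $string =~ /\w+/g ? $& : undef;
    };
};

my $iterator = $tokenizer->("hello brave world");
my @terms;
while (defined(my $term = $iterator->())) {
    push @terms, $term;
}
# @terms is now ("hello", "brave", "world")
```

The match position (pos) is stored on the captured lexical $string, which is why successive iterator calls resume where the previous match stopped.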
This API was explicitly designed for integrating Perl with the FTS3 fulltext search engine in DBD::SQLite; however, the API is general enough to be useful for other purposes, which is why it is published in its own, separate distribution.
  my $tokenizer = Search::Tokenizer->new($regex);
  my $tokenizer = Search::Tokenizer->new(%options);
Builds a new tokenizer, returned as a code reference. The first syntax, with a single Regexp argument, is a shorthand for ->new(regex => $regex). The second syntax, with named arguments, has the following available options:
regex => $regex
$regex is a compiled regular expression that specifies how to match a term; that regular expression should not match the empty string (otherwise the tokenizer would enter an infinite loop). The default is qr/\p{Word}+/. Here are some examples of more advanced regexes:
  # perl's basic notion of "word"
  $regex = qr/\w+/;

  # take 'locale' into account
  $regex = do {use locale; qr/\w+/};

  # words like "don't", "it's" are treated as a single term
  $regex = qr/\w+(?:'\w+)?/;

  # same thing but also with internal hyphens like "fox-trot"
  $regex = qr/\w+(?:[-']\w+)?/;
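For instance, the apostrophe-aware regex above keeps "don't" as a single term, whereas the basic \w+ splits it in two (a quick stand-alone illustration):

```perl
use strict;
use warnings;

my $text = "don't stop";

my @basic = $text =~ /\w+/g;            # ("don", "t", "stop")
my @apos  = $text =~ /\w+(?:'\w+)?/g;   # ("don't", "stop")
```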
lower => $bool
If true, the term returned by the $regex is converted to lowercase (more precisely, it is "case-folded" through "fc" in Unicode::CaseFold). This option is activated by default.
filter => $filter
$filter is a reference to a function that may modify or cancel a term before it is returned to the caller. The filter takes a single argument (the term) and returns a scalar (the modified term). If the value returned from the filter is empty, the term is canceled.
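A filter is just a sub mapping a term to a (possibly empty) replacement. Here is a hypothetical toy "stemmer" that strips a plural -s; real applications would use something like Lingua::Stem::Snowball:

```perl
use strict;
use warnings;

# Hypothetical toy filter: strip a final "s" (crude plural removal).
# Returning an empty string would cancel the term altogether.
my $filter = sub {
    my ($term) = @_;
    $term =~ s/s$//;
    return $term;
};

my @filtered = map { $filter->($_) } qw(cats dog horses);
# @filtered is now ("cat", "dog", "horse")
```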
filter_in_place => $filter
Like filter, except that the filtering function directly modifies the term in its $_[0] argument instead of returning a new term. This is useful for example when building a filter from Lingua::Stem::Snowball or from Text::Transliterator::Unaccent.
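An in-place filter exploits the fact that Perl aliases @_ to the caller's arguments, so assigning to $_[0] modifies the term directly; a minimal sketch:

```perl
use strict;
use warnings;

# In-place filter: modifying $_[0] changes the caller's variable,
# because @_ elements are aliases to the actual arguments.
my $filter_in_place = sub { $_[0] =~ tr/A-Z/a-z/ };

my $term = "HELLO";
$filter_in_place->($term);
# $term is now "hello"
```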
stopwords => $hashref
The keys in $hashref are terms to cancel (usually common terms for which indexing would consume lots of resources with little added value). Values in the hash should evaluate to true. Lists of stopwords for various languages may be found in the Lingua::StopWords module. Stopword filtering is applied after the filter or filter_in_place function (if any).
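Such a hash is simply a set of term => true pairs; a hand-rolled equivalent of what Lingua::StopWords provides:

```perl
use strict;
use warnings;

# A hand-rolled stopword hash (Lingua::StopWords builds such hashes
# for many languages); terms present as keys are canceled.
my %stopwords = map { $_ => 1 } qw(a an the of once upon);

my @terms = grep { !$stopwords{$_} } qw(once upon a time);
# @terms is now ("time")
```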
Whenever a term is canceled through the filter or stopwords options, the tokenizer does not return that term to the client, but nevertheless remembers the canceled position: so for example, when tokenizing "Once upon a time" with
  $tokenizer = Search::Tokenizer->new(
    stopwords => Lingua::StopWords::getStopWords('en')
  );
we get the term sequence
  ("upon", 4, 5, 9, 1)
  ("time", 4, 12, 16, 3)
where terms "once" and "a" in positions 0 and 2 have been canceled, so the only remaining terms are in positions 1 and 3.
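This position accounting can be reproduced with a small stand-alone sketch (not the module's code): canceled terms still consume an index.

```perl
use strict;
use warnings;

# Stand-alone sketch: canceled stopwords still consume a position index.
my %stopwords = (once => 1, a => 1);
my $string    = "Once upon a time";

my @result;
my $index = -1;
while ($string =~ /\w+/g) {
    my ($term, $start, $end) = (lc $&, $-[0], $+[0]);
    $index++;                      # every term, kept or not, gets an index
    next if $stopwords{$term};     # canceled, but index already advanced
    push @result, [ $term, length($term), $start, $end, $index ];
}
# @result is ( ["upon",4,5,9,1], ["time",4,12,16,3] )
```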
  my $iterator = $tokenizer->($text);

  # loop over terms ..
  while (my $term = $iterator->()) {
    work_with_term($term);
  }

  # .. or loop over terms with detailed information
  while (my @term_details = $iterator->()) {
    work_with_details(@term_details); # ($term, $len, $start, $end, $index)
  }
The tokenizer takes one string argument and returns an iterator. The iterator takes no argument; each call returns the next term from the string, until the string is exhausted, at which point the iterator returns an empty result.
If called in scalar context, the iterator returns just a string; if called in list context, it returns a tuple composed of:
the term (after filtering);
the length of this term;
the starting offset in the string where this term was found;
the end offset. This is also the place where the search for the next term will start;
the position of this term within the string, starting at 0.
Length and start/end offsets are computed in characters, not in bytes. Note for SQLite users: the C layer in SQLite needs byte values, but the conversion is automatically taken care of by the C implementation in DBD::SQLite.
Beware that ($end - $start) is the length of the original extracted term, while $len is the length of the final $term, after filtering; both lengths may differ, especially if stemming is being applied.
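The scalar-vs-list behaviour can be sketched with wantarray (a simplified stand-in for the module's iterator):

```perl
use strict;
use warnings;

# Sketch: an iterator that returns just the term in scalar context,
# or the full ($term, $len, $start, $end, $index) tuple in list context.
my $make_iterator = sub {
    my ($string) = @_;
    my $index = -1;
    return sub {
        $string =~ /\w+/g or return;   # empty list / undef when exhausted
        $index++;
        my ($term, $start, $end) = ($&, $-[0], $+[0]);
        return wantarray ? ($term, length($term), $start, $end, $index)
                         : $term;
    };
};

my $it1     = $make_iterator->("to be");
my $term    = $it1->();      # scalar context: "to"
my $it2     = $make_iterator->("to be");
my @details = $it2->();      # list context: ("to", 2, 0, 2, 0)
```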
For convenience, the following tokenizers are builtin:
Search::Tokenizer::word
Terms are "words" according to Perl's notion of \w+.
Search::Tokenizer::word_locale
Terms are "words" according to Perl's notion of \w+ under use locale.
Search::Tokenizer::word_unicode
Terms are "words" according to Unicode's notion of \p{Word}+.
Search::Tokenizer::unaccent
Like Search::Tokenizer::word_unicode, but filtered through Text::Transliterator::Unaccent to replace all accented characters by their base character.
These builtin tokenizers may take the same arguments as new(); for example:
  use Search::Tokenizer;
  my $tokenizer = Search::Tokenizer::unaccent(lower => 0, stopwords => ...);
  my @tokens = Search::Tokenizer::unroll($iterator, $no_details);
This utility function returns the list of all tokens obtained from repeated calls to $iterator. The $no_details argument is optional; if true, the results are just strings, instead of tuples with positional information.
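The behaviour of unroll can be sketched in plain Perl (a stand-in, assuming an iterator that returns tuples in list context):

```perl
use strict;
use warnings;

# Sketch of an unroll(): drain an iterator into a list, returning
# either full tuples or bare terms depending on $no_details.
sub unroll {
    my ($iterator, $no_details) = @_;
    my @tokens;
    while (my @t = $iterator->()) {
        push @tokens, $no_details ? $t[0] : [@t];
    }
    return @tokens;
}

# toy iterator over fixed ($term, $len, $start, $end, $index) tuples
my @queue = ( ["foo", 3, 0, 3, 0], ["bar", 3, 4, 7, 1] );
my $iterator = sub { my $t = shift @queue or return; @$t };

my @words = unroll($iterator, 1);
# @words is now ("foo", "bar")
```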
Other tokenizers on CPAN: KinoSearch::Analysis::Tokenizer and Search::Tools::Tokenizer.
Stopwords: Lingua::StopWords
Stemming: Lingua::Stem::Snowball
Removing accented characters: Text::Transliterator::Unaccent
Laurent Dami, <dami@cpan.org>
Copyright 2010, 2021 Laurent Dami.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
To install Search::Tokenizer, copy and paste the appropriate command into your terminal.

With cpanm:

  cpanm Search::Tokenizer

With the CPAN shell:

  perl -MCPAN -e shell
  install Search::Tokenizer

For more information on module installation, please visit the detailed CPAN module installation guide.