NAME

Regexp::Keywords - A regexp builder to test against keywords lists

VERSION

Version 0.03

SYNOPSIS

This module lets you test a list of keywords against a simple query that combines keywords with AND, OR and NOT operators and grouping.

    use Regexp::Keywords;
    my $kw = Regexp::Keywords->new();
    
    my $wanted = 'comedy + ( action , romance ) - thriller';
    $kw->prepare($wanted);
    
    my $movie_tags = 'action,comedy,crime,fantasy,adventure';
    print "Buy ticket!\n" if $kw->test($movie_tags);

Keywords, also known as tags, are used to classify things in a category. Many tags can be assigned at the same time to an item, even if they belong to different available categories.

In real life, keywords lists are found in:

CONSTRUCTOR

new ( )

Creates a Keywords object. Some attributes can be initialized from the constructor:

See Attributes for a description of these attributes.

Example: To create a Keywords object that will be used to test strings with mixed case keywords:

  my $kw = Regexp::Keywords->new(ignore_case => 1);

BUILDING METHODS

$kw->prepare( $query )

Parses a query and builds a regexp pattern to be used for testing keywords strings. Dies on malformed query expressions. See Query Expressions later in this document.

$kw->set( attribute => value [, ...] )

The following attributes can be changed after the object creation:

Dies on unknown attributes. See Attributes section for a description of each attribute.

Note: Some of these attributes invalidate the associated regexp if it was already built, so an automatic reparse or rebuild is done after all the specified attributes have been changed. For the same reason, it is better to call set once with many parameters than to set them one at a time.

Note: It's not recommended to modify the attributes directly from the object, or you could get unexpected results if the query is not parsed or built again.

$kw->get( 'attribute' )

This method returns the current value for the specified attribute. Dies on unknown attributes. See Attributes section for a list of available attributes.

$kw->reparse( )

If any of the object's attributes change, a reparse of the source query may be required, depending on the affected attribute. Dies on bad queries.

$kw->rebuild( )

If any of the object's attributes change, a rebuild of the regexp may be required, depending on the affected attribute. Dies on bad parsed queries.

KEYWORDS TESTING METHODS

$kw->test( $keyword_list )

Returns true if the list matches the parsed query, otherwise returns false. Dies if no query has been parsed yet.

$kw->grep( @list_of_kwlists )

Returns an array containing only the keywords lists that match the parsed query. Dies if no query has been parsed yet.

To select hash keys or array indexes instead of the lists themselves, use grep_keys (described next):

  @selected_keys = $kw->grep_keys(map {$_ => $table{$_}[$col]} keys %table);

  @selected_indexes = $kw->grep_keys(map {$_ => $array[$_]} 1 .. $#array);

$kw->grep_keys( %hash_of_kwlists )

Returns an array of keys from a hash when their corresponding values satisfy the query. Dies if no query has been parsed yet.
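
As a rough plain-Perl picture of this kind of selection, the sketch below filters hash keys by matching each value against a regexp. The %movies data, the select_keys helper and the hand-written pattern are illustrative assumptions standing in for a prepared query; none of them is part of the module.

```perl
use strict;
use warnings;

# Hypothetical data: item name => keywords list.
my %movies = (
    alien  => 'horror,scifi,thriller',
    amelie => 'comedy,romance',
    rush   => 'action,comedy',
);

# Hand-written stand-in for the regexp of a one-keyword query "comedy".
my $comedy = qr/\bcomedy\b/;

# Return the sorted keys whose keywords list matches the pattern.
sub select_keys {
    my ($re, %lists) = @_;
    my @hits = grep { $lists{$_} =~ $re } keys %lists;
    return sort @hits;
}

print join(', ', select_keys($comedy, %movies)), "\n";
```

With the real module, the pattern would come from the prepared query rather than being written by hand.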

EXPORTED FUNCTION

The following function can be imported and accessed directly from your program.

keywords_regexp( $query [, $ignore [, $multi [, $partial [, $texted ] ] ] ] )

Returns a regular expression (qr/.../) for a query, against which keywords list strings can be tested.

See the Attributes section for a description of the attributes behind the corresponding parameters, and for the default values used when they are omitted.
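
To give a feel for how a boolean query can be expressed as a single regular expression, here is a hand-written sketch that uses lookahead assertions. It is only an assumption about one possible shape for such a pattern; the regexp the module actually generates may look quite different. The wanted helper is hypothetical.

```perl
use strict;
use warnings;

# Hand-written pattern (not the module's output) for the query
#   comedy & ( action | romance ) & !thriller
# Each AND term becomes a lookahead; NOT becomes a negative lookahead.
my $re = qr/
    ^
    (?= .* \b comedy \b )
    (?= .* \b (?: action | romance ) \b )
    (?! .* \b thriller \b )
/x;

# Return 1 if the keywords list satisfies the pattern, 0 otherwise.
sub wanted {
    my ($tags) = @_;
    return $tags =~ $re ? 1 : 0;
}

for my $tags ('action,comedy,crime,fantasy,adventure',
              'comedy,romance,thriller',
              'action,crime') {
    printf "%-40s %s\n", $tags, wanted($tags) ? 'match' : 'no match';
}
```

Because every condition is anchored at the start of the string and only looks ahead, the order of the keywords in the list does not matter.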

ATTRIBUTES

Object's attributes can control how to parse a query, build a regular expression or test strings.

Although it is possible to access them using $kw->{attribute}, it is better to read them with $kw->get() and change them with $kw->set(), because some validations are done to keep things consistent.

ignore_case

Defines if the regexp should be case (in)sensitive.

Defaults to case sensitive (a value of 0). Set it to 1 to make the regexp case insensitive.

Note: Changing this parameter with $kw->set() after the regexp has been built causes the regexp to be rebuilt from parsed_query.
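
In plain regexp terms, the difference is comparable to the /i modifier. The sketch below uses hand-written patterns and a hypothetical verdict helper; it illustrates the effect only and says nothing about how the module builds its regexp.

```perl
use strict;
use warnings;

my $tags = 'Action,Comedy';

my $sensitive   = qr/\bcomedy\b/;     # comparable to ignore_case => 0
my $insensitive = qr/\bcomedy\b/i;    # comparable to ignore_case => 1

# Describe whether a keywords list matches a pattern.
sub verdict {
    my ($list, $re) = @_;
    return $list =~ $re ? 'match' : 'no match';
}

print "case sensitive:   ", verdict($tags, $sensitive),   "\n";
print "case insensitive: ", verdict($tags, $insensitive), "\n";
```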

multi_words

This attribute controls whether the keywords list may include many words as a single keyword.

The default (0) is to treat each word as a keyword. When this attribute is 1, the keywords list may include many words as a single keyword. When it is set to 2, the delimiter between words is not a space. To search for such a keyword, write the words between quotes in the query string.

Note: Changing this parameter with $kw->set() after the regexp has been built causes the regexp to be rebuilt from parsed_query.

Note: When set to 0 or 2, a query with strings in quotes could match a keyword list if each word is present in the list, side by side in the same order.

parsed_query

Contains the query in the internal boolean format, which is required to build the regexp.

partial_words

By default (a value of 0), only words that match exactly return true when a keywords list is tested. Set this attribute to 1 if you want to match lists whose keywords contain words from the query.

For example, "word" will match if a list contains "words", but "query" won't match "queries".
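
The difference can be pictured with two hand-written patterns: one delimited by word boundaries and one without them. The exact_match and partial_match helpers are hypothetical and assume that "partial" means plain containment; the module's own pattern may be stricter.

```perl
use strict;
use warnings;

# Whole-word match: the query word must be a complete word in the list.
sub exact_match {
    my ($list, $word) = @_;
    return $list =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# Partial match (assumed to mean containment): the query word may be
# part of a longer word in the list.
sub partial_match {
    my ($list, $word) = @_;
    return $list =~ /\Q$word\E/ ? 1 : 0;
}

my $list = 'words,queries';
for my $word ('word', 'query') {
    printf "%-6s exact=%d partial=%d\n",
        $word, exact_match($list, $word), partial_match($list, $word);
}
```

"query" fails even in partial mode because "queries" does not contain the letters "query" in sequence.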

Note: Changing this parameter with $kw->set() after the regexp has been built causes the regexp to be rebuilt from parsed_query.

Note: Setting both partial_words and multi_words to 1 could give unexpected test results, because in a multi-word keyword only the first and last words are treated as partial strings, and only on their outer side.

query

Contains the original query in the free-style syntax.

regexp

Contains the regular expression built for the object's query. It's a qr/.../ value!

texted_ops

AND, OR and NOT operators are represented by punctuation characters. In the default mode (0), any of those words used in a query is treated as a keyword to match in the keywords list. Set this attribute to 1 to allow the words AND, OR and NOT to be used as operators in query expressions instead of as keywords to match.

Note: Changing this attribute with $kw->set() after a regexp has been built forces the query to be reparsed into parsed_query and the regexp to be rebuilt.
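
One way to picture the effect of texted operators is as a rewrite of the words into their punctuation equivalents. The texted_to_symbols helper below is a hypothetical sketch: it assumes upper-case operator words only and does not protect quoted multi-word keywords, so it should not be read as the module's parsing logic.

```perl
use strict;
use warnings;

# Replace the operator words with the punctuation operators.
# Assumption: only upper-case AND, OR and NOT count as operators.
sub texted_to_symbols {
    my ($query) = @_;
    $query =~ s/\bAND\b/&/g;
    $query =~ s/\bOR\b/|/g;
    $query =~ s/\bNOT\b/!/g;
    return $query;
}

print texted_to_symbols('comedy AND ( action OR romance ) AND NOT thriller'), "\n";
```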

KEYWORD LISTS

A keyword is a combination of letters, underscores and numbers (/\w+/ pattern). Sometimes more than one word is used to create a keyword, with a space between the words.

Keyword lists are string values with words, usually delimited by comma or any other punctuation sign. Spaces may also appear surrounding them.

There is no validation for field names inside a keywords list. In fact, those names are also treated as keywords by themselves (see Tricks).

QUERY EXPRESSIONS

A query expression is a list of keywords with some operators surrounding them to provide simple boolean conditions.

Query expressions are in the form of:

  term1 & term2              # AND operator
  term1 | term2              # OR operator
  !term1                     # NOT operator
  "term one"                 # multi-word keyword
  term1 & ( term2 | term3 )  # Grouping changes precedence

All spaces are optional in query expressions, except for those in multi-word keywords when quoted.

Expression Terms

A term is one of the following:

Operators

To allow the words "AND", "OR" and "NOT" to be treated as operators, set the texted_ops attribute to 1.

Grouping

Precedence is as usual: NOT has the highest, then AND, and OR has the lowest.

Precedence order in a query expression can be changed with the use of parentheses. For example:

  word1 | word2 & word3

is the same as:

  word1 | ( word2 & word3 )

but not as:

  ( word1 | word2 ) & word3

where word3 is required together with either word1 or word2.
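
Perl's own && and || operators have the same relative precedence (AND before OR), so a small truth table can illustrate the difference. The ungrouped and or_first helpers are hypothetical names used only for this sketch.

```perl
use strict;
use warnings;

# word1 | word2 & word3, with && binding tighter than ||.
sub ungrouped {
    my ($w1, $w2, $w3) = @_;
    return ($w1 || $w2 && $w3) ? 1 : 0;
}

# ( word1 | word2 ) & word3
sub or_first {
    my ($w1, $w2, $w3) = @_;
    return (($w1 || $w2) && $w3) ? 1 : 0;
}

for my $w1 (0, 1) {
    for my $w2 (0, 1) {
        for my $w3 (0, 1) {
            printf "word1=%d word2=%d word3=%d  ungrouped=%d  or_first=%d\n",
                $w1, $w2, $w3,
                ungrouped($w1, $w2, $w3), or_first($w1, $w2, $w3);
        }
    }
}
```

The two columns differ exactly when word1 is present and word3 is absent.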

It is possible to use NOT for a whole group, so the following two queries mean the same:

  +word1 -(word2,word3)
  +word1 -word2 -word3
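
The equivalence is De Morgan's law: excluding a group of alternatives is the same as excluding each alternative. The sketch below checks this on a few sample lists with hand-written whole-word tests; has, grouped and expanded are hypothetical helpers, not module functions.

```perl
use strict;
use warnings;

# True if $word appears as a whole word in the keywords list.
sub has {
    my ($list, $word) = @_;
    return $list =~ /\b\Q$word\E\b/ ? 1 : 0;
}

# +word1 -(word2,word3)
sub grouped {
    my ($list) = @_;
    return (has($list, 'word1')
        && !(has($list, 'word2') || has($list, 'word3'))) ? 1 : 0;
}

# +word1 -word2 -word3
sub expanded {
    my ($list) = @_;
    return (has($list, 'word1')
        && !has($list, 'word2')
        && !has($list, 'word3')) ? 1 : 0;
}

for my $list ('word1', 'word1,word2', 'word1,word3', 'word2,word3', 'word1,word4') {
    printf "%-12s grouped=%d expanded=%d\n", $list, grouped($list), expanded($list);
}
```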

Expression groups can be nested. Also, "[...]", "{...}" and "<...>" can be used just like "(...)", but there is no validation that brackets are balanced by type, i.e. all of them get translated into the same kind before the validation that detects an orphan one.
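
A minimal sketch of that behaviour, assuming the bracket types are first normalized to parentheses and then checked for orphans. The brackets_ok helper is hypothetical, ignores quoted keywords, and is not the module's actual validation code.

```perl
use strict;
use warnings;

# Return 1 if every opening bracket has a closing one, 0 otherwise.
# Bracket types are not distinguished once they are normalized.
sub brackets_ok {
    my ($query) = @_;

    # Translate [ { < into ( and ] } > into ).
    (my $normalized = $query) =~ tr/[{<]}>/((()))/;

    my $depth = 0;
    for my $char (split //, $normalized) {
        $depth++ if $char eq '(';
        if ($char eq ')') {
            return 0 if --$depth < 0;    # orphan closing bracket
        }
    }
    return $depth == 0 ? 1 : 0;          # 0 means an orphan opening bracket
}

for my $query ('a & (b | c)', 'a & [b | c)', 'a & (b | c', 'a ) & ( b') {
    printf "%-14s %s\n", $query, brackets_ok($query) ? 'ok' : 'orphan bracket';
}
```

Note that the mismatched pair in 'a & [b | c)' passes, which is the lack of per-type validation described above.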

Tricks

INTERNAL BOOLEAN FORMAT

Queries in the free-style format are parsed and translated into a strict internal format. Note that the space character is not allowed.

The elements of this format are:

Examples:

  tom&jerry|sylvester&tweety
  moe&(shemp|curly|joe)&larry
  popeye&olive&(!bluto&!brutus)
  hagar^the^horrible|popeye^the^sailor

Examples of bad queries:

  tom&jerry,sylvester&tweety
  moe(shemp|curly|joe)larry
  popeye&olive&!(bluto|brutus)
  ^the^

KNOWN LIMITATIONS

Currently, only ASCII chars are supported. No UTF-8, no Unicode, no accented vowels, no Kanji... Sorry!

AUTHOR

Victor Parada, <vitoco at cpan.org>

BUGS

Please report any bugs or feature requests to bug-regexp-keywords at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Regexp-Keywords. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Regexp::Keywords

You can also look for information at:

ACKNOWLEDGEMENTS

Thanks to the Monks from the Monastery at http://www.perlmonks.org/.

COPYRIGHT & LICENSE

Copyright 2009 Victor Parada.

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

See http://dev.perl.org/licenses/ for more information.
