The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Search::Kinosearch - Search Engine Library

WARNING

Search::Kinosearch is ALPHA test software. Aspects of the interface are almost certain to change, given the suite's complexity. Users should not count on compatibility of files created by Kinosearch when future versions are released -- any time you upgrade Kinosearch (or possibly, a module such as Lingua::Stem::Snowball on which Kinosearch depends), you should expect to regenerate all kindex files from scratch.

SYNOPSIS

First, write an application to build a 'kindex' from your document collection. (A 'kindex' is a Kinosearch index.)

    use Search::Kinosearch::Kindexer;
    
    my $kindexer = Search::Kinosearch::Kindexer->new(
        -mainpath => '/foo/bar/kindex',
        );
    
    for my $field ('title', 'bodytext') {
        $kindexer->define_field(
            -name      => $field,
            -lowercase => 1,
            -tokenize  => 1,
            -stem      => 1,
            );
    }

    while (my ($title, $bodytext) = each %docs) {
        my $doc = $kindexer->new_doc( $title );
        
        $doc->set_field( title    => $title    );
        $doc->set_field( bodytext => $bodytext );
        
        $kindexer->add_doc( $doc );
    }
    
    $kindexer->generate;
    $kindexer->write_kindex;

Then, write a second application to search the kindex:

    use Search::Kinosearch::KSearch;
    use Search::Kinosearch::Query;
     
    my $ksearch = Search::Kinosearch::KSearch->new(
        -mainpath   => '/foo/bar/kindex',
        );
    
    my $query = Search::Kinosearch::Query->new(
        -string     => 'this AND NOT (that OR "the other thing")',
        -lowercase  => 1,
        -tokenize   => 1,
        -stem       => 1,
        );
    $ksearch->add_query( $query );
    $ksearch->process;
    
    while (my $hit = $ksearch->fetch_hit_hashref) {
        print "$hit->{title}\n";
    }

DESCRIPTION

Primary Features

  • Match 'any' or 'all' search terms

  • Match phrases

  • Boolean operators AND, OR, and AND NOT

  • Support for parenthetical groupings of arbitrary depth

  • Prepended +plus or -minus to require or negate a term

  • Sort results by relevance or by datetime

  • Stemming

  • Algorithmic selection of relevant excerpts

  • Hilighting of search terms in excerpts, even when stemmed

  • Fast, efficient algorithms for both indexing and searching

  • Field-specific queries

  • Works well with large or small document collections

  • Incremental indexing

  • mod_perl optimizations

  • High quality ranking algorithm based on term frequency / inverse document frequency (tf/idf)

General Introduction

Search::Kinosearch (hereafter abreviated 'Kinosearch') is a search engine library -- it handles the tricky, behind-the-scenes work involved in running a search application. It is up to you how to present the search results to the end-user.

Kinosearch has two main parts: Search::Kinosearch::Kindexer and Search::Kinosearch::KSearch (hereafter abbreviated 'Kindexer' and 'KSearch'). When you want to know which pages of a book are most relevant for a given subject (e.g. avocados, Aesop, Alekhine's Defense...) you look up the term in the book's index and it tells you which pages may be of interest to you; the 'kindex' produced by Kinosearch's Kindexer performs an analogous role -- using the interface tools provided by the KSearch module, you consult the kindex to find which documents within the document collection Kinosearch considers most relevant to your query.

[The Search::Kinosearch module itself doesn't do very much, and does nothing on its own; this documentation is an overview which ties together the various components of the Kinosearch suite.]

The Kindexer thinks of your documents as database rows, each with as many fields as you define. HTML documents might be parsed, prepared, and fed to the Kindexer like so:

  1. Store the URL in a kindex field called 'url'.

  2. Store the text surrounded by the <title> tags in a kindex field called 'title';

  3. Isolate portions of the document that are not relevant to content (such as navigation panels or advertisements) and remove them.

  4. Strip all html tags.

  5. Store what's left in a field called 'bodytext'.

Most of the time, you will want to take advantage of three crucial functions performed by the Kindexer, executed on a per-field basis depending on the parameters passed to define_field():

-lowercase

This does exactly what you would expect - lc the text to be indexed (but leave the copy of the text to be stored intact). If you select the same option for your queries at search-time, your searches will be case-insensitive.

-tokenize

Tokenizing breaks up input text into an array of words (roughly speaking). If you tokenize the text "Dr. Strangelove, or How I Learned to Stop Worrying and Love the Bomb", then searches for "Strangelove" or "Bomb" will match. If you don't tokenize it, then only a search for the complete string "Dr. Strangelove, or How I Learned to Stop Worrying and Love the Bomb" will match.

-stem

Stemming reduces words to a root form. For instance, "horse", "horses", and "horsing" all become "hors" -- so that a search for 'horse' will also match documents containing 'horses' and 'horsing'. For more information, see the documentation for Lingua::Stem.

Once you have the completed the indexing phase, you will need a search application. Most often, this takes the form of a CGI script accessed via web browser. The interface design requirements of such an app will be familiar to anyone who's surfed the web.

Getting started

If you want to get started right away, copy the sample applications out of Search::Kinosearch::Tutorial, get them functioning, and then swap in some of your own material.

You may wish to consult the documentation for Search::Kinosearch::Kindexer, Search::Kinosearch::KSearch and Search::Kinosearch::Query before continuing on to the next section.

Fine Tuning Kinosearch

Performance Optimizations

The bottleneck in search applications is the ranking and sorting of a large number of relevant documents. Minimizing the time it takes to return results is a central design goal of Kinosearch.

The single most important optimization available for Kinosearch apps is to store either some or all of the kindex files in ram -- in particular, the files stored within the directory specified by the '-freqpath' parameter. These files contain the data upon which has the greatest impact upon search-time speed. Storing the other kindex files in ram won't hurt, but will not yield anywhere near the same benefit.

An additional search-time optimization is available when running Kinosearch applications under mod_perl. See the Search::Kinosearch::Kindex documentation for details.

Stoplists

A "stoplist" is collection of "stopwords": words which are common enough to be of little use when determining search results. For example, so many documents in English contain "the", "if", and "maybe" that it typically improves both performance and relevance to block them at search time. (Some search engine libaries also exclude stopwords during the indexing process to save space; Kinosearch always indexes all terms, because if it did not, phrase matching, excerpt generation, and excerpt highlighting would break.) It is possible to disable the default stoplist used by KSearch or specify a different one, if you so desire.

Ranking Algorithm

Kinosearch uses a variant of the well-established "Term Frequency / Inverse Document Frequency" weighting scheme. Explaining TF/IDF is beyond the scope of this documentation, but in a nutshell:

  • in a search for "skate park", documents which score well for the comparatively rare term "skate" will rank higher than documents which score well for the common term "park".

  • a 10-word text which has one occurrence each of both "skate" and "park" will rank higher than a 1000-word text which also contains one occurrence of each.

A web search for "tf idf" will turn up many excellent explanations of the algorithm.

Field Weights

Kinosearch will allow you to weight the title of a document more heavily than the bodytext -- or vice versa -- by assigning a weight to each field at either index-time or search-time. However, the multipliers don't have to be that large, because under TF/IDF a single occurrence of a word within the 10-word title will automatically contribute more to the score of a document than a single occurrence of the same word in the 1000-word bodytext. See the documentation for Kindexer's define_field() method for more details.

TO DO

  • Refine the API.

  • Increase use of data compression.

  • Dispense with DB_File altogether.

  • Implement a composite index format.

  • Enable multi-lingual support.

  • Enable customizable tokenizing and stemming.

  • Implement docrank.

  • Complete support for UTF-8. At present, everything works so long as the encoding of all supplied files is ASCII compatible and is consistent across all files.

  • Add unicode normalization.

  • Add support for other encodings.

  • Implement -cache_level option for Search::Kinosearch::Kindexer, in order to allow user control over memory requirements.

  • Implement descending term weighting.

  • Implement alternative backends.

  • Add support for 64-bit timestamps.

  • Enable support for psuedo-stemming using Lingua::Spelling::Alternative for languages where no Snowball stemmer is available.

BUGS

  • Excerpt may be up to 20 characters longer than requested.

FILES

Kinosearch requires several other CPAN distributions:

SEE ALSO

ACKNOWLEDGEMENTS

Chris Nandor has been helpful with debugging.

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.