The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

DBIx::KwIndex - create and maintain keyword indices in DBI tables

SYNOPSIS

 package MyKwIndex;
 use DBIx::KwIndex;

 sub document_sub { ... }

 package main;
 $kw = DBIx::KwIndex->new({dbh => $dbh, index_name => 'myindex'})
   or die "can't create index";

 $kw->add_document   ([1,2,3,...]) or die $kw->{ERROR};
 $kw->remove_document([1,2,3,...]) or die $kw->{ERROR};
 $kw->update_document([1,2,3,...]) or die $kw->{ERROR};

 $docs = $kw->search({ words=>'upset stomach' });
 $docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
 $docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
 $docs = $kw->search({ words=>['upset','(bite|stomach)'], re=>1 });

 $kw->add_stop_word(['the','an','am','is','are']) or die $kw->{ERROR};
 $words = $kw->common_word(85);
 $kw->remove_word(['gingko', 'bibola']) or die $kw->{ERROR};

 $ndocs  = $kw->document_count();
 $nwords = $kw->word_count();
 $ndocs  = $kw->match_count({ words=>'upset stomach', boolean=>'OR' });

 $kw->remove_index or die $kw->{ERROR};
 $kw->empty_index  or die $kw->{ERROR};

DESCRIPTION

DBIx::KwIndex is a keyword indexer. It indexes documents and stores the index data in database tables. You can tell DBIx::KwIndex to index [lots] of documents and later on show you which ones contain a certain word. The typical application of DBIx::KwIndex is in a search engine.

How to use this module:

1. Provide a database handle.
 use DBI;
 my $dbh = DBI->connect(...) or die $DBI::errstr;
2. Subclass DBIx::KwIndex and provide a `document_sub' method to retrieve documents referred by an integer id. The method should accept a list of document ids in an array reference and return the documents in an array reference. In this way, you can index any kind of documents that you want: text files, HTML files, BLOB columns, etc., as long as you provide the suitable document_sub() to retrieve the documents. The one thing to remember is that the documents must be referred by unique integer number. Below is a sample of a document_sub() that retrieves document from the 'content' field of a database table.
 package MyKwIndex;
 require DBIx::KwIndex;
 use base 'DBIx::KwIndex';

 sub document_sub {
    my ($self, $ary_ref) = @_;
        my $dbh = $self->{dbh};

    my $result = $dbh->selectall_arrayref(
    'SELECT id,content FROM documents
     WHERE id IN ('. join(',',@$ary_ref). ')');
    
    # if retrieval fails, you should return undef
    defined($result) or return undef;
    
    # now returns the content field in the order of the id's
    # requested. remember to return the documents exactly 
    # in the order requested!
    my %tmp = map { $_->[0] => $_->[1] } @$result;
    return [ @tmp{ @$ary_ref } ];
 }
3. Create the indexer object.
 my $kw = MyKwIndex->new({
          dbh => $dbh,
          index_name => 'article_index',
          # other options...
          });

dbh is the database handle. index_name is the name of the index, DBIx::KwIndex will create several tables which are all prefixed with the index_name. The default index_name is 'kwindex'. Other options include: max_word_length (default 32).

4. Index some documents. You can index one document at a time, e.g.
 $kw->add_document([1]) or die $kw->{ERROR};
 $kw->add_document([2]) or die $kw->{ERROR};

or small batches of documents at a time:

 $kw->add_document([1..10])  or die $kw->{ERROR};
 $kw->add_document([11..20]) or die $kw->{ERROR};

or large batches of documents at a time:

 $kw->add_document([1..300])   or die $kw->{ERROR};
 $kw->add_document([301..600]) or die $kw->{ERROR};

Which one to choose is a matter of memory-speed trade-off. Larger batches will increase the speed of indexing, but with increased memory usage.

Note: DBIx::KwIndex ignores single-character words, numbers, and words longer than 'max_word_length'.

5. If you want to search the index, use the search() method.
 $docs = $kw->search({ words => 'upset stomach' });
 die "can't search" if !defined($docs);

The search() method will return an ARRAY ref containing the document ids that matches the criteria. Other parameter include: num => maximum number of results to retrieve; start => starting position (1 = from the beginning); boolean => 'AND' or 'OR' (default is 'OR'); re => use regular expression, 1 or 0.

Note: num and start uses the LIMIT clause (which is quite unique to MySQL). re uses the REGEXP clause. Do not use these options if your database server does not support them.

Also note: Searching is entirely done from the index. No documents will be retrieved while searching. A simple 'relevancy' ranking is used. Search is case-insensitive and there is no phrase-search support yet.

Some examples:

 # retrieve only the 11th-20th result.
 $docs = $kw->search({ words=>'upset stomach', start=>11, num=>10 });
 die "can't search" if !defined($docs);

 # find documents which contains all the words.
 $docs = $kw->search({ words=>'upset stomach', boolean=>'AND' });
 die "can't search" if !defined($docs);

If you just want to know how many documents match your query, use the match_count() method. If you want to retrieve all the matches anyway, to know how many documents match just find out the size of the match array. That will save you from extra index access.

 # find the number of documents that match the query
 $ndocs = $kw->match_count({ words=>'halitosis' });

 # find the number of matches and retrieve only the first twenty of them
 $query = { words=>'halitosis', num=>20 };
 $ndocs = $kw->match_count($query);
 $docs  = $kw->search($query);

 # search and get the matches. get the number of matches from the
 # result set itself.
 $docs  = $kw->match_count({ words=>'halitosis' });
 $ndocs = @$docs;
6. Now suppose some documents change, and you need to update the index to reflect that. Just use the methods below.
 # if you want to remove documents from index 
 $kw->remove_document([90..100]) or die $kw->{ERROR};

 # if you want to update the index
 $kw->update_document([90..100]) or die $kw->{ERROR};

SOME UTILITY METHODS

If you want to exclude some words (usually very common words, or "stop words") from being indexed, do this before you index any document:

 $kw->add_stop_word(['the','an','am','is','are'])
   or die "can't add stop words";

Adding stop words is a good thing to do, as stop words are not very useful for your index. They occur in a large proportion of documents (they do not help searches differentiate documents) and they increase the size your index (slowing the searches).

But which words are common in your collection? you can use the common_word method:

 $words = $kw->common_word(85);

This will return an array reference containing all the words that occur in at least 85% of all documents (default is 80%).

If you want to delete some words from the index:

 $kw->remove_word(['common','cold']);
   or die "can't remove words";

To get some statistics about your index:

 # the number of documents
 $ndocs = $kw->document_count();
 # the number of words
 $nwords = $kw->word_count();

Last, if you got bored with the index and want to delete it:

 $kw->remove_index or die $kw->{ERROR};

This will delete the database tables. Or, if you just want to empty the index and start all over:

 $kw->empty_index or die $kw->{ERROR};

AUTHOR

Steven Haryanto <steven@haryan.to>

COPYRIGHT

Copyright (c) 2000 Steven Haryanto. All rights reserved.

You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the Perl README file.

BUGS/CAVEATS/TODOS

Enable the module to use other database server (besides MySQL). MySQL-specific SQL bits that need to be adjusted include, but not limited to: LIMIT clause, LOCK/UNLOCK TABLES statements, REPLACE INTO/INSERT IGNORE statements, COUNT(DISTINCT ...) group function, AUTO_INCREMENT and INT/UNSIGNED. (Don't you just hate SQL? :-) Thanks to Edwin Pratomo for pointing these out). Currently I do not need this feature, since I only use MySQL for current projects. Any volunteer?

Use a more correct search sorting (the current one is kinda bogus :).

Probably implement phrase-searching (but this will require a larger vectorlist).

Probably, maybe, implement English/Indonesian stemming.

Any safer, non database-specific way to test existence of tables other than $dbh->tables?

NOTES

At least three other Perl extensions exist for creating keyword indices and storing them in a database: DBIx::TextIndex, MyConText, DBIx::FullTextSearch. You might want to take a look at them before you decide which one better suits your need. Personally I develop DBIx::KwIndex because I want to have a module that: a) is simple and convenient to use; b) supports updating the index without rebuilding it from scratch.

Incidentally, these three extensions and DBIx::KwIndex itself use MySQL specifically. One that could use Interbase or Postgres would perhaps be nice.

Advices/comments/patches welcome.

HISTORY

0001xx=first draft,satunet.com. 000320=words->scalar. 000412=0.01/documentation/cpan. 000902=update doc/fixes(see Changes)