Plucene::SearchEngine::Index - A higher level abstraction for Plucene
    my $indexer = Plucene::SearchEngine::Index->new(
        dir => "/var/lib/plucene"
    );

    my @documents = map { $_->document }
        Plucene::SearchEngine::Index::File->examine("foo.html");

    $indexer->index($_) for @documents;
This module makes it easy to write to Plucene indexes. It does so by providing an interface to the index writer which, in terms of complexity, sits between Plucene::Index::Writer and Plucene::Simple; it also provides a framework of modules for turning data into Plucene::Document objects, so that you don't necessarily have to parse them yourself. See "Document Frontends and Backends" for more on this.
This module is designed to be used with Plucene::SearchEngine::Query; together, the two modules aim to make it easy for anyone to write search engines based on Plucene.
    my $indexer = Plucene::SearchEngine::Index->new(
        dir      => "/var/plucene/foo",
        analyzer => "Plucene::Analysis::SimpleAnalyzer",
    );
This creates a new indexer; you must specify the directory to contain the index, and you may specify an analyzer to tokenize the data.
The index method adds a Plucene::Document to the index.
So far so good, but how do you create these Plucene::Documents? You can, of course, do so manually, but the easiest way is to use the supplied Plucene::SearchEngine::Index::File or Plucene::SearchEngine::Index::URL modules.
These two modules are frontends which gather metadata about a file or URL and then hand the data off to one of the backend modules - there are backends supplied for PDF, HTML and plain text files. These in turn return a list of documents found in the file or URL. In most cases, there'll only be one document, but, for instance, a Unix mbox should return an object for each email in the box. These objects can be turned into Plucene::Document objects by calling the document method on them. This isn't done by default because you may wish to mess with the hash yourself, or serialize it, or whatever.
If you want to handle a different type of file, it's relatively easy to do. All you need to do is create a module called Plucene::SearchEngine::Index::Whatever; this should inherit from Plucene::SearchEngine::Index::Base and supply a gather_data_from_file method. It should also call the register_handler method to state which MIME types and file extensions it can handle.
For instance, suppose we want to create a backend which grabs metadata from an image and indexes that. (Not unlike Plucene::SearchEngine::Index::Image...) We'd start off like this:
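    package Plucene::SearchEngine::Index::Image;
    use strict;
    use warnings;
    use base 'Plucene::SearchEngine::Index::Base';
    use Image::Info qw(image_info html_dim);  # neither function is exported by default
    use Date::Parse;                          # provides str2time
    use Time::Piece;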
Now we register the MIME types and file extensions we can handle:
    __PACKAGE__->register_handler(qw(
        image/bmp  .bmp
        image/gif  .gif
        image/jpeg .jpeg .jpg .jpe
        ...
    ));
And our gather_data_from_file method will call add_data for each bit of metadata it can find:
    sub gather_data_from_file {
        my ($self, $filename) = @_;
        my $info = image_info($filename);
        return if $info->{error};
        $self->add_data("size", "UnStored", scalar html_dim($info));
        $self->add_data("text", "UnStored", $info->{Comment});
        $self->add_data("subtype", "UnStored", $info->{file_ext});
        $self->add_data("created", "Date", Time::Piece->new(
            str2time($info->{LastModificationTime})));
    }
See Plucene::SearchEngine::Index::Base for an explanation of add_data.
Because Plucene::SearchEngine::Index uses a plugin architecture, once this module is installed, it will automatically be called upon to handle those image types it can deal with, without any additional action by the user.
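To make the idea of handler registration concrete, here is a minimal, self-contained sketch of how a registry keyed on MIME types and file extensions could dispatch to a backend class. It is not the code used by Plucene::SearchEngine::Index::Base: the My::Sketch::* package names, the %handler_for hash and the handler_for lookup method are all hypothetical and exist only for illustration.

```perl
use strict;
use warnings;

package My::Sketch::Base;    # hypothetical stand-in for a backend base class

# Hypothetical registry: MIME type or file extension => backend class name.
my %handler_for;

# Record that the calling class handles each of the given keys.
sub register_handler {
    my ($class, @keys) = @_;
    $handler_for{ lc $_ } = $class for @keys;
}

# Find a backend by MIME type first (if one is given), then by extension.
sub handler_for {
    my ($class, $filename, $mime_type) = @_;
    my ($ext) = $filename =~ m{(\.[^./]+)$};
    for my $key ($mime_type, $ext) {
        next unless defined $key;
        my $handler = $handler_for{ lc $key };
        return $handler if defined $handler;
    }
    return;    # no backend has registered for this file
}

package My::Sketch::Image;   # hypothetical backend
our @ISA = ('My::Sketch::Base');

__PACKAGE__->register_handler(qw( image/gif .gif image/jpeg .jpeg .jpg ));

package main;

print My::Sketch::Base->handler_for("photo.JPG"), "\n";
# prints "My::Sketch::Image"
```

In this sketch a backend only has to be loaded for its register_handler call to run, which is one way the "no additional action by the user" behaviour could be achieved; how the real module discovers and loads its plugins is not shown here.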
For certain types of data, such as emails, news articles, or instant messages, you may not want to use the file or URL frontends. Alternatively, if you have a simple piece of data which isn't file-based, you may just want to do everything yourself. Even then, Plucene::SearchEngine::Index::Base can help you to create Plucene::Documents - just inherit from it, and use add_data to add fields to the document in your examine method. See Plucene::SearchEngine::Index::Base for more details.
Plucene::SearchEngine::Index::File, Plucene::SearchEngine::Index::URL, Plucene::SearchEngine::Index::Base, Plucene::SearchEngine::Query, Plucene::Simple.
Simon Cozens simon@cpan.org.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.