NAME

SWISH::Prog - information retrieval application framework

SYNOPSIS

  use SWISH::Prog;
  my $program = SWISH::Prog->new(
    invindex    => 'path/to/myindex',
    aggregator  => 'fs',
    indexer     => 'native',
    config      => 'some/swish/config/file',
    filter      => sub { print $_[0]->url . "\n" },
  );
                
  $program->run('some/dir');
  
  print $program->count . " documents indexed\n";
          

DESCRIPTION

SWISH::Prog is a full-text search framework based on Swish-e (http://swish-e.org/).

SWISH::Prog tries to fill a niche similar to Data::SearchEngine or DBI: providing a uniform and flexible interface to several different search engine tools and libraries.

SWISH::Prog does not try to replace the use of the underlying search engine tools, but instead tries to fill in some usability gaps and, like the DBI, make it relatively easy to switch between backend tools without needing to re-write an entire codebase.

SWISH::Prog implements all five basic components of a search application:

Aggregator

Gather a document collection. A collection might be a group of HTML pages, or XML documents, or rows in a database. A collection might originate from the web, a filesystem, a database, an email inbox, or anywhere bytes are stored. An Aggregator gathers those documents in a uniform way.

SWISH::Prog provides Aggregators for filesystems, email trees, web spidering, and databases, among others. See SWISH::Prog::Aggregator and its subclasses.

Normalizer

Documents come in a variety of formats (MIME types). A Normalizer turns those disparate types into something text-based and parseable. SWISH::Prog uses SWISH::Filter to normalize documents.

Parser/Analyzer

Documents are tokenized into "words" with attention to position, context, length, encoding, and linguistic quality (stemming, case, stopwords, etc.).

With the exception of the Native classes, SWISH::Prog uses SWISH::3 to parse HTML and XML documents (the most common normalized format for SWISH::Filter), and then delegates further analysis (tokenization, etc.) to backend tools or libraries.

Indexer

Each SWISH::Prog::Indexer subclass fronts an information retrieval (IR) tool or library that implements its own proprietary, highly optimized inverted index storage system that preserves the intelligence of the Parser/Analyzer.

For example, the SWISH::Prog::Lucy::Indexer is a wrapper around Lucy::Index::Indexer. SWISH::Prog::Native::Indexer is a wrapper around the swish-e tool.

Searcher

Like the Indexer, each SWISH::Prog::Searcher subclass delegates the searching of the inverted index to the backend IR tool or library.

For example, the SWISH::Prog::Lucy::Searcher is a wrapper around Lucy::Search::PolySearcher. SWISH::Prog::Native::Searcher is a wrapper around the SWISH::API::More module.
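The SYNOPSIS covers indexing only, so here is a minimal sketch of the search side using the Native backend. The constructor argument and the search(), next(), score() and uri() calls are assumptions drawn from the SWISH::Prog::Searcher family's documentation rather than from this page, so check them against your installed version. print_hits is a hypothetical helper name, and the index path and query are placeholders.

```perl
use strict;
use warnings;

# Hypothetical helper: run a query against an existing index and print hits.
sub print_hits {
    my ( $invindex, $query ) = @_;

    # Loaded at run time so the script still compiles where the module is absent.
    require SWISH::Prog::Native::Searcher;

    # Assumed API: new() accepts invindex, search() returns a results object.
    my $searcher = SWISH::Prog::Native::Searcher->new( invindex => $invindex );
    my $results  = $searcher->search($query);

    # Assumed API: next() yields one result at a time; score() and uri()
    # are per-result accessors.
    my $n = 0;
    while ( my $result = $results->next ) {
        printf "%4d %s\n", $result->score, $result->uri;
        $n++;
    }
    return $n;
}

# Usage: perl print_hits.pl path/to/myindex 'foo bar'
print_hits(@ARGV) if @ARGV == 2;
```

The helper does nothing unless it is given both an index path and a query, so the file can be loaded safely before an index exists.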

BACKGROUND

The name "SWISH::Prog" comes from the Swish-e -S prog feature. "prog" is short for "program". SWISH::Prog makes it easy to write indexing and search programs.

SWISH::Prog started as a way of making the swish-e binary tool easier to integrate into Perl applications, and has since been expanded into a full implementation of Swish3, with alternate backend libraries (KinoSearch, Xapian, Apache Lucy, etc.) filling the Indexer and Searcher roles.

METHODS

All of the following methods may be overridden when subclassing this module.

init

Overrides base SWISH::Prog::Class init() method.

filter( CODE ref )

Set in new(). See SWISH::Prog::Doc.

Example:

 my $prog = SWISH::Prog->new(
     filter => sub {
         my $doc = shift;

         # alter url (dots escaped so they match literally)
         my $url = $doc->url;
         $url =~ s/my\.foo\.com/my.bar.org/;
         $doc->url( $url );

         # alter content
         my $buf = $doc->content;
         $buf =~ s/foo/bar/gi;
         $doc->content( $buf );
     },
 );

The filter value can also be the name of a file that evals to a CODE ref.
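One way to keep a filter in its own file is sketched below. It assumes the file's last expression is an anonymous sub, and it evaluates the file with Perl's built-in do FILE; how SWISH::Prog itself loads the file is not described here. StubDoc is a hypothetical stand-in for SWISH::Prog::Doc, included only so the sketch runs without an index.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical stand-in for SWISH::Prog::Doc with combined get/set accessors.
package StubDoc {
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub url     { my $self = shift; $self->{url}     = shift if @_; return $self->{url} }
    sub content { my $self = shift; $self->{content} = shift if @_; return $self->{content} }
}

# Write a filter file whose last expression is a CODE ref.
my ( $fh, $filter_file ) = tempfile( SUFFIX => '.pl', UNLINK => 1 );
print {$fh} <<'FILTER';
sub {
    my $doc = shift;
    my $buf = $doc->content;
    $buf =~ s/foo/bar/gi;
    $doc->content($buf);
};
FILTER
close $fh or die "can't close $filter_file: $!";

# Evaluate the file and confirm that it produced a CODE ref.
my $filter = do $filter_file;
die "can't load $filter_file: " . ( $@ || $! ) unless defined $filter;
die "$filter_file did not return a CODE ref" unless ref($filter) eq 'CODE';

# Apply the loaded filter to a stub document.
my $doc = StubDoc->new( url => 'http://my.foo.com/a', content => 'Foo and foo' );
$filter->($doc);
print $doc->content, "\n";    # prints "bar and bar"
```

Checking the return value before use catches the common mistake of a filter file that ends in a statement other than the anonymous sub.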

aggregator( $swish_prog_aggregator )

Get the SWISH::Prog::Aggregator object. You should set this in new().

aggregator_opts

Get the hashref of options passed internally to the aggregator constructor.

indexer_opts

Get the hashref of options passed internally to the indexer constructor.

run( collection )

Execute the program. This is an alias for index().

index( collection )

Add items in collection to the invindex().

config

Returns the aggregator's config() object.

invindex

Returns the indexer's invindex.

indexer

Returns the indexer.

count

Returns the indexer's count. NOTE: This is the number of documents actually indexed. It excludes documents that the aggregator considered and discarded. If you want the number of documents the aggregator looked at, regardless of whether they were indexed, use the aggregator's count() method.
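A short sketch of reading both counts after a run. It uses the constructor arguments from the SYNOPSIS plus the count() and aggregator() methods described in this section. report_counts is a hypothetical helper name, the paths are placeholders, and leaving out config assumes that argument is optional.

```perl
use strict;
use warnings;

# Hypothetical helper: index a collection, then report both counts.
sub report_counts {
    my ( $invindex, $collection ) = @_;

    # Loaded at run time so the script still compiles where SWISH::Prog is absent.
    require SWISH::Prog;

    my $program = SWISH::Prog->new(
        invindex   => $invindex,
        aggregator => 'fs',
        indexer    => 'native',
    );
    $program->run($collection);

    my $indexed = $program->count;                # documents actually indexed
    my $seen    = $program->aggregator->count;    # documents the aggregator looked at
    printf "%d of %d documents indexed\n", $indexed, $seen;
    return ( $indexed, $seen );
}

# Usage: perl report_counts.pl path/to/myindex some/dir
report_counts(@ARGV) if @ARGV == 2;
```

A gap between the two numbers shows how many documents the aggregator looked at that did not end up in the index.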

test_mode

Dry-run mode: prints information to stderr but does not build an index. This flag is set in new() and passed to the indexer and aggregator.

AUTHOR

Peter Karman, <perl@peknet.com>

BUGS

Please report any bugs or feature requests to bug-swish-prog at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=SWISH-Prog. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc SWISH::Prog

COPYRIGHT AND LICENSE

Copyright 2008-2009, 2012 by Peter Karman

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

SEE ALSO

http://swish-e.org/

SWISH::Prog::Doc, SWISH::Prog::Headers, SWISH::Prog::Indexer, SWISH::Prog::InvIndex, SWISH::Prog::Utils, SWISH::Prog::Aggregator, SWISH::Prog::Config