NAME

File::Tabular::Web::Attachments::Indexed - Fulltext indexing in documents attached to File::Tabular::Web

DESCRIPTION

This abstract class adds support for fulltext indexing in documents attached to a File::Tabular::Web application.

Queries into the fulltext index should be passed under the SFT ("search full text") parameter, in addition to the usual S parameter (search in metadata record). So for example

  http://my/app.ftw?S=2007&SFT=perl

will search for records containing the word "2007" and having an attached document that contains the word "perl". Queries can be much more complex, with boolean operators, parentheses, excluded words, etc.; see Search::Indexer and Query::Parser.
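
As an illustration, such a URL can be assembled as sketched below. The helper name "search_url" is hypothetical and not part of File::Tabular::Web, and the second query string is only an example of an excluded word; check Search::Indexer for the syntax actually accepted. The point is that the query must be percent-escaped, since characters like "+" and space have a special meaning in a URL.

```perl
use strict;
use warnings;

# Hypothetical helper: builds a search URL from a base address and
# a list of parameters such as S (metadata) and SFT (full text).
sub search_url {
  my ($base, %params) = @_;

  # Percent-escape everything except RFC 3986 unreserved characters.
  my $escape = sub {
    my $s = shift;
    utf8::encode($s);                       # work on UTF-8 bytes
    $s =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord $1)/ge;
    return $s;
  };

  return $base . '?'
       . join '&', map { "$_=" . $escape->($params{$_}) } sort keys %params;
}

print search_url('http://my/app.ftw', S => '2007', SFT => 'perl'), "\n";
# prints http://my/app.ftw?S=2007&SFT=perl

print search_url('http://my/app.ftw', S => '2007', SFT => '+perl -python'), "\n";
```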

Indexing requires some mechanism to convert attached documents into plain text. This cannot be guessed by the present class, so you should write a subclass that implements such conversions; see the "SUBCLASSING" section below.

RESERVED FIELD NAMES

Records retrieved from a fulltext search will have two additional fields: score (how well the document matched the query) and excerpts (text fragments close to the searched words). Therefore those field names should not be used for regular fields in the data file.

CONFIGURATION

  [fields]
  upload fieldname1
  upload fieldname2 = indexed

Currently only a single upload field can be indexed within a given application.

SUBCLASSING

This class relies on the "indexed_doc_content" method to convert attached documents into plain text, which is a prerequisite for indexing. The default implementation of "indexed_doc_content" just returns the raw file content, which is unlikely to suit your needs; you should therefore write a subclass that overrides this method, and then associate this subclass with your application in the configuration file:

  [application]
  class = My::Subclass::Of::File::Tabular::Web::Attachments::Indexed
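
A minimal sketch of such a subclass is shown below, assuming the attached documents are simple HTML files. The package name and the "html_to_text" helper are hypothetical, and the regex-based conversion is deliberately crude; a real application would more likely delegate to a dedicated converter. The path lookup follows the "upload_fullpath" call documented under "indexed_doc_content".

```perl
package My::Indexed::HtmlAttachments;   # hypothetical subclass name
use strict;
use warnings;

# -norequire only keeps this sketch loadable on its own; in a real
# application write: use parent 'File::Tabular::Web::Attachments::Indexed';
use parent -norequire, 'File::Tabular::Web::Attachments::Indexed';

# Crude HTML-to-text conversion (assumption: attachments are simple HTML).
sub html_to_text {
  my ($html) = @_;
  $html =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;   # drop script/style blocks
  $html =~ s/<[^>]*>/ /g;                            # drop remaining tags
  $html =~ s/&nbsp;/ /g;                             # decode a few entities
  $html =~ s/&lt;/</g;
  $html =~ s/&gt;/>/g;
  $html =~ s/&amp;/&/g;
  $html =~ s/\s+/ /g;                                # collapse whitespace
  $html =~ s/^ | $//g;                               # trim both ends
  return $html;
}

# Override: return the plain text of the document attached to $record.
sub indexed_doc_content {
  my ($self, $record) = @_;

  my $path = $self->upload_fullpath($record, $self->{indexed_field});
  open my $fh, '<', $path or die "open $path: $!";
  local $/;                     # slurp the whole file
  my $html = <$fh>;
  close $fh;

  return html_to_text($html);
}

1;
```

The configuration file would then name My::Indexed::HtmlAttachments as the application class.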

Asynchronous indexing

If your uploaded documents are Microsoft Office or OpenOffice documents, it may be too costly to convert them on the fly while answering the HTTP request. One way to deal with this is to override the "after_add_attachment" and "before_delete_attachment" methods: instead of immediately adding to or deleting from the index, these methods can write indexing requests into an event queue. A separate process then reads the event queue and performs the indexing operations.
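
The event queue itself could be as simple as a locked, line-oriented file. The sketch below is one possible shape, not something shipped with this class: the package My::IndexQueue, its "enqueue" and "drain" functions and the JSON-lines format are all assumptions. The overridden "after_add_attachment" and "before_delete_attachment" methods would call "enqueue" with whatever identifies the document (for instance its path), and the separate process would call "drain" with a handler that performs the actual indexing.

```perl
package My::IndexQueue;   # hypothetical helper, not part of File::Tabular::Web
use strict;
use warnings;
use Fcntl qw(:flock);
use JSON::PP;

my $json = JSON::PP->new->canonical;

# Web side: append one request ("add" or "delete") to the queue file.
sub enqueue {
  my ($queue_file, $action, $doc_id) = @_;
  open my $fh, '>>', $queue_file or die "open $queue_file: $!";
  flock $fh, LOCK_EX               or die "flock $queue_file: $!";
  print {$fh} $json->encode({action => $action, doc => $doc_id}), "\n";
  close $fh                        or die "close $queue_file: $!";
}

# Worker side: read all pending requests, empty the queue, then hand
# each request to $handler. Returns the number of requests processed.
sub drain {
  my ($queue_file, $handler) = @_;
  open my $fh, '+<', $queue_file or return 0;    # nothing queued yet
  flock $fh, LOCK_EX             or die "flock $queue_file: $!";
  my @events = map { $json->decode($_) } <$fh>;
  truncate $fh, 0                or die "truncate $queue_file: $!";
  close $fh;                                     # releases the lock

  $handler->($_) for @events;
  return scalar @events;
}

1;
```

Note that this sketch empties the queue before calling the handler, so requests are lost if the handler dies; a more robust worker would remove each request only after it has been indexed successfully.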

METHODS

app_initialize

Calls the parent method, then records the name of the indexed field in $self->{app}{indexed_field}.

words_queried

Returns a list of words queried either in the S or SFT parameters.

Logs both the S and SFT parameters.

Performs the fulltext search, and combines the results into the usual search string coming from the S parameter.

Calls the parent method and adds a score field into each record.

sort_and_slice

Calls the parent method and adds excerpts of the searched words from attached documents into each record of the slice.

add_excerpts

Finds excerpts of the searched words within attached documents and adds them to the result set.

params_for_next_slice

Returns a string repeating the search parameters, for generating URLs to the next or previous slice.

after_add_attachment

Performs the indexing of the attached document.

before_delete_attachment

Removes the document from the index.

indexed_doc_content

  my $plain_text = $self->indexed_doc_content($record);

Returns the plain text representation of the document attached to $record. To get to the actual file, your implementation can call

  my $path = $self->upload_fullpath($record, $self->{indexed_field});