File::Tabular::Web::Attachments::Indexed - Fulltext indexing in documents attached to File::Tabular::Web
This abstract class adds support for fulltext indexing in documents attached to a File::Tabular::Web application.
Queries into the fulltext index should be passed under the SFT ("search full text") parameter, in addition to the usual S parameter (search in metadata record). So for example
http://my/app.ftw?S=2007&SFT=perl
will search for records containing the word "2007" whose attached document contains the word "perl". Queries can of course be much more complex, with boolean operators, parentheses, excluded words, etc.; see Search::Indexer and Query::Parser.
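As an illustration of how a client might assemble such a URL, here is a minimal sketch in plain Perl. The helper name search_url is hypothetical and not part of File::Tabular::Web. The boolean query in the usage line is only an assumed syntax, so check Search::Indexer for what is actually accepted. The escaping works byte by byte, so non-ASCII values are assumed to be already encoded (for example as UTF-8).

```perl
use strict;
use warnings;

# Hypothetical helper: build a query URL from a base address and a
# set of parameters such as S (metadata search) and SFT (fulltext search).
sub search_url {
  my ($base, %params) = @_;
  my @pairs;
  for my $name (sort keys %params) {
    # percent-encode everything outside the URL "unreserved" set
    (my $value = $params{$name})
      =~ s/([^A-Za-z0-9\-_.~])/sprintf '%%%02X', ord $1/ge;
    push @pairs, "$name=$value";
  }
  return "$base?" . join '&', @pairs;
}

# Assumed query syntax, for illustration only
print search_url('http://my/app.ftw', S => '2007', SFT => 'perl AND index'), "\n";
```

Parameters are sorted by name only to make the resulting URL predictable; the order presumably does not matter to the application.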
Indexing requires some mechanism to convert attached documents into plain text. This cannot be guessed by the present class, so you should write a subclass that implements such conversions; see the "SUBCLASSING" section below.
Records retrieved from a fulltext search will have two additional fields: score (how well the document matched the query) and excerpts (text fragments surrounding the searched words). Those two names must therefore not be used for regular fields in the data file.
  upload fieldname1
  upload fieldname2 = indexed
Currently only a single upload field can be indexed within a given application.
This class relies on the "indexed_doc_content" method to convert attached documents into plain text, which is a prerequisite for indexing. The default implementation of "indexed_doc_content" just returns the raw file content, so it is most likely unsuited to your needs; you should therefore write a subclass that overrides this method, and then associate this subclass with your application in the configuration file:
  [application]
  class = My::Subclass::Of::File::Tabular::Web::Attachments::Indexed
If your uploaded documents are Microsoft Office or OpenOffice documents, it may be too costly to convert them on the fly while answering the HTTP request. A way to deal with this is to override the "after_add_attachment" and "before_delete_attachment" methods: instead of performing immediate additions to or deletions from the index, these methods can write indexing requests into an event queue. A separate process then reads the event queue and performs the indexing operations.
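The event queue approach could be sketched as follows. This is only an illustration under several assumptions: the class name My::QueuedIndexed, the queue file location, the helpers enqueue and drain_queue, and the one-line "ACTION<TAB>PATH" format are all hypothetical. The sketch also assumes that both hooks receive the record as their first argument, which the present documentation does not state. The "-norequire" flag is there only so that the sketch loads on its own; a real application would load the parent class normally.

```perl
package My::QueuedIndexed;    # hypothetical subclass name
use strict;
use warnings;
use Fcntl qw(:flock);

# -norequire only keeps this sketch loadable on its own;
# drop it in a real application so that the parent class gets loaded.
use parent -norequire, 'File::Tabular::Web::Attachments::Indexed';

our $QUEUE_FILE = '/var/spool/ftw/index_queue.txt';   # hypothetical location

# Append one "ACTION<TAB>PATH" line to the queue, under an exclusive lock.
sub enqueue {
  my ($action, $path) = @_;
  open my $fh, '>>', $QUEUE_FILE or die "cannot open $QUEUE_FILE: $!";
  flock $fh, LOCK_EX             or die "cannot lock $QUEUE_FILE: $!";
  print {$fh} "$action\t$path\n";
  close $fh                      or die "cannot close $QUEUE_FILE: $!";
  return;
}

# Assumption: the record is the first argument after $self.
sub after_add_attachment {
  my ($self, $record) = @_;
  enqueue(ADD => $self->upload_fullpath($record, $self->{indexed_field}));
}

sub before_delete_attachment {
  my ($self, $record) = @_;
  enqueue(DELETE => $self->upload_fullpath($record, $self->{indexed_field}));
}

# For the separate worker process: return the pending requests as
# [ACTION, PATH] pairs and empty the queue, all under the same lock.
sub drain_queue {
  open my $fh, '+<', $QUEUE_FILE or return;    # no queue file yet
  flock $fh, LOCK_EX or die "cannot lock $QUEUE_FILE: $!";
  my @requests = map { chomp; [ split /\t/, $_, 2 ] } <$fh>;
  truncate $fh, 0    or die "cannot truncate $QUEUE_FILE: $!";
  close $fh;
  return @requests;
}

1;
```

The worker would call drain_queue periodically and perform the actual conversion and indexing for each returned pair. Because the overrides do not call the parent methods, no indexing happens during the HTTP request.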
Calls the parent method, then records the name of the indexed field in $self->{app}{indexed_field}.
Returns a list of words queried either in the S or SFT parameters.
Logs both the S and SFT parameters.
Performs the fulltext search, and combines the results into the usual search string coming from the S parameter.
Calls the parent method and adds a score field into each record.
Calls the parent method and adds excerpts of the searched words from attached documents into each record of the slice.
Implementation that finds excerpts of the searched words within attached documents and adds them to the result set.
Returns a string repeating the search parameters, for generating URLs to the next or previous slice.
Performs the indexing of the attached document.
Removes the document from the index.
my $plain_text = $self->indexed_doc_content($record);
Returns the plain text representation of the document attached to $record. To get to the actual file, your implementation can call:
my $path = $self->upload_fullpath($record, $self->{indexed_field});
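For illustration, an override might look like the sketch below, which assumes that the attached documents are HTML files. The class name My::HtmlIndexed and the helper html_to_text are hypothetical, and the conversion is deliberately crude: it drops tags but does not decode character entities. The "-norequire" flag is there only so that the sketch loads on its own; a real application would load the parent class normally.

```perl
package My::HtmlIndexed;      # hypothetical subclass name
use strict;
use warnings;

# -norequire only keeps this sketch loadable on its own;
# drop it in a real application so that the parent class gets loaded.
use parent -norequire, 'File::Tabular::Web::Attachments::Indexed';

# Crude HTML-to-text conversion: drop tags, collapse whitespace.
sub html_to_text {
  my ($html) = @_;
  $html =~ s/<(script|style)\b.*?<\/\1\s*>/ /gis;   # script and style blocks
  $html =~ s/<[^>]*>/ /g;                           # all remaining tags
  $html =~ s/\s+/ /g;                               # collapse whitespace
  $html =~ s/^ | $//g;                              # trim both ends
  return $html;
}

# Override: locate the attached file, read it, return its plain text.
sub indexed_doc_content {
  my ($self, $record) = @_;
  my $path = $self->upload_fullpath($record, $self->{indexed_field});
  open my $fh, '<', $path or die "cannot open $path: $!";
  local $/;                                         # slurp mode
  my $html = <$fh>;
  close $fh;
  return html_to_text($html);
}

1;
```

For binary formats such as office documents, the body of html_to_text would be replaced by a call to whatever converter is available on the server.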