The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
package SWISH::Filters::Doc2txt;
use strict;
use vars qw( $VERSION @ISA );
$VERSION = '0.18';
@ISA = ('SWISH::Filters::Base');

sub new {
    my ($class) = @_;

    my $self = bless {
        mimetypes => [qr!application/(x-)?msword!]
        ,    # list of types this filter handles
        priority => 50
        ,   # Make this a higher number (lower priority than the wvware filter
    }, $class;

    # check for helpers
    return $self->set_programs('catdoc');

}

sub filter {
    my ( $self, $doc ) = @_;

    my $content = $self->run_catdoc( $doc->fetch_filename ) || return;

    # update the document's content type
    $doc->set_content_type('text/plain');

    # return the document
    return \$content;
}
1;

__END__

=head1 NAME

SWISH::Filters::Doc2txt - Perl extension for filtering MSWord documents with Swish-e

=head1 DESCRIPTION

This is a plug-in module that uses the "catdoc" program to convert MS Word documents
to text for indexing by Swish-e.  "catdoc" can be downloaded from:

    http://www.ice.ru/~vitus/catdoc/ver-0.9.html

The program "catdoc" must be installed and your PATH before running Swish-e.

=head1 BUGS

This filter does not specify input or output character encodings.  This will change in the
future to all use of the user_data to set the encoding.

A minor optimization during spidering (i.e. when docs are in memory instead of on disk)
would be to use open2() call to let catdoc read from stdin instead of from a file.

=head1 AUTHOR

Bill Moseley

=head1 SEE ALSO

L<SWISH::Filter>