Module Version: 0.07000

NAME

File::Extract - Extract Text From Arbitrary File Types

SYNOPSIS

  use File::Extract;
  my $e = File::Extract->new();
  my $r = $e->extract($filename);

  my $e = File::Extract->new(encodings => [...]);

  my $class = "MyExtractor";
  File::Extract->register_processor($class);

  my $filter = MyCustomFilter->new;
  File::Extract->register_filter($mime_type => $filter);

DESCRIPTION

File::Extract is a framework for extracting text from arbitrary file types, which is useful when collecting data for indexing.

CLASS METHODS

register_processor($class)

Registers a new text extractor. The processor becomes the default processor for its MIME type, but it can be overridden per instance by passing the 'processors' parameter to new().

The specified class needs to implement two methods:

mime_type(void)

Returns the MIME type that $class can extract files from.

extract($file)

Extracts the text from $file. Returns a File::Extract::Result object.
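As a rough illustration of the two methods listed above, a processor class might look like the sketch below. The class name MyExtractor is taken from the SYNOPSIS; the MIME type 'text/x-mytype' is made up. The new() constructor and the named arguments passed to File::Extract::Result->new() are assumptions that this documentation does not confirm, so check the module source before relying on them.

```perl
package MyExtractor;    # hypothetical class name, as in the SYNOPSIS
use strict;
use warnings;

# Assumption: the framework may instantiate the processor, so a plain
# constructor is provided. The text above does not state this.
sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# The MIME type this processor handles ('text/x-mytype' is made up).
sub mime_type { 'text/x-mytype' }

# Read $file and return its content wrapped in a result object.
sub extract {
    my ($self, $file) = @_;

    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $text = do { local $/; <$fh> };    # slurp the whole file
    close $fh;

    # Loaded at run time so this class compiles on its own.
    # Assumption: the constructor accepts these named arguments.
    require File::Extract::Result;
    return File::Extract::Result->new(
        text      => $text,
        filename  => $file,
        mime_type => $self->mime_type,
    );
}

package main;

# Registration as described above (needs File::Extract installed):
#   File::Extract->register_processor('MyExtractor');
```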

register_filter($mime_type, $filter)

Registers a filter to be applied when a file of the given MIME type is found.

METHODS

new(%args)

magic

Returns the File::MMagic::XS object that is used by this object. Use it to set or modify options, e.g.:

  my $extract = File::Extract->new(...);
  $extract->magic->add_file_ext(t => 'text/perl-test');
  $extract->extract(...);

filters

A hashref of filters, keyed by MIME type, to be applied to a file before attempting to extract the text from it. Each value is an array ref of filter objects.

Here's a trivial example that prefixes each line with its line number before the text is extracted.

  use File::Extract;
  use File::Extract::Filter::Exec;

  my $extract = File::Extract->new(
    filters => {
      'text/plain' => [
        File::Extract::Filter::Exec->new(cmd => "perl -pe 's/^/\$. /'")
      ]
    }
  );
  my $r = $extract->extract($file);

processors

A list of processors to be used for this instance. This overrides any processors that were registered previously via the register_processor() class method.

encodings

List of encodings that you expect your files to be in. This is used to re-encode and normalize the contents of the file via Encode::Guess.
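To make the re-encoding step concrete, here is a minimal sketch of guessing an encoding from a list of candidates with Encode::Guess and converting the result to a target encoding. It is not the module's actual implementation; normalize_text() is a hypothetical helper that uses only the core Encode and Encode::Guess modules.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;    # exports guess_encoding()

# normalize_text() is a hypothetical helper, not part of File::Extract.
#   $bytes    raw file content
#   $suspects array ref of candidate encoding names
#   $output   name of the target encoding
sub normalize_text {
    my ($bytes, $suspects, $output) = @_;

    # guess_encoding() returns an encoding object on success and a
    # plain error string when no candidate, or more than one, matches.
    my $decoder = guess_encoding($bytes, @$suspects);
    die "Cannot guess encoding: $decoder" unless ref $decoder;

    my $text = $decoder->decode($bytes);    # bytes -> Perl characters
    return encode($output, $text);          # characters -> target bytes
}

# "\xA4\xA2\xA4\xA4" is two hiragana characters encoded as EUC-JP.
my $utf8 = normalize_text("\xA4\xA2\xA4\xA4", ['euc-jp'], 'UTF-8');
```

Note that guessing fails when the same bytes are valid in more than one candidate encoding, so a short list of candidates tends to work best.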

output_encoding

The final encoding that you want the extracted text to be in. The default encoding is UTF8.

extract($file)

SEE ALSO

File::MMagic::XS

AUTHOR

Copyright 2005-2007 Daisuke Maki <daisuke@endeworks.jp>. All rights reserved.

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

See http://www.perl.com/perl/misc/Artistic.html
