File::Extract - Extract Text From Arbitrary File Types
    use File::Extract;

    my $e = File::Extract->new();
    my $r = $e->extract($filename);

    # Or, with a list of expected encodings
    my $e = File::Extract->new(encodings => [...]);

    # Register a custom processor class
    my $class = "MyExtractor";
    File::Extract->register_processor($class);

    # Register a filter for a given MIME type
    my $filter = MyCustomFilter->new;
    File::Extract->register_filter($mime_type => $filter);
File::Extract is a framework to extract text data out of arbitrary file types, useful to collect data for indexing.
Registers a new text extractor. The processor is used as the default processor for a given MIME type, but it can be overridden by specifying the 'processors' parameter to new().
The specified class needs to implement two functions:
Returns the MIME type that $class can extract files from.
Extracts the text from $file. Returns a File::Extract::Result object.
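As an illustration of the two-function interface described above, here is a minimal sketch of a processor class. Several things in it are assumptions rather than documented facts: the method names mime_type() and extract(), the simple new() constructor, the made-up MIME type 'text/x-mylog', and the text, filename and mime_type arguments passed to File::Extract::Result->new. Check the File::Extract::Result source before relying on them.

```perl
package MyExtractor;
use strict;
use warnings;

# Assumption: File::Extract instantiates processors through new().
sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

# The MIME type this class can extract text from (hypothetical type).
sub mime_type { 'text/x-mylog' }

# Read the whole file and wrap its contents in a result object.
sub extract {
    my ($self, $file) = @_;

    # Loaded lazily so the class compiles even where File::Extract
    # is not installed.
    require File::Extract::Result;

    open my $fh, '<', $file or die "Failed to open $file: $!";
    my $text = do { local $/; <$fh> };
    close $fh;

    # Assumption: the constructor accepts these named arguments.
    return File::Extract::Result->new(
        text      => $text,
        filename  => $file,
        mime_type => $self->mime_type,
    );
}

1;
```

Such a class would then be registered with File::Extract->register_processor('MyExtractor').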
Registers a filter to be used when a particular MIME type has been found.
Returns the File::MMagic::XS object used by this object. Use it to modify the detector, set options, and so on. For example:
    my $extract = File::Extract->new(...);
    $extract->magic->add_file_ext(t => 'text/perl-test');
    $extract->extract(...);
A hashref of filters, keyed by MIME type, to be applied to a file before attempting to extract its text.
Here's a trivial example that puts a line number at the beginning of each line before the text is extracted.
    use File::Extract;
    use File::Extract::Filter::Exec;

    my $extract = File::Extract->new(
        filters => {
            'text/plain' => [
                File::Extract::Filter::Exec->new(cmd => "perl -pe 's/^/\$. /'")
            ]
        }
    );
    my $r = $extract->extract($file);
A list of processors to be used for this instance. This overrides any processors that were registered previously via the register_processor() class method.
List of encodings that you expect your files to be in. This is used to re-encode and normalize the contents of the file via Encode::Guess.
The final encoding that you want the extracted text to be in. The default encoding is UTF8.
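To make the re-encoding step more concrete, the following sketch shows roughly what guessing and normalizing with Encode::Guess can look like. It is not File::Extract's actual implementation: normalize_text() is a hypothetical helper, and the sample data and suspect list are chosen only for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);
use Encode::Guess;

# Hypothetical helper, not part of File::Extract: guess the encoding of
# $bytes among the suspects and return the text re-encoded as $output.
sub normalize_text {
    my ($bytes, $suspects, $output) = @_;
    $output ||= 'UTF-8';

    # guess_encoding() returns an encoding object on success, or an
    # error string when no suspect fits or the guess is ambiguous.
    my $decoder = Encode::Guess::guess_encoding($bytes, @$suspects);
    die "Cannot guess encoding: $decoder" unless ref $decoder;

    # Decode to Perl characters, then encode into the target encoding.
    return encode($output, $decoder->decode($bytes));
}

# Three Japanese characters stored as EUC-JP bytes.
my $euc  = encode('euc-jp', "\x{65E5}\x{672C}\x{8A9E}");
my $utf8 = normalize_text($euc, ['euc-jp'], 'UTF-8');
```

Keeping the suspect list short matters here: the more candidate encodings are given, the more likely it is that a byte sequence is valid in several of them and the guess is reported as ambiguous.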
File::MMagic::XS
Copyright 2005-2007 Daisuke Maki <daisuke@endeworks.jp>. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
See http://www.perl.com/perl/misc/Artistic.html