Text::FromAny - a module to read pure text from a variety of formats
    my $tFromAny = Text::FromAny->new(file => '/some/text/file');
    my $text = $tFromAny->text;
Text::FromAny can currently read the following formats:
    Portable Document format - PDF
    Legacy/binary MSWord .doc
    OpenDocument Text
    Legacy OpenOffice.org writer
    "Office Open XML" text
    Rich text format - RTF
    (X)HTML
    Plaintext
Attributes can be supplied to the new constructor, and can also be set later by calling $object->attribute(value). The "file" attribute MUST be supplied during construction.
file: The file to read. It MUST be supplied at construction time (and cannot be changed later). It can be in any of the supported formats. If the format is unsupported or unknown, the object will still work, but ->text will return undef.
allowGuess: A boolean, defaulting to true. If Text::FromAny is unable to properly detect the filetype, it will fall back to guessing the filetype from the file extension. Set this to false to disable the fallback.
The default for allowGuess is subject to change in later versions, so if you depend on it being either on or off, you are best off explicitly requesting that behaviour, rather than relying on the defaults.
allowExternal: A boolean, defaulting to false. If set to true and the Perl-based PDF reading method (CAM::PDF) fails, Text::FromAny will fall back to calling the system pdftotext(1) to get the text. CAM::PDF reads most PDFs but has trouble with a select few, and those can be handled by pdftotext(1) from the Poppler library.
The default for allowExternal is subject to change in later versions, so if you depend on it being either on or off, you are best off explicitly requesting that behaviour, rather than relying on the defaults.
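Because both defaults may change, a caller that depends on them can pin them at construction time. The following is a minimal sketch, assuming Text::FromAny is installed; the helper name open_reader and the command-line handling are illustrative only and not part of the module.

```perl
use strict;
use warnings;
use Text::FromAny;

# open_reader is a hypothetical helper, not part of Text::FromAny.
# It sets both attributes explicitly so the behaviour does not
# depend on whatever the defaults happen to be.
sub open_reader {
    my ($path) = @_;
    return Text::FromAny->new(
        file          => $path,
        allowGuess    => 1,    # keep the file-extension fallback
        allowExternal => 0,    # never call the system pdftotext(1)
    );
}

# When run as a script, print the text of the file named on the command line.
if (@ARGV) {
    my $text = open_reader($ARGV[0])->text;
    print defined $text ? $text : "unknown or unsupported format\n";
}
```

Passing the values to the constructor, rather than setting them afterwards, means the file is only read once with the intended settings.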
Returns the text contained in the file, or undef if the file format is unknown or unsupported.
Normally Text::FromAny will only read the file once, and then cache the text. However, if you change the value of either the allowGuess or allowExternal attribute, Text::FromAny will re-read the file, as those attributes can affect how a file is read.
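The re-read behaviour makes it possible to start strict and relax only when needed. The sketch below, which assumes Text::FromAny is installed, first asks for the text with extension guessing disabled and turns it on only if that returns undef. The helper name text_with_fallback is hypothetical.

```perl
use strict;
use warnings;
use Text::FromAny;

# text_with_fallback is a hypothetical helper, not part of Text::FromAny.
sub text_with_fallback {
    my ($path) = @_;
    my $reader = Text::FromAny->new(file => $path, allowGuess => 0);

    # First call: the file is read and the result is cached.
    my $text = $reader->text;
    return $text if defined $text;

    # Changing allowGuess makes the next ->text call read the file again,
    # this time with the file-extension fallback enabled.
    $reader->allowGuess(1);
    return $reader->text;    # still undef if the format is unsupported
}
```

Callers should still check the result with defined, since undef remains a valid return value for unknown or unsupported files.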
Returns the detected filetype (or undef if unknown or unsupported). The filetype is returned as a string, and can be any of the following:
    pdf  => PDF
    odt  => OpenDocument text
    sxw  => Legacy OpenOffice.org Writer
    doc  => msword
    docx => "Open XML"
    rtf  => RTF
    txt  => Cleartext
    html => HTML (or XHTML)
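Since the filetype comes back as a plain string or undef, it can be mapped to a label with an ordinary hash lookup. A small sketch follows; describe_type and %TYPE_LABEL are hypothetical names, and the block deliberately does not load Text::FromAny so that it runs on its own.

```perl
use strict;
use warnings;

# Labels copied from the list of filetype strings above.
my %TYPE_LABEL = (
    pdf  => 'PDF',
    odt  => 'OpenDocument text',
    sxw  => 'Legacy OpenOffice.org Writer',
    doc  => 'msword',
    docx => '"Open XML"',
    rtf  => 'RTF',
    txt  => 'Cleartext',
    html => 'HTML (or XHTML)',
);

# describe_type is a hypothetical helper: it turns a filetype string
# (or undef) into a human-readable label.
sub describe_type {
    my ($type) = @_;
    return 'unknown or unsupported' unless defined $type;
    return exists $TYPE_LABEL{$type}
        ? $TYPE_LABEL{$type}
        : "unrecognised type string '$type'";
}
```

With a reader object this might be called as describe_type($reader->detectedType); the accessor name detectedType is an assumption here and should be checked against the installed version of the module.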
Please report any bugs or feature requests to http://github.com/portu/Text-FromAny/issues.
Eskild Hustvedt, <firstname.lastname@example.org>
Copyright (C) 2010 by Eskild Hustvedt
This library is free software; you can redistribute it and/or modify it under the terms of either:
a) the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version, or
b) the "Artistic License" which comes with this Kit.
This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See either the GNU General Public License or the Artistic License for more details.
You should have received a copy of the Artistic License in the file named "COPYING.artistic". If not, I'll be glad to provide one.
You should also have received a copy of the GNU General Public License along with this library in the file named "COPYING.gpl". If not, see <http://www.gnu.org/licenses/>.