pdfgetext - get text from pdf and resort to ocr as needed
Get all text out of a pdf, even from images.
This is basically a command-line interface to PDF::OCR::Thorough.
-f force extracting images and running OCR even if pdftotext finds content
-d debug on
-o output file, absolute path (text file), instead of STDOUT
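The fallback described above can be sketched in shell. This is only an illustration of the idea, not how pdfgetext or PDF::OCR::Thorough is actually implemented: pdftotext is named in the option text, but the use of pdfimages and tesseract for the image and OCR steps, and the helper names has_text and get_text, are assumptions. The sketch also assumes that forcing OCR prints only the OCR result, which this page does not state.

```shell
#!/bin/sh
# Sketch only. pdfimages and tesseract are assumed stand-ins for the
# image extraction and OCR steps; has_text and get_text are hypothetical.

# Succeeds if the string contains at least one non-whitespace character.
has_text() {
    case "$1" in
        *[![:space:]]*) return 0 ;;
        *) return 1 ;;
    esac
}

# get_text FILE [force]
# Print the embedded text if there is any; otherwise (or when the second
# argument is "force", mirroring -f) OCR every image found in the PDF.
get_text() {
    pdf=$1
    force=${2:-}
    text=$(pdftotext "$pdf" - 2>/dev/null)
    if [ "$force" != "force" ] && has_text "$text"; then
        printf '%s\n' "$text"
        return 0
    fi
    tmp=$(mktemp -d) || return 1
    pdfimages -png "$pdf" "$tmp/img"
    for img in "$tmp"/img*.png; do
        [ -e "$img" ] || continue      # no images were extracted
        tesseract "$img" stdout 2>/dev/null
    done
    rm -rf "$tmp"
}
```

With these functions loaded, get_text /home/myself/brochure.pdf would print embedded text when it exists, and get_text /home/myself/brochure.pdf force would go straight to OCR.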
Standard usage:
pdfgetext /home/myself/brochure.pdf
If you want to save to a text file:
pdfgetext -o /home/myself/brochure.txt /home/myself/brochure.pdf
If you want to see extra debug info:
pdfgetext -d /home/myself/brochure.pdf
Another way to save to a text file:
pdfgetext /home/myself/brochure.pdf > /home/myself/output
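To convert a whole directory, the -o form can be wrapped in a loop. The following is a minimal sketch: txt_path_for and convert_dir are hypothetical helper names, not part of pdfgetext. Because -o expects an absolute path, the helper resolves the directory of each PDF before building the output name.

```shell
#!/bin/sh
# Hypothetical helper: print the absolute path of a .txt file that sits
# next to the given PDF (brochure.pdf -> /abs/dir/brochure.txt).
txt_path_for() {
    dir=$(cd "$(dirname "$1")" && pwd) || return 1
    printf '%s/%s.txt\n' "$dir" "$(basename "$1" .pdf)"
}

# Hypothetical helper: run pdfgetext on every PDF in a directory,
# writing one text file per PDF.
convert_dir() {
    for pdf in "$1"/*.pdf; do
        [ -e "$pdf" ] || continue      # directory holds no PDFs
        pdfgetext -o "$(txt_path_for "$pdf")" "$pdf"
    done
}
```

For example, convert_dir /home/myself would write /home/myself/brochure.txt for /home/myself/brochure.pdf.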
PDF::OCR, PDF::OCR::Thorough, PDF::API2
Leo Charre leocharre at cpan dot org
To install PDF::OCR, copy and paste the appropriate command into your terminal.
cpanm
cpanm PDF::OCR
CPAN shell
perl -MCPAN -e shell
install PDF::OCR