NAME

rem-boilerplate-text

VERSION

Version 0.2

SYNOPSIS

        > rem-boilerplate-text [options] <list of files>

E.g.

        > rem-boilerplate-text --min_dupl=6 intranet/txt/*.txt 

DESCRIPTION

Removes repeated text from a set of files.

Note that the system only works when more than one file is specified, since boilerplate text is detected based on repetition across files.

New files are written, with a suffix appended to the original filenames.
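The idea of detecting boilerplate by repetition across files can be sketched in a few lines of Python. This is only an illustration, not the script's actual implementation: the function name `remove_repeated_lines` and the in-memory `docs` mapping are hypothetical, and counting each line at most once per file is an assumption.

```python
from collections import Counter

def remove_repeated_lines(docs, min_dupl=3):
    """docs maps a file name to its list of lines; returns a cleaned copy."""
    counts = Counter()
    for lines in docs.values():
        # Assumption: a line counts once per file, however often it recurs there.
        counts.update(set(lines))
    boilerplate = {line for line, n in counts.items() if n >= min_dupl}
    return {name: [line for line in lines if line not in boilerplate]
            for name, lines in docs.items()}

docs = {
    "a.txt": ["Intranet home", "Report on sales", "Contact webmaster"],
    "b.txt": ["Intranet home", "Meeting notes", "Contact webmaster"],
    "c.txt": ["Intranet home", "Lunch menu", "Contact webmaster"],
}
cleaned = remove_repeated_lines(docs, min_dupl=3)
```

With a single input file no line can reach the threshold, which is why more than one file is needed.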

OPTIONS

-m, --min_dupl

The minimum number of times a line has to occur to be considered boilerplate (default: 3). Can be either an integer or a percentage ('50 %') of the number of files processed. Minimum value: 2.
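One way such a value could be turned into an absolute threshold is sketched below in Python. The helper `resolve_min_dupl` is hypothetical. Rounding percentages up, and silently raising values below 2 to the minimum, are both assumptions, since the documentation does not say how the script handles either case.

```python
import math
import re

def resolve_min_dupl(value, n_files):
    """Turn an integer or a percentage string such as '50 %' into a line count."""
    text = str(value).strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*%", text)
    if match:
        # Assumption: a fractional result is rounded up.
        threshold = math.ceil(float(match.group(1)) * n_files / 100)
    else:
        threshold = int(text)
    # Assumption: values below the documented minimum of 2 are clamped.
    return max(2, threshold)
```

For example, `resolve_min_dupl("50 %", 10)` gives 5 under these assumptions.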

-i, --ignore_digits

Lines that differ only in their digits are considered duplicates (default: yes).
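A common way to get this behaviour is to normalise digits before comparing lines. The Python sketch below is illustrative only: `line_key` is a hypothetical helper, and collapsing each run of digits into a single placeholder (so that "Page 9" and "Page 10" match) is an assumption about how differences in digits are treated.

```python
import re

def line_key(line, ignore_digits=True):
    """Return the form of a line that is used when looking for duplicates."""
    if ignore_digits:
        # Assumption: every run of digits collapses to one placeholder.
        return re.sub(r"\d+", "0", line)
    return line
```

Lines such as "Page 3 of 12" and "Page 10 of 12" then share one key, so page counters and dates no longer hide repeated headers.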

-s, --suffix

Suffix appended to the names of the new files (default: 'content').

-o, --only_headers_and_footers

Only sets of consecutive duplicate lines at the start and end of documents are considered boilerplate (default: yes).
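The restriction to headers and footers can be pictured as trimming from both ends of a document and stopping at the first line that is not a known duplicate. This is a minimal Python sketch under that reading. `strip_headers_and_footers` and the `boilerplate` set are hypothetical names, not part of the script.

```python
def strip_headers_and_footers(lines, boilerplate):
    """Drop leading and trailing runs of boilerplate lines, keep the middle."""
    start = 0
    while start < len(lines) and lines[start] in boilerplate:
        start += 1
    end = len(lines)
    while end > start and lines[end - 1] in boilerplate:
        end -= 1
    # Duplicates inside the body are deliberately left untouched.
    return lines[start:end]
```

A repeated line in the middle of a document, such as a common phrase in running text, survives under this policy.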

-d, --digest

Lines will be replaced by an MD5 digest during duplicate compilation, saving memory (default: no).
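The memory saving comes from storing a fixed-size digest instead of the full line text while counting. A rough Python sketch of the idea follows. The names `digest_key` and `count_lines` are hypothetical, and UTF-8 encoding of the input is an assumption.

```python
import hashlib
from collections import Counter

def digest_key(line):
    """Return the 16-byte MD5 digest of a line (UTF-8 is assumed)."""
    return hashlib.md5(line.encode("utf-8")).digest()

def count_lines(lines, use_digest=False):
    """Count occurrences, keyed by digest or by the line itself."""
    counts = Counter()
    for line in lines:
        counts[digest_key(line) if use_digest else line] += 1
    return counts
```

Each key then takes 16 bytes regardless of line length, at the cost of one hash computation per line and a negligible chance of collisions.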

-l, --log

Name of the log file, where deleted lines are recorded; if set to false, no log will be created (default: './text-identify-boilerplate.log').

-h, --help

Display usage information.

-v, --verbose

Be verbose.

AUTHOR

Lars Nygaard, <lars.nygaard@inl.uio.no>

COPYRIGHT & LICENSE

Copyright 2005 Lars Nygaard, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
