update_Text_Corpus_VoiceOfAmerica.pl - Script to update VOA news article corpus.
update_Text_Corpus_VoiceOfAmerica.pl [-d corpusDirectory -c -t -h]
All errors and warnings are logged using Log::Log4perl to the file
-d sets the cache directory for the corpus of documents. If the directory does not exist, it will be created. The default is a directory named
'corpus_voa' in the current working directory.
If the option -t is present, parsing tests will be performed on all the documents in the cache.
If the option -v is present, then after each new document is fetched a message is logged stating the number of documents remaining to fetch and the approximate time to completion.
-h causes documentation to be printed.
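The -d default and the -v time estimate described above can be illustrated with a small shell sketch. Both helper names are hypothetical and are not part of the script. The estimate formula (average seconds per fetched document multiplied by the documents remaining) is an assumption about one plausible approach, not the script's confirmed method.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not part of the script.

# Mirror the documented -d behaviour: fall back to 'corpus_voa' in the
# current working directory and create the directory if it is missing.
resolve_corpus_dir() {
    dir="${1:-corpus_voa}"
    mkdir -p "$dir" || return 1
    printf '%s\n' "$dir"
}

# Assumed estimate for the -v message: average seconds per fetched
# document multiplied by the number of documents still to fetch.
estimate_seconds_left() {
    fetched=$1
    elapsed=$2
    remaining=$3
    # No documents fetched yet: no average is available, so report 0.
    [ "$fetched" -gt 0 ] || { echo 0; return; }
    echo $(( elapsed * remaining / fetched ))
}
```

For example, `estimate_seconds_left 40 120 100` prints 300: 40 documents took 120 seconds, so 100 more should take about 300 seconds at the same rate.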
This script uses XPath expressions to extract links and text; these may become invalid when the format of the source pages changes, which can cause bugs.
Please email bug reports or feature requests to
firstname.lastname@example.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Text-Corpus-VoiceOfAmerica. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.