create_summary_corpus.pl - Script to create corpus for summary testing.
create_summary_corpus.pl [-d corpusDirectory] [-l languageCode] [-p maxProcesses] [-r] [-t n] [-h]
The script create_summary_corpus.pl makes a corpus for summarization testing using the featured articles of various Wikipedias.
All errors and warnings are logged using Log::Log4perl to the file corpusDirectory/languageCode/log/log.txt.
-d corpusDirectory
The option -d sets the directory in which to store the corpus of documents; the directory is created if it does not exist. The default is the current working directory.
A language subdirectory is created at corpusDirectory/languageCode that will contain the directories log, html, unparsable, text, and xml:

log - contains the file log.txt, to which all errors, warnings, and informational messages are logged using Log::Log4perl.

html - contains copies of the HTML versions of the featured article pages fetched using LWP.

text - contains two files for each article: one ends with _body.txt and contains the body text of the article; the other ends with _summary.txt and contains the summary.

unparsable - contains the HTML files that could not be parsed into body and summary sections.

xml - contains the XML versions of the articles.

The XML files are UTF-8 encoded; the text and HTML files are saved as UTF-8 octets.
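As a sketch, the layout produced by a hypothetical run with -d ./corpus and -l en would look like the tree below; this snippet merely recreates the empty directories for illustration (the script itself fills them with files):

```shell
# Illustrative only: the directory tree a run with -d ./corpus -l en builds.
mkdir -p ./corpus/en/log ./corpus/en/html ./corpus/en/unparsable
mkdir -p ./corpus/en/text ./corpus/en/xml
ls ./corpus/en
```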
-l languageCode
The option -l sets the language code of the Wikipedia from which the corpus of featured articles is to be created. The supported language codes are af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese. If the language code is all, then the corpus for each supported language is created (which takes a long time). The default is en.
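For example, the invocations below illustrate the options described so far (paths are placeholders, and the script is assumed to be in the current directory):

```shell
# Build the German featured-article corpus under ./corpus:
perl create_summary_corpus.pl -d ./corpus -l de

# Build corpora for every supported language (slow):
perl create_summary_corpus.pl -d ./corpus -l all
```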
-p maxProcesses
The option -p sets the maximum number of processes that may run simultaneously to parse the files. Parsing the files into summary and body sections can be computationally intensive, so the module Forks::Super is used for parallelization. The default is one.
-r
The option -r causes only the text and XML files to be regenerated from the HTML files that have already been fetched; no new pages are downloaded.
-h
Prints this documentation.
-t 0
The option -t initiates testing mode; only the specified number of pages are fetched and parsed. The default is zero, indicating no testing: all available pages are fetched and parsed.
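The options above can be combined; for example (illustrative invocations only, with placeholder paths):

```shell
# Testing mode: fetch and parse only 5 pages, using up to 4 parsing processes:
perl create_summary_corpus.pl -d ./corpus -l en -p 4 -t 5

# Rebuild the text and XML files from already-fetched HTML without downloading:
perl create_summary_corpus.pl -d ./corpus -l en -r
```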
This script creates corpora by parsing Wikipedia pages; the XPath expressions used to extract links and text will become invalid as the formats of the various pages change, which may cause some corpora not to be created.
Please email bug reports or feature requests to bug-text-corpus-summaries-wikipedia@rt.cpan.org, or use the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Corpus-Summaries-Wikipedia. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.
Jeff Kubina <jeff.kubina@gmail.com>
Copyright (c) 2010 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The full text of the license can be found in the LICENSE file included with this module.
corpus, information processing, summaries, summarization, wikipedia
Forks::Super, Log::Log4perl, Text::Corpus::Summaries::Wikipedia
Links to the featured article page for the supported language codes: af:Afrikaans, ar:Arabic, az:Azerbaijani, bg:Bulgarian, bs:Bosnian, ca:Catalan, cs:Czech, de:German, el:Greek, en:English, eo:Esperanto, es:Spanish, eu:Basque, fa:Persian, fi:Finnish, fr:French, he:Hebrew, hr:Croatian, hu:Hungarian, id:Indonesian, it:Italian, ja:Japanese, jv:Javanese, ka:Georgian, kk:Kazakh, km:Khmer, ko:Korean, li:Limburgish, lv:Latvian, ml:Malayalam, mr:Marathi, ms:Malay, mzn:Mazandarani, nl:Dutch, nn:Norwegian (Nynorsk), no:Norwegian (Bokmål), pl:Polish, pt:Portuguese, ro:Romanian, ru:Russian, sh:Serbo-Croatian, simple:Simple English, sk:Slovak, sl:Slovenian, sr:Serbian, sv:Swedish, sw:Swahili, ta:Tamil, th:Thai, tl:Tagalog, tr:Turkish, tt:Tatar, uk:Ukrainian, ur:Urdu, vi:Vietnamese, vo:Volapük, and zh:Chinese.
Copies of the data sets generated in May 2010 and February 2013 can be downloaded here.
To install Text::Corpus::Summaries::Wikipedia, copy and paste the appropriate command into your terminal.
cpanm
cpanm Text::Corpus::Summaries::Wikipedia
CPAN shell
perl -MCPAN -e shell
install Text::Corpus::Summaries::Wikipedia
For more information on module installation, please visit the detailed CPAN module installation guide.