NAME

Text::Corpus::VoiceOfAmerica::Document - Parse a VOA article for research.

SYNOPSIS

  use Cwd;
  use File::Spec;
  use Text::Corpus::VoiceOfAmerica;
  use Data::Dump qw(dump);
  use Log::Log4perl qw(:easy);
  Log::Log4perl->easy_init ($INFO);
  my $corpusDirectory = File::Spec->catfile (getcwd(), 'corpus_voa');
  my $corpus = Text::Corpus::VoiceOfAmerica->new (corpusDirectory => $corpusDirectory);
  $corpus->update (verbose => 1);
  my $document = $corpus->getDocument (index => 0);
  dump $document->getBody;
  dump $document->getCategories;
  dump $document->getContent;
  dump $document->getDate;
  dump $document->getDescription;
  dump $document->getTitle;
  dump $document->getUri;

DESCRIPTION

Text::Corpus::VoiceOfAmerica::Document provides methods for accessing the content of VOA news articles for researching and testing information processing techniques. Read the Voice of America's Terms of Use statement to ensure you abide by it when using this module.

CONSTRUCTOR

`new`

The constructor new creates an instance of the Text::Corpus::VoiceOfAmerica::Document class with the following parameters:

htmlContent

  htmlContent => '...'

htmlContent is a string of the HTML of the document to be parsed.

uri

  uri => '...'

uri is the URL of the HTML content provided by htmlContent; it is also returned by getUri as the document's unique identifier.

METHODS

`getBody`

  getBody ()

getBody returns an array reference of strings, each a sentence from the body of the article.

`getCategories`

  getCategories ()

getCategories returns an array reference of strings of categories assigned to the article. They are the phrases and words from the /html/head/meta[@name="KEYWORDS"] field in the HTML of the document.
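As a rough illustration of where these values live, the sketch below evaluates the same XPath expression against a small, made-up HTML string using XML::LibXML. It is not a description of how this module parses pages internally: the sample HTML, the choice of XML::LibXML, and the splitting of keywords on commas are all assumptions made for the example.

```perl
use strict;
use warnings;
use XML::LibXML;

# Made-up HTML for illustration only; real VOA pages will differ.
my $html = <<'HTML';
<html>
  <head>
    <title>Sample article</title>
    <meta name="KEYWORDS" content="news, science, space exploration">
  </head>
  <body><p>Body text.</p></body>
</html>
HTML

# Parse leniently, since real-world HTML is rarely well-formed.
my $dom = XML::LibXML->load_html (
  string          => $html,
  recover         => 1,
  suppress_errors => 1,
);

# Read the content attribute of the KEYWORDS meta element.
my $keywords = $dom->findvalue ('/html/head/meta[@name="KEYWORDS"]/@content');

# Assumption: keywords are comma-separated. Trim whitespace, drop empties.
my @categories;
foreach my $keyword (split /,/, $keywords) {
  $keyword =~ s/^\s+|\s+$//g;
  push @categories, $keyword if length $keyword;
}

print "$_\n" for @categories;
```

If the page layout changes and the meta element is missing, findvalue returns an empty string and @categories ends up empty.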

`getContent`

  getContent ()

getContent returns an array reference of strings of sentences that form the content of the article, that is, its title and body.

`getDate`

  getDate (format => '%g')

getDate returns the date and time of the article in the format specified by format, which uses the printf directives of Date::Manip::Date. The default is to return the date and time in RFC2822 format.
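To show what a format string looks like, the sketch below formats a made-up timestamp with Date::Manip::Date directly, using the same kind of printf directives that getDate accepts. The timestamp is invented for the example, and calling Date::Manip::Date by hand is only an illustration, not a description of what getDate does internally.

```perl
use strict;
use warnings;
use Date::Manip::Date;

my $date = Date::Manip::Date->new;

# parse returns a true value on failure.
my $error = $date->parse ('2010-03-15 12:30:00');
die "could not parse date: " . $date->err . "\n" if $error;

# %Y, %m and %d are the four-digit year, two-digit month and two-digit day.
my $isoDay = $date->printf ('%Y-%m-%d');

# %g is the directive used as the default in the getDate signature above.
my $mailStyle = $date->printf ('%g');

print "$isoDay\n";
print "$mailStyle\n";
```

With a document object, the equivalent call would be getDate (format => '%Y-%m-%d').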

`getDescription`

  getDescription ()

getDescription returns an array reference of strings of sentences, usually one, that describe the article's content. It is taken from the /html/head/meta[@name="description"] field in the HTML of the document.

`getEncoding`

  getEncoding ()

getEncoding returns the original encoding used by the HTML of the document.

`getHtml`

  getHtml ()

getHtml returns the HTML of the document as a string.

`getTitle`

  getTitle ()

getTitle returns an array reference of strings, usually one, of the title of the article.

`getUri`

  getUri ()

getUri returns the URL of the document.

INSTALLATION

For installation instructions see Text::Corpus::VoiceOfAmerica.

BUGS

This module uses XPath expressions to extract links and text. These expressions may become invalid when the format of the pages changes, which can cause extraction to fail.
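Because of this, it may be worth checking that the accessors returned something before using the results. The helper below is a minimal sketch using only core Perl. The name hasText is hypothetical and not part of this module, and the sketch assumes that a stale expression shows up as an undefined value or an empty array reference, which this documentation does not state.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the argument is an array reference
# holding at least one string with a non-whitespace character, else 0.
sub hasText {
  my ($strings) = @_;
  return 0 unless ref ($strings) eq 'ARRAY';
  return (grep { defined $_ && /\S/ } @$strings) ? 1 : 0;
}

# Stand-ins for values returned by accessors such as getTitle or getBody.
my $sampleTitle = ['Sample headline'];
my $sampleBody  = [];

my $titleStatus = hasText ($sampleTitle) ? 'ok' : 'missing';
my $bodyStatus  = hasText ($sampleBody)  ? 'ok' : 'missing';

print "title $titleStatus\n";
print "body $bodyStatus\n";
```

A check like this could be run on each document after an update, so that a page layout change is noticed before empty results reach later processing.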

Please email bug reports or feature requests to text-corpus-voiceofamerica@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/Public/Bug/Report.html?Queue=Text-Corpus-VoiceOfAmerica. The author will be notified, and you can be notified automatically of progress on the bug fix or feature request.

AUTHOR

 Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

corpus, english corpus, information processing, voa, voice of america
