Module Version: 1.4

NAME

Lingua::NL::FactoidExtractor - A tool for extracting factoids from Dutch texts

SYNOPSIS

    use strict;
    use warnings;
    use lib "./lib";
    use Lingua::NL::FactoidExtractor;

    my $inputfile = "alpino.xml";  # XML file produced by the Alpino parser
    my $verbose   = 1;             # boolean flag
    my $factoids  = extract($inputfile, $verbose);

    print "$factoids\n";

PREREQUISITES

This module requires the Dutch parser Alpino. Alpino is available under the conditions of the GNU Lesser General Public License. See the Alpino home page.

DESCRIPTION

Around 30% of the clauses in Wikipedia are passive clauses, and in many cases a person is referred to by a pronoun. We want to ensure that "A number of family members were painted by Rembrandt" gives the same factoid as "Rembrandt painted a number of family members", and that "Rembrandt, who painted Biblical scenes" gives the same factoid as "Rembrandt painted Biblical scenes". For cases like these, the factoid extractor applies a number of transformations to the input clauses so that such variants yield the same factoid.


For sentences that consist of multiple clauses, multiple factoids are generated, e.g.

"Voor de onafhankelijkheid was Bangalore een belangrijke industriestad; meer recent is het een belangrijk centrum van de informatietechnologie in India geworden en wordt het wel de Silicon Valley van India genoemd."
(English: "Before independence, Bangalore was an important industrial city; more recently it has become an important centre of information technology in India, and it is sometimes called the Silicon Valley of India.")

Bangalore|IS|een belangrijke industriestad|Voor de onafhankelijkheid
het|IS|een belangrijk centrum van de informatietechnologie in India|meer recent
MEN|noem|het & de Silicon Valley van India|meer recent & wel
het|IS|de Silicon Valley van India
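
The factoid lines above can be split into their parts for further processing. The following is a minimal sketch, not part of the module: parse_factoid and split_items are hypothetical helpers, and the field names (subject, predicate, objects, modifiers) as well as the " & " item separator are assumptions inferred from the example output, since this documentation does not name the fields.

```perl
use strict;
use warnings;

# Hypothetical helper: split a field such as "het & de Silicon Valley van India"
# into its items. The " & " separator is inferred from the example output.
sub split_items {
    my ($field) = @_;
    return () unless defined $field && length $field;
    return split /\s+&\s+/, $field;
}

# Hypothetical helper: parse one factoid line into a hash reference.
# The field names are assumptions; the last field may be absent.
sub parse_factoid {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields.
    my ($subject, $predicate, $objects, $modifiers) = split /\|/, $line, -1;
    return {
        subject   => $subject,
        predicate => $predicate,
        objects   => [ split_items($objects) ],
        modifiers => [ split_items($modifiers) ],
    };
}

my $factoid = parse_factoid("MEN|noem|het & de Silicon Valley van India|meer recent & wel");
print join(", ", @{ $factoid->{objects} }), "\n";
```

For the line shown, this prints "het, de Silicon Valley van India". A factoid with only three fields, such as the last one in the example, simply gets an empty list of modifiers.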

KNOWN ISSUES

If punctuation such as a full stop or a comma is attached to a word in the Alpino output, that punctuation also ends up in the factoids extracted from the sentence. A workaround is to run a tokenizer that separates punctuation from words by whitespace before parsing the sentence.
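
The workaround can be approximated with a few regular expressions. This is a rough sketch only, not a replacement for a proper tokenizer: separate_punctuation is a hypothetical helper, the set of punctuation characters is an assumption, and abbreviations that end in a full stop will be split as well.

```perl
use strict;
use warnings;

# Hypothetical helper: insert whitespace between words and punctuation.
sub separate_punctuation {
    my ($text) = @_;
    # Detach sentence punctuation that is glued to the end of a word.
    # The lookahead leaves decimals such as "3.5" untouched.
    $text =~ s/(?<=\S)([.,;:!?]+)(?=\s|$)/ $1/g;
    # Surround double quotes and brackets with spaces.
    $text =~ s/(["()\[\]])/ $1 /g;
    # Collapse runs of whitespace and trim both ends.
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

# Illustrative input sentence (not taken from the module's test data).
print separate_punctuation("Rembrandt schilderde Bijbelse taferelen, en portretten."), "\n";
# prints: Rembrandt schilderde Bijbelse taferelen , en portretten .
```

The output of such a step would then be passed to the parser in place of the raw sentence.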

AUTHOR

Suzan Verberne, http://sverberne.ruhosting.nl

COPYRIGHT AND LICENSE

Copyright (C) 2012 by Suzan Verberne

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.7 or, at your option, any later version of Perl 5 you may have available.

CREDITS

This work was funded by Google by means of a European Digital Humanities Award.
