However, around 30% of the clauses in Wikipedia are passive, and in many cases a person is referred to by a pronoun. We want "A number of family members were painted by Rembrandt" to yield the same factoid as "Rembrandt painted a number of family members", and "Rembrandt, who painted Biblical scenes" to yield the same factoid as "Rembrandt painted Biblical scenes". For cases like these, our factoid extractor applies a number of transformations to the input clauses. We implemented the following transformations:
Passive-to-active: Passive clauses are transformed into active clauses, in which the subject of the passive clause takes the object position. If the sentence names no actor, the subject slot is filled with the empty actor 'MEN' (ONE), e.g.
"De luchthaven werd op 8 juli 1964 geopend" The airport was opened on July 8th, 1964
MEN|open|de luchthaven|op 8 juli 1964
Modifier-to-subject: If a passive clause contains a modifier starting with 'door' (by), then this modifier is moved to the subject slot, e.g.
"De instrumenten werden opnieuw ingespeeld door de bandleden" The instruments were recorded again by the band members
de bandleden|speel_in|de instrumenten|opnieuw
Copula-to-definition: If the verb of a clause is a copular verb (e.g. become), then the object of the clause is considered to be a description of the subject. These factoids are transformed to definitions with the verb IS, e.g.
"Rome werd opnieuw de hoofdstad van Italië" Rome became the capital of Italy again
Rome|IS|de hoofdstad van Italië|opnieuw
Double-object-to-definition: For clauses that have two objects, a factoid is generated that connects both objects, e.g.
"De behandeling van Crohn wordt symptomatisch genoemd" The treatment of Crohn's disease is called symptomatic
de behandeling van Crohn|IS|symptomatisch|
Pron-to-np: If the subject or object of a clause is a relative pronoun, then we replace it with the most recent noun phrase. This is a very local form of anaphora resolution, e.g.
"De voornaamste vertegenwoordiger was Rembrandt, die veel Bijbelse taferelen schilderde." The main representative was Rembrandt, who painted many Biblical scenes.
de voornaamste vertegenwoordiger|IS|Rembrandt
Rembrandt|schilder|veel Bijbelse taferelen|
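The passive-to-active, modifier-to-subject and copula-to-definition transformations can be sketched in a few lines of Python. This is an illustrative sketch only, not the extractor's own code: the Clause record, the COPULAS lemma set and the function names are hypothetical, and a real implementation would work on Alpino dependency structures rather than on pre-filled slots. Joining modifiers with ' & ' is an assumption based on the factoid examples in this document.

```python
from dataclasses import dataclass, field, replace
from typing import List, Optional

EMPTY_ACTOR = "MEN"            # placeholder subject when a passive clause names no actor
COPULAS = {"word", "ben"}      # assumed copula lemmas, for illustration only


@dataclass
class Clause:
    """Hypothetical, simplified clause record (not the Alpino output format)."""
    subject: Optional[str]
    verb: str                                  # verb lemma
    obj: Optional[str] = None
    modifiers: List[str] = field(default_factory=list)
    passive: bool = False


def passive_to_active(clause: Clause) -> Clause:
    """Move the passive subject to the object slot and fill the subject slot."""
    if not clause.passive:
        return clause
    actor = EMPTY_ACTOR
    remaining = []
    for mod in clause.modifiers:
        # Modifier-to-subject: the first 'door ...' (by ...) modifier names the actor.
        if actor == EMPTY_ACTOR and mod.lower().startswith("door "):
            actor = mod[len("door "):]
        else:
            remaining.append(mod)
    return Clause(actor, clause.verb, clause.subject, remaining, passive=False)


def copula_to_definition(clause: Clause) -> Clause:
    """Rewrite a copular clause as a definition with the verb IS."""
    if clause.verb in COPULAS and clause.obj is not None:
        return replace(clause, verb="IS")
    return clause


def to_factoid(clause: Clause) -> str:
    """Apply the transformations and print subject|verb|object|modifiers."""
    clause = copula_to_definition(passive_to_active(clause))
    return "|".join([
        clause.subject or "",
        clause.verb,
        clause.obj or "",
        " & ".join(clause.modifiers),   # assumed separator for multiple modifiers
    ])
```

For instance, to_factoid(Clause("de luchthaven", "open", None, ["op 8 juli 1964"], passive=True)) returns the string MEN|open|de luchthaven|op 8 juli 1964, matching the passive-to-active example.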
For sentences that consist of multiple clauses, multiple factoids are generated, e.g.
"Voor de onafhankelijkheid was Bangalore een belangrijke industriestad; meer recent is het een belangrijk centrum van de informatietechnologie in India geworden en wordt het wel de Silicon Valley van India genoemd." Before its independence, Bangalore was an important industrial city; more recently it became an important centre of information technology in India and it is called the Silicon Valley of India.
Bangalore|IS|een belangrijke industriestad|Voor de onafhankelijkheid
het|IS|een belangrijk centrum van de informatietechnologie in India|meer recent
MEN|noem|het & de Silicon Valley van India|meer recent & wel
het|IS|de Silicon Valley van India
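Downstream code that consumes these factoid lines needs to split them into fields. The sketch below assumes, based only on the examples in this document and not on a format specification, that a line has up to four pipe-separated fields (subject, verb, object, modifiers), that a trailing empty field may be omitted, and that ' & ' joins multiple values within a field. The names Factoid and parse_factoid are hypothetical.

```python
from typing import List, NamedTuple


class Factoid(NamedTuple):
    subject: str
    verb: str
    objects: List[str]
    modifiers: List[str]


def _split_amp(value: str) -> List[str]:
    """Split a field on '&' and drop empty parts (assumed multi-value separator)."""
    return [part.strip() for part in value.split("&") if part.strip()]


def parse_factoid(line: str) -> Factoid:
    """Parse one subject|verb|object|modifiers line into a Factoid."""
    fields = line.rstrip("\n").split("|")
    fields += [""] * (4 - len(fields))      # pad omitted trailing fields
    subject, verb, obj, mods = (f.strip() for f in fields[:4])
    return Factoid(subject, verb, _split_amp(obj), _split_amp(mods))
```

Note that splitting on '&' would also break up a noun phrase that itself contains an ampersand, so this is only suitable as a starting point.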
If punctuation such as a full stop or a comma is attached to a word in the Alpino output, this punctuation also ends up in the factoids extracted from the sentence. A workaround is to run a tokenizer that separates punctuation from words with whitespace before parsing the sentence.
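A minimal sketch of such a pre-tokenization step is shown below. It is a naive regular-expression approach, not a full tokenizer: the function name is hypothetical, and it will also split decimal numbers and abbreviations, which a proper tokenizer would keep intact.

```python
import re

# Punctuation marks to detach from neighbouring words (an illustrative selection).
_PUNCT = re.compile(r'([.,;:!?()"])')


def separate_punctuation(sentence: str) -> str:
    """Surround each punctuation mark with spaces, then normalise whitespace."""
    spaced = _PUNCT.sub(r" \1 ", sentence)
    return " ".join(spaced.split())
```

Applied to "Rembrandt, die veel schilderde." this yields "Rembrandt , die veel schilderde ." so that the comma and full stop become separate tokens.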
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.7 or, at your option, any later version of Perl 5 you may have available.