Lingua::LinkParser::MatchPath - Match paths in linkage diagrams
# If you need to see debugging messages, uncomment the line below.
# $Lingua::LinkParser::MatchPath::DEBUG = 1;

use Lingua::LinkParser;
use Lingua::LinkParser::MatchPath;

my $parser = Lingua::LinkParser->new;

# Initialization: pass in an existing parser ...
my $o = Lingua::LinkParser::MatchPath->new( parser => $parser );
# ... or let the module create one with basic settings.
# my $o = Lingua::LinkParser::MatchPath->new();

# Create matchers according to the given templates.
foreach my $t (@templates) {
    $o->create_matcher($t);
}

my $sentence = $parser->create_sentence($text);

# Start matching.
foreach my $m (@{ $o->matcher }) {
    print $m->template, $/;
    if ($m->match($sentence)) {
        print "Extracted: ", join( q/, /, $m->item(0, 1) ), $/;
    }
}
This module checks whether a linkage path exists in a linkage diagram generated by Lingua::LinkParser, which can help in parsing English text.
my $o = Lingua::LinkParser::MatchPath->new( parser => $parser );
You can pass in a Lingua::LinkParser object, or you can call new() with no arguments:
my $o = Lingua::LinkParser::MatchPath->new();
In that case a parser with basic settings is created for you.
$o->create_matcher($template);
This method creates matcher objects in the backend according to the given template.
foreach (@{$o->matcher()}) {
    # work with each matcher here
}
This method returns a reference to the list of matchers created with create_matcher(). Each matcher provides the following methods.
template()
It returns the template the matcher was created from.
match()
Once a matcher has been created, we can call match() to see whether it matches any path in the linkages of a sentence. The sentence can be either a sentence object
$parser = Lingua::LinkParser->new;
$matcher->match($parser->create_sentence($sentence));
or plain text, in which case a sentence object is built automatically.
$matcher->match($sentence);
item()
item() retrieves the link labels and words along the matched path. We can pass arguments specifying which items we would like to get. Indices count from 0.
@item = $matcher->item();         # retrieve all matched items
@item = $matcher->item(0, 2, 3);  # retrieve items 0, 2, 3
@item = $matcher->item(1, 3..5);  # retrieve items 1, 3, 4, 5
Please see below for detail.
The remaining part of this document will show us how to use the simple but powerful template language.
A template begins with a word and ends with a word.
Given the sentence 'Gunther sees Rachel.', here is the linkage diagram generated by the link parser.
    +-------------Xp------------+
    +---Wd---+---Ss--+--Os--+   |
    |        |       |      |   |
LEFT-WALL Gunther sees.v Rachel .
And now, the goal is to form a template to match the sentence and to extract the words on the linking path.
Consider a template like this:
Gunther <Ss> sees <Os> Rachel
   ^     ^    ^    ^     ^
   |     |    |    |     |
 WORD  LINK WORD LINK  WORD
In this example, the path matcher first locates Gunther and checks whether one of Gunther's links carries the label Ss. If it does, the matcher checks whether that Ss link leads to sees. The same step is then repeated for Os and Rachel, and so on, until the whole template has been matched or a step fails.
Here, this template will match the sentence successfully.
For the definitions of link labels, please go to http://www.link.cs.cmu.edu/link/dict/index.html
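The walk described above can be sketched in plain Perl. This is only an illustration of the idea, not the module's implementation: the triple representation of links, and the helper names bare_word and path_matches, are assumptions made for this example, and links are followed from left to right only.

```perl
use strict;
use warnings;

# Hypothetical representation: each link is [left word, label, right word].
# These are the links from the 'Gunther sees Rachel.' diagram.
my @links = (
    [ 'LEFT-WALL', 'Xp', '.'       ],
    [ 'LEFT-WALL', 'Wd', 'Gunther' ],
    [ 'Gunther',   'Ss', 'sees.v'  ],
    [ 'sees.v',    'Os', 'Rachel'  ],
);

# Strip a trailing class suffix such as '.v', so 'sees.v' compares equal to 'sees'.
sub bare_word {
    my ($word) = @_;
    $word =~ s/\.[a-z]$//;
    return $word;
}

# Walk a template given as (word, label, word, label, word, ...).
# Returns 1 if every step finds a link with that label leading to the next word.
sub path_matches {
    my ($links, @template) = @_;
    my $current = shift @template;
    while (@template) {
        my ($label, $next) = splice(@template, 0, 2);
        my $found = grep {
            bare_word($_->[0]) eq $current
                && $_->[1] eq $label
                && bare_word($_->[2]) eq $next
        } @$links;
        return 0 unless $found;
        $current = $next;    # advance to the next word on the path
    }
    return 1;
}

print path_matches(\@links, qw(Gunther Ss sees Os Rachel)) ? "match\n" : "no match\n";
print path_matches(\@links, qw(Gunther Os sees Ss Rachel)) ? "match\n" : "no match\n";
```

The first call follows Ss and then Os and succeeds; the second asks for the labels in the wrong order and fails at the first step.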
If we have two sentences:
Ross bites Monica.
and
Joey bites Monica too.
The diagrams are as follows respectively:
    +------------Xp-----------+
    +---Wd--+--Ss-+---Os--+   |
    |       |     |       |   |
LEFT-WALL Ross bites.v Monica .

    +--------------Xp-------------+
    |             +-----MVa----+  |
    +---Wd--+--Ss-+---Os--+    |  |
    |       |     |       |    |  |
LEFT-WALL Joey bites.v Monica too .
There is no need to build two templates:
Ross <Ss> bites <Os> Monica
Joey <Ss> bites <Os> Monica
Instead, we can combine the two into one using a regexp (regular expression):
/Ross|Joey/ <Ss> bites <Os> Monica
We can also add the case-insensitive modifier to a regexp.
/ROSS|JOEY/i <Ss> bites <Os> Monica
Regexps in templates are ordinary Perl regular expressions. For a tutorial, please see perlretut.
Sometimes we know that a word must belong to a certain class of words in order to satisfy the template. A POS (part-of-speech) tag can be used for that.
Given a linkage like this,
    +---------------------------Xp---------------------------+
    |              +----I*d---+------Osn------+              |
    +---Wd---+--Ss-+--N-+     +--K-+  +---Ds--+---Mp--+-Js+  |
    |        |     |    |     |    |  |       |       |   |  |
LEFT-WALL Monica did.v not blow.v up the apartment.n of Ross .
If we write a template like this,
/^Monica/ <I*d> blow <Osn> apartment <Mp> of <Js> Ross
then we will need a whole host of near-duplicate templates, and we will still miss many linkages that share the same structure.
Besides using regexps, we can also use POS tags to generalize our templates in this situation.
/^Monica/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> Ross
or even
/.+?/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> /.+?/
Supported tags are v for verb, a for adjective, d for determiner, p for pronoun, n for noun, etc.
The POS tags attached to words in the above diagram are auto-identified by LinkParser. This POS-tag feature of pathmatcher is only valid with identified classes.
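As a rough illustration of what "identified classes" means, the sketch below separates the class suffix that the parser attaches (as in blow.v or apartment.n) from the word itself, and checks a template item such as _v_ against it. The helper names split_tag and matches_pos are hypothetical and written only for this example; the module's own handling may differ.

```perl
use strict;
use warnings;

# Split a parser token into (word, class).
# The class is undef when the parser attached no suffix.
sub split_tag {
    my ($token) = @_;
    return $token =~ /^(.+)\.([a-z])$/ ? ($1, $2) : ($token, undef);
}

# Hypothetical check of a template item such as '_v_' against one token.
sub matches_pos {
    my ($token, $pos_item) = @_;
    my ($wanted) = $pos_item =~ /^_([a-z])_$/ or return 0;
    my (undef, $class) = split_tag($token);
    return ( defined($class) && $class eq $wanted ) ? 1 : 0;
}

for my $token (qw(Monica did.v blow.v apartment.n Ross)) {
    my ($word, $class) = split_tag($token);
    printf "%-12s word=%-10s class=%s\n",
        $token, $word, defined($class) ? $class : '(none)';
}
```

In this sketch Monica and Ross carry no suffix, so neither _n_ nor any other tag can match them; that is why the templates above keep a regexp or a literal word in those positions.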
Regexps can be used not only with words, but with link labels too.
Let's take the above template as an example.
If we change the template into this one,
/.+?/ </^I/> _v_ </^O/> _n_ </^M/> of </^J/> /.+?/
then the link labels with I, O, M, J as their first characters will be matched.
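To see which labels such patterns select, the plain-Perl sketch below applies them to the labels that appear in the diagram above. It uses only Perl's own regexp engine and does not call the module; the variable names are chosen for this example.

```perl
use strict;
use warnings;

# Link labels that appear in the 'Monica did not blow up ...' diagram.
my @labels = qw(Xp Wd Ss N I*d K Osn Ds Mp Js);

# Apply each label pattern from the template as an ordinary Perl regexp.
my %hits;
for my $src ('^I', '^O', '^M', '^J') {
    my $re = qr/$src/;
    $hits{$src} = [ grep { /$re/ } @labels ];
    print "/$src/ matches: @{ $hits{$src} }\n";
}
```

Each pattern picks out exactly one label of this diagram (I*d, Osn, Mp and Js respectively), but it would equally accept other labels with the same first letter in other linkages.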
Here we introduce the branching operator, with which we can write branching templates. It is designed to test several links that leave the same word. Without it, the matcher simply moves on to the next word and continues the matching process.
One common situation is negation. Here we use two simple sentences with opposite semantics to illustrate this situation:
someone is here
no one is here.
And their diagrams:
    +-----------Xp----------+
    +---Wd---+--Ss--+-Pp-+  |
    |        |      |    |  |
LEFT-WALL someone is.v here .

    +-----------Xp-----------+
    +-----Wd----+            |
    |       +-Ds+-Ss-+-Pp-+  |
    |       |   |    |    |  |
LEFT-WALL no.d one is.v here .
They are semantically opposite, but both fit a common structure:
/one$/ -> <Ss> -> is -> <Pp> -> here
If we merely write the template as
/one$/ <Ss> is <Pp> here
then both linkages will be matched despite their different semantics. This is not usually what we want. Usually we want to separate these two meanings, and that is why we introduce the branch label. With it, the problem is easily solved.
The branch label comes in positive and negative types.
A positive branch succeeds if the given branch leaves the current word; a negative branch succeeds if the given branch does NOT leave the current word.
Prefixing a label with # marks it as a positive branch, and prefixing it with ! marks it as a negative one.
Then now, we can write down our templates to match the two different semantics.
For the first case, we do not want to see <Ds> no in the diagram:
/one$/ !<Ds> no <Ss> is <Pp> here.
/one$/ !( <Ds> no ) <Ss> is <Pp> here.
For the second case, we must see a <Ds> no.
/one$/ #<Ds> no <Ss> is <Pp> here
/one$/ #( <Ds> no ) <Ss> is <Pp> here
Of course, the second template can also be written as
no <Ds> one <Ss> is <Pp> here
but that form does not demonstrate the branching operator, which is the point of this example.
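The difference between the positive and the negative branch can be sketched in plain Perl as follows. The link triples and the helper names bare, has_branch, positive and negative are assumptions made for this illustration; they model the semantics described above rather than the module's internals.

```perl
use strict;
use warnings;

# Hypothetical link lists: [left word, label, right word].
my @someone = (
    [ 'LEFT-WALL', 'Wd', 'someone' ],
    [ 'someone',   'Ss', 'is.v'    ],
    [ 'is.v',      'Pp', 'here'    ],
);
my @no_one = (
    [ 'LEFT-WALL', 'Wd', 'one'  ],
    [ 'no.d',      'Ds', 'one'  ],
    [ 'one',       'Ss', 'is.v' ],
    [ 'is.v',      'Pp', 'here' ],
);

# Strip a class suffix such as '.d' or '.v'.
sub bare { my ($w) = @_; $w =~ s/\.[a-z]$//; return $w }

# True if $word has a link labelled $label whose other end is $other,
# regardless of which side of the link $word is on.
sub has_branch {
    my ($links, $word, $label, $other) = @_;
    for my $link (@$links) {
        my ($left, $name, $right) = ( bare($link->[0]), $link->[1], bare($link->[2]) );
        next unless $name eq $label;
        return 1 if $left eq $word  && $right eq $other;
        return 1 if $left eq $other && $right eq $word;
    }
    return 0;
}

# '#<Ds> no' succeeds when the branch is present; '!<Ds> no' when it is absent.
sub positive { has_branch(@_) ? 1 : 0 }
sub negative { has_branch(@_) ? 0 : 1 }

for my $case ( [ 'someone is here', \@someone, 'someone' ],
               [ 'no one is here',  \@no_one,  'one'     ] ) {
    my ($text, $links, $word) = @$case;
    printf "%-16s #<Ds> no: %d   !<Ds> no: %d\n", $text,
        positive($links, $word, 'Ds', 'no'),
        negative($links, $word, 'Ds', 'no');
}
```

Only the second linkage has a Ds link between no and one, so the positive test succeeds there and the negative test succeeds for the first linkage.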
Another type of branching, called 'grouping', is introduced here, with which we can write optional paths for a template.
/John/ @( <Ss> _v_ | <AN> _a_ )
In this case, after successfully matching John, the matcher first tries to match <Ss> and _v_. If that fails, it tries <AN> and _a_. @ is used for grouping: it combines several alternative template paths into one. If parentheses appear without any prefix, @ is assumed.
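The try-in-order behaviour can be sketched like this. The links for a sentence such as 'John runs' are made up for the example, and first_alternative is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical links for a sentence such as 'John runs': [left, label, right].
my @links = (
    [ 'LEFT-WALL', 'Wd', 'John'   ],
    [ 'John',      'Ss', 'runs.v' ],
);

# Each alternative is a (label, POS class) pair, as in @( <Ss> _v_ | <AN> _a_ ).
my @alternatives = ( [ 'Ss', 'v' ], [ 'AN', 'a' ] );

# Return the index of the first alternative that has a matching link
# leaving $word, or -1 if none of them matches.
sub first_alternative {
    my ($links, $word, $alts) = @_;
    for my $i (0 .. $#$alts) {
        my ($label, $class) = @{ $alts->[$i] };
        for my $link (@$links) {
            next unless $link->[0] eq $word && $link->[1] eq $label;
            return $i if $link->[2] =~ /\.\Q$class\E$/;
        }
    }
    return -1;
}

my $chosen = first_alternative(\@links, 'John', \@alternatives);
print $chosen >= 0 ? "alternative $chosen matched\n" : "no alternative matched\n";
```

Because the alternatives are tried in the order they are written, the second one is only consulted when the first one fails.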
Another operator is designed to capture desired words. In one of the above examples,
/^Monica/ <I*d> _v_ <Osn> _n_ <Mp> of <Js> Ross
if we add % to some of the word templates, as in
%/^Monica/ <I*d> %_v_ <Osn> %_n_ <Mp> of <Js> Ross
and then call item(), the method will return 'blow' and 'apartment' for this example. This feature is useful for further processing.
Finer-grained word capturing can be done using (). In the above example, if we parenthesize Mon in Monica as
%/^(Mon)ica/ <I*d> %_v_ <Osn> %_n_ <Mp> of <Js> Ross
then, after a successful match, we can get Mon by calling $matcher->item(0).
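The parentheses behave like capture groups in any Perl regexp. The short sketch below shows only that underlying Perl behaviour on a plain string; it does not call the module.

```perl
use strict;
use warnings;

my $word = 'Monica';

# In list context a successful match returns the parenthesized groups.
my ($part) = $word =~ /^(Mon)ica/;

print defined($part) ? "captured: $part\n" : "no match\n";
```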
The grammar of the template language is listed below. The full grammar with semantic actions is in etc/Grammar.y
START         -> RULE END_OF_RULE;
RULE          -> WORD_PATTERN LINKS;
LINKS         -> LINK
               | LINK LINKS
               | PLINKS
               | LINKS OR LINKS
               | # PLINKS LINKS
               | @ PLINKS LINKS
               | ! PLINKS LINKS
               | _EPSILON_;
PLINKS        -> ( LINKS );
LINK          -> LABEL_PATTERN WORD_PATTERN;
LABEL_PATTERN -> LABEL
               | LABEL_REGULAR_EXPRESSION;
WORD_PATTERN  -> WORD_ATOM
               | % WORD_ATOM;
WORD_ATOM     -> WORD
               | WORD_REGULAR_EXPRESSION
               | POS_TAG
               | ! WORD
               | ! WORD_REGULAR_EXPRESSION
               | ! POS_TAG;
The module cannot handle isolated linkages yet, but patches are always welcome. I also need to clean up some parts of the code, and the interface is still rough for now.
Lingua::LinkParser
Copyright (C) 2004 by Yung-chung Lin (a.k.a. xern) <xern@cpan.org>
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.