NAME

Treex::Tool::Parser::MSTperl::FeaturesControl

VERSION

version 0.11949

DESCRIPTION

Controls the features used in the model.

Features

TODO: outdated, superseded by use of config file -> rewrite

Each feature has the form code:value. The code describes the information relevant to the feature, and the value is the information retained from the dependency edge (and possibly from other parts of the sentence, see Treex::Tool::Parser::MSTperl::Sentence, stored in the sentence field).

For example, the feature L|l:být|pes means that the lemma of the parent node (the governing word) is "být" and the lemma of its child node (the dependent node) is "pes".

Each (proper) feature is composed of several simple features. In the aforementioned example, the simple feature codes were L and l and their values "být" and "pes", respectively. Each simple feature code is a string (case sensitive) and its value is also a string. The simple feature codes are joined together by the | sign to form the code of the proper feature, and similarly, the simple feature values joined by | form the proper feature value. Then, the proper feature code and value are joined together by :. (Therefore, the codes and values of the simple features must not contain the | and the : signs.)
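As a rough illustration of the composition rule above, the following sketch joins simple feature codes and values into a proper feature string. The compose_feature helper is hypothetical and is not part of this module; it only mirrors the rule described here.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper (not part of the module): joins simple feature
# codes and values into a proper feature of the form "code:value".
sub compose_feature {
    my ($codes, $values) = @_;

    # The separators must not appear inside any code or value.
    for my $part (@$codes, @$values) {
        die "simple feature parts must not contain '|' or ':'\n"
            if $part =~ /[|:]/;
    }
    return join('|', @$codes) . ':' . join('|', @$values);
}

# Parent lemma "být", child lemma "pes".
my $feature = compose_feature(['L', 'l'], ['být', 'pes']);
# $feature is 'L|l:být|pes'
```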

By a naming convention, if the same simple feature can be computed for both the parent node and its child node, their codes are the same but for the case, which is upper for the parent and lower for the child. If this is not applicable, an uppercase code is used.

For efficiency, the simple feature codes are translated to integers (see simple_feature_codes).

In reality the feature codes are translated to integers as well (see feature_codes), but this is only an internal issue. You can see these numbers in the model file if you use the default Data::Dumper format (see load and store). However, if you use the tsv format (see load_tsv, store_tsv), you will see the real string feature codes.

Currently the following simple features are available. Any subset of them can be used to form a proper feature, but their order should follow their order of appearance in this list (still, this is only a cleanliness and readability thing, it does not affect the function of the parser in any way).

Distance (D)

Distance of the two nodes in the sentence, computed as order of the parent minus the order of the child. Eg. for the sentence "To je prima pes ." and the feature D computed on nodes "je" and "pes" (parent and child respectively), the order of "je" is 2 and the order of "pes" is 4, yielding the feature value of 2 - 4 = -2. This leads to a feature D:-2.

Form (F, f)

The form of the node, i.e. the word exactly as it appears in the sentence text.

Currently not used, as it has not led to any improvement in parsing.

Lemma (L, l)

The morphological lemma of the node.

Preceding tag (S, s)

The morphological tag (or POS tag if you like) of the node preceding (ord-wise) the node.

Tag (T, t)

The morphological tag of the node.

Following tag (U, u)

The morphological tag of the node following (ord-wise) the node.

Between tag (B)

The morphological tag of each node between (ord-wise) the parent node and the child node. This simple feature returns (a reference to) an array of values.

Some of the simple features can return an empty string when they are not applicable (eg. U for the last node in the sentence); in that case the whole feature is not present for the edge.

Some of the simple features return an array of values (eg. the B simple feature). As a result, several instances of the feature with the same code can appear in the result for one edge.
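To make the array case concrete, here is a minimal sketch of a between-tag computation. The tags_between helper and the one-letter tags are made up for illustration; the sketch assumes 1-based ords and that "between" excludes both end nodes, which may differ from the actual feature_between implementation.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's feature_between.
# Assumes 1-based ords and that "between" excludes both end nodes.
sub tags_between {
    my ($tags, $parent_ord, $child_ord) = @_;
    my ($from, $to) = sort { $a <=> $b } ($parent_ord, $child_ord);
    return [ map { $tags->[ $_ - 1 ] } ($from + 1 .. $to - 1) ];
}

# Made-up one-letter tags for "To je prima pes ."
my @tags = ('P', 'V', 'A', 'N', 'Z');

my $far      = tags_between(\@tags, 5, 1);    # ['V', 'A', 'N']
my $adjacent = tags_between(\@tags, 2, 3);    # [] (no node in between)

# Each returned tag yields its own instance of a feature such as T|B|t
# (parent tag 'Z', child tag 'P').
my @instances = map { "T|B|t:Z|$_|P" } @$far;
```

With an empty array, as for the adjacent nodes, no instance of the feature is produced at all.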

FIELDS

Features

TODO: slightly outdated

The examples used here are consistent throughout this part of the documentation, i.e. if several simple features are listed in simple_feature_codes and the simple feature with index 9 is then referred to in array_simple_features, it really means the B simple feature, which is at index 9 (counting from 0) in simple_feature_codes.

feature_count (Int)

Alias of scalar @{feature_codes} (but the integer is really stored in the field for faster access).

feature_codes (ArrayRef[Str])

Codes of all features to be computed. Their indexes in this array are used to refer to them in the code. Eg.:

 feature_codes ( [( 'L|T', 'l|t', 'L|T|l|t', 'T|B|t')] )

feature_codes_hash (HashRef[Str])

1 for each feature code to easily check if a feature exists

feature_indexes (HashRef[Str])

Index of each feature code in feature_codes (for conversion of feature code to feature index)

feature_simple_features_indexes (ArrayRef[ArrayRef[Int]])

For each feature contains (a reference to) an array which contains all its simple feature indexes (corresponding to positions in simple_feature_codes ). Eg. for the 4 features (0 to 3) listed in feature_codes and the 10 simple features listed in simple_feature_codes (0 to 9):

 feature_simple_features_indexes ( [(
   [ (1, 5) ],
   [ (2, 6) ],
   [ (1, 5, 2, 6) ],
   [ (5, 9, 6) ],
 )] )

array_features (HashRef)

Indexes of features containing array simple features (see array_simple_features). Eg.:

 array_features( { 3 => 1} )

as the feature with index 3 ('T|B|t') contains the B simple feature which is an array simple feature.

Simple features

simple_feature_count (Int)

Alias of scalar @{simple_feature_codes} (but the integer is really stored in the field for faster access).

simple_feature_codes (ArrayRef[Str])

Codes of all simple features to be computed. Their order is important as their indexes in this array are used to refer to them in the code, especially in the get_simple_feature method. Eg.:

 simple_feature_codes ( [('D', 'L', 'l', 'S', 's', 'T', 't', 'U', 'u', 'B')])

simple_feature_codes_hash (HashRef[Str])

1 for each simple feature code to easily check if a simple feature exists

simple_feature_indexes (HashRef[Str])

Index of each simple feature code in simple_feature_codes (for conversion of simple feature code to simple feature index)

simple_feature_sub_arguments (ArrayRef)

For each simple feature (on the corresponding index) contains the index of the field (in field_names), which is used to compute the simple feature value (together with a subroutine from simple_feature_subs).

If the simple feature takes more than one argument (called a multiarg feature here), then instead of a single field index there is a reference to an array of field indexes.

If the simple feature takes arguments other than fields (especially integers), then these arguments are stored here instead of field indexes.

simple_feature_subs (ArrayRef)

For speed, the simple features are not represented internally by their string codes, which would have to be parsed repeatedly. Instead, each code is parsed only once (in set_simple_feature) and the simple feature is represented by two things: the integer index of the field used to compute it (the actual index of the field in the input file line, accessible through "fields" in Treex::Tool::Parser::MSTperl::Node) and a reference to a subroutine (one of the feature_* subs, see below) that computes the feature value from the field index and the edge (Treex::Tool::Parser::MSTperl::Edge). The referenced subroutine is then invoked in get_simple_feature_values_array.

array_simple_features (HashRef[Int])

Indexes of simple features that return an array of values instead of a single string value. Eg.:

 array_simple_features( { 9 => 1} )

because in the aforementioned example the B simple feature returns an array of values and has the index 9.
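The lookup fields described in this part can all be derived from feature_codes, simple_feature_codes and array_simple_features. The following sketch reproduces the example values used throughout; it uses plain lexical variables rather than the object's fields and is only an illustration, not the module's own set_feature code.

```perl
use strict;
use warnings;

my @simple_feature_codes  = ('D', 'L', 'l', 'S', 's', 'T', 't', 'U', 'u', 'B');
my @feature_codes         = ('L|T', 'l|t', 'L|T|l|t', 'T|B|t');
my %array_simple_features = (9 => 1);    # B returns an array of values

# Map each simple feature code to its index in @simple_feature_codes.
my %simple_feature_indexes;
@simple_feature_indexes{@simple_feature_codes} = (0 .. $#simple_feature_codes);

my (%feature_indexes, @feature_simple_features_indexes, %array_features);
for my $i (0 .. $#feature_codes) {
    my $code = $feature_codes[$i];
    $feature_indexes{$code} = $i;

    # Split the proper feature code on '|' and look up each simple feature.
    my @indexes = map { $simple_feature_indexes{$_} } split /\|/, $code;
    $feature_simple_features_indexes[$i] = \@indexes;

    # A feature is an array feature if any of its simple features is one.
    $array_features{$i} = 1 if grep { $array_simple_features{$_} } @indexes;
}

# $feature_simple_features_indexes[3] is [5, 9, 6]
# %array_features is (3 => 1)
```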

Other

edge_features_cache (HashRef[ArrayRef[Str]])

If caching is turned on (see below), all features of any edge computed by the get_feature_simple_features_indexes method are computed once only, stored in this cache and then retrieved when needed.

The key of the hash is the edge signature (see "signature" in Treex::Tool::Parser::MSTperl::Edge), the value is (a reference to) an array of features and their values.
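The caching idea can be sketched as a hash keyed by the edge signature. Everything here is a stand-in: the edge is a plain hash, the signature format '2|4' is invented, and compute_features and get_all_features_cached are hypothetical names, so this shows the pattern only, not the module's implementation.

```perl
use strict;
use warnings;

my %edge_features_cache;
my $computations = 0;

# Hypothetical stand-in for the real (expensive) feature computation.
sub compute_features {
    my ($edge) = @_;
    $computations++;
    return ["T|t:$edge->{parent_tag}|$edge->{child_tag}"];
}

# Look the features up by edge signature; compute and store them on a miss.
sub get_all_features_cached {
    my ($edge) = @_;
    my $signature = $edge->{signature};
    $edge_features_cache{$signature} //= compute_features($edge);
    return $edge_features_cache{$signature};
}

# The edge is a plain hash here; its signature format is made up.
my $edge = { signature => '2|4', parent_tag => 'V', child_tag => 'N' };

my $first  = get_all_features_cached($edge);
my $second = get_all_features_cached($edge);
# The second call is served from the cache, so $computations stays at 1.
```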

METHODS

Settings

The best source of information about all the possible settings is the configuration file itself (usually called config.txt), as it is richly commented and accompanied by real examples at the same time.

my $featuresControl = Treex::Tool::Parser::MSTperl::FeaturesControl->new( 'config' => $config, 'feature_codes_from_config' => $feature_codes_array_reference, 'use_edge_features_cache' => $use_edge_features_cache, )

Parses feature codes and creates their in-memory representations.

set_feature ($feature_code)

Parses the feature code and (if no errors are encountered) creates its representation in the fields of this package (all feature_* fields and possibly also the array_features field).

set_simple_feature ($simple_feature_code)

Parses the simple feature code and creates its representation in the fields of this package (all simple_feature_* fields and possibly also the array_simple_features field).

Computing (proper) features

my $features_array_rf = $model->get_all_features($edge)

Returns (a reference to) an array which contains all features of the edge (according to settings).

If caching is turned on, tries to look the features up in the cache before computing them. If they are not cached yet, they are computed and stored into the cache.

The value of a feature is computed by get_feature_value. Values of simple features are precomputed (by calling get_simple_feature_values_array) and passed to the get_feature_value method.

my $feature_value = get_feature_value(3, $simple_feature_values)

Returns the value of the feature with the given index.

If it is an array feature (see array_features), its value is (a reference to) an array of all (string) values of the feature (a reference to an empty array if there are no values).

If it is not an array feature, its value is composed from the simple feature values. If some of the simple features do not have a value defined, an empty string ('') is returned.
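A minimal sketch of the non-array case follows. The compose_feature_value helper is hypothetical, and treating both undef and '' as "no value" is an assumption made for this example; the tag values are made up.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper illustrating the non-array case; not the module's code.
sub compose_feature_value {
    my ($simple_features_indexes, $simple_feature_values) = @_;
    my @parts;
    for my $index (@$simple_features_indexes) {
        my $value = $simple_feature_values->[$index];

        # Assumption: undef and '' both mean "no value".
        return '' if !defined($value) || $value eq '';
        push @parts, $value;
    }
    return join '|', @parts;
}

# Values indexed as in the simple_feature_codes example:
# D, L, l, S, s, T, t, U, u, B. The tags are made up, and u is
# left empty on purpose to show a missing value.
my @simple_feature_values = (-2, 'být', 'pes', 'P', 'A', 'V', 'N', 'A', '', ['A']);

# Feature 'L|T|l|t' uses the simple features 1, 5, 2 and 6.
my $value_LTlt = compose_feature_value([1, 5, 2, 6], \@simple_feature_values);
# 'být|V|pes|N'

# Feature 'T|u' would use 5 and 8; u is empty, so the result is ''.
my $value_Tu = compose_feature_value([5, 8], \@simple_feature_values);
```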

my $feature_value = get_array_feature_value ($simple_features_indexes, $simple_feature_values, $start_from)

Recursively calls itself to compose an array of all values of the feature (composed of the simple features given in $simple_features_indexes array reference), which is a cartesian product on all values of the simple features. The $start_from variable should be 0 when this method is called and is incremented in the recursive calls.
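The recursion can be sketched as below. This is an independent illustration of a cartesian product over simple feature values, written under the assumption that a scalar value counts as a one-element list and that empty or undefined values contribute nothing; array_feature_values is a hypothetical name and the real method may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical sketch: returns a reference to an array of combinations,
# each combination being a reference to an array of simple feature values.
sub array_feature_values {
    my ($indexes, $values, $start_from) = @_;

    # Past the last simple feature: one empty combination to extend.
    return [ [] ] if $start_from > $#{$indexes};

    # Treat a scalar value as a one-element list; drop missing values.
    my $value     = $values->[ $indexes->[$start_from] ];
    my @my_values = ref($value) eq 'ARRAY' ? @$value : ($value);
    @my_values = grep { defined($_) && $_ ne '' } @my_values;

    # Combine each value of this simple feature with every tail.
    my $tails = array_feature_values($indexes, $values, $start_from + 1);
    my @combinations;
    for my $head (@my_values) {
        push @combinations, [ $head, @$_ ] for @$tails;
    }
    return \@combinations;
}

# Made-up values: T (index 5) is 'V', t (index 6) is 'N',
# and B (index 9) has two tags.
my @values;
@values[5, 6, 9] = ('V', 'N', ['A', 'D']);

# Feature 'T|B|t' uses the simple features 5, 9 and 6.
my @feature_values =
    map { join '|', @$_ } @{ array_feature_values([5, 9, 6], \@values, 0) };
# ('V|A|N', 'V|D|N')
```

If any simple feature has no values (for instance, B for adjacent nodes), the product is empty and so is the returned array.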

Computing simple features

my $simple_feature_values = get_simple_feature_values_array($edge)

Returns (a reference to) an array of values of all simple features (see simple_feature_codes). For each simple feature, its value can be found on the position in the returned array corresponding to its position in simple_feature_codes.

my $sub = get_simple_feature_sub_reference ('distance')

Translates the feature function string name (eg. distance) to its reference (eg. \&feature_distance).
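One common way to implement such a translation is a lookup hash from names to code references. The sketch below assumes that approach; the two stub subs and the error message are placeholders, and the module's actual mechanism is not shown in this documentation.

```perl
use strict;
use warnings;

# Placeholder stubs standing in for the real feature_* subs.
sub feature_distance { return 'distance stub' }
sub feature_between  { return 'between stub' }

# Lookup table from function name to subroutine reference (illustrative,
# covering only two of the feature functions).
my %simple_feature_sub_for = (
    distance => \&feature_distance,
    between  => \&feature_between,
);

sub get_simple_feature_sub_reference {
    my ($name) = @_;
    my $sub = $simple_feature_sub_for{$name}
        or die "Unknown feature function '$name'\n";
    return $sub;
}

my $sub    = get_simple_feature_sub_reference('distance');
my $result = $sub->();    # calls feature_distance
```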

my $value = get_simple_feature_value ($edge, 9)

Returns the value of the simple feature with the given index by calling an appropriate feature_* method on the edge (see Treex::Tool::Parser::MSTperl::Edge). If the feature cannot be computed, an empty string ('') is returned (or a reference to an empty array for array simple features - see array_simple_features).
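The dispatch by index can be pictured with two parallel arrays, one holding subroutine references and one holding their arguments. In this sketch the edge is a plain hash, the subs take ($edge, $field_index) and the field layout (form, lemma, tag) is invented; the real subs work with Treex::Tool::Parser::MSTperl::Edge objects and may have a different signature.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical edge: plain hashes whose 'fields' arrays mimic input columns
# (here index 0 = form, 1 = lemma, 2 = tag).
my $edge = {
    parent => { fields => ['je',  'být', 'V'] },
    child  => { fields => ['pes', 'pes', 'N'] },
};

# Illustrative stand-ins for two of the feature_* subs.
sub feature_parent {
    my ($edge, $field_index) = @_;
    return $edge->{parent}{fields}[$field_index];
}

sub feature_child {
    my ($edge, $field_index) = @_;
    return $edge->{child}{fields}[$field_index];
}

# Parallel arrays, indexed like simple_feature_codes.
my @simple_feature_codes = ('L', 'l', 'T', 't');
my @simple_feature_subs =
    (\&feature_parent, \&feature_child, \&feature_parent, \&feature_child);
my @simple_feature_sub_arguments = (1, 1, 2, 2);

# Call the sub stored for the given index with its stored argument.
sub get_simple_feature_value {
    my ($edge, $index) = @_;
    my $sub = $simple_feature_subs[$index];
    return $sub->($edge, $simple_feature_sub_arguments[$index]);
}

my @values =
    map { get_simple_feature_value($edge, $_) } 0 .. $#simple_feature_codes;
# ('být', 'pes', 'V', 'N')
```

Storing the parsed pieces once keeps the per-edge work down to an array lookup and a subroutine call.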

feature_distance
feature_child
feature_parent
feature_first
feature_second
feature_preceding_child
feature_preceding_parent
feature_following_child
feature_following_parent
feature_preceding_first
feature_preceding_second
feature_following_first
feature_following_second
feature_between
feature_foreach
feature_equals, feature_equals_pc, feature_equals_pc_at

A simple feature function equals(field_1,field_2) with "at least once" semantics for multiple values (there can be multiple alignments) and a special output value if one of the fields is unknown. (It may suffice to emit an undef, as this would occur iff at least one of the arguments is undef; if not, a value such as "-1" should be given.)

This makes it possible to have a simple feature which behaves like this:

returns 1 if the edge between child and parent is also present in the English tree
returns 0 if not
returns -1 if cannot decide (alignment info is missing for some of the nodes)

This works because, if the parser has (the ord of the en child node and) the ord of the en child's parent and the ord of the en parent node (and the ord of the en parent's parent), the feature can check whether en_parent->ord = en_child->parentOrd:

equalspc(en-ord, en->parent->ord)
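The "at least once" semantics with the 1 / 0 / -1 outcomes can be sketched as follows. The representation is an assumption: multiple values are passed as array references, and an undefined or empty argument stands for "unknown". The equals_at_least_once name is hypothetical and the real feature_equals subs may handle these cases differently.

```perl
use strict;
use warnings;

# Hypothetical sketch of "at least once" equality with three outcomes:
#   1  some value of the first argument equals some value of the second
#   0  both arguments are known but share no value
#  -1  at least one argument is unknown (undef or empty here, by assumption)
sub equals_at_least_once {
    my ($values_1, $values_2) = @_;
    return -1 if !defined($values_1) || !defined($values_2);
    return -1 if !@$values_1 || !@$values_2;

    my %seen    = map { $_ => 1 } @$values_1;
    my $matches = grep { $seen{$_} } @$values_2;
    return $matches ? 1 : 0;
}

# Ords of the aligned en parent node(s) vs. parent ords of the en child node(s).
my $edge_present = equals_at_least_once([3, 7], [7]);    # 1
my $edge_absent  = equals_at_least_once([3],    [5]);    # 0
my $undecided    = equals_at_least_once(undef,  [5]);    # -1
```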

AUTHORS

Rudolf Rosa <rosa@ufal.mff.cuni.cz>

COPYRIGHT AND LICENSE

Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.