Treex::Tool::Parser::MSTperl::ModelLabelling
version 0.08055
This is an in-memory representation of a labelling model, extended from Treex::Tool::Parser::MSTperl::ModelBase.
Fields inherited from Treex::Tool::Parser::MSTperl::ModelBase.
Instance of Treex::Tool::Parser::MSTperl::Config containing settings to be used for the model.
Currently the settings most relevant to the model are the following:
See "EM_EPSILON" in Treex::Tool::Parser::MSTperl::Config.
See "labeller_algorithm" in Treex::Tool::Parser::MSTperl::Config.
See "labelledFeaturesControl" in Treex::Tool::Parser::MSTperl::Config.
See "SEQUENCE_BOUNDARY_LABEL" in Treex::Tool::Parser::MSTperl::Config.
Provides access to labeller features, especially enabling their computation. Instance of Treex::Tool::Parser::MSTperl::FeaturesControl.
Emission scores for Viterbi. They follow the edge-based factorization and provide scores for various labels for an edge based on its features.
The structure is:
emissions->{feature}->{label} = score
Scores may or may not be probabilities, based on the algorithm used. Also based on the algorithm they may be MIRA-computed or they might be obtained by standard MLE.
Transition scores for Viterbi. They follow the first-order Markov chain edge-based factorization and provide scores for various labels for an edge, always based on the previous edge label and possibly also on the features of the edge.
Scores may or may not be probabilities, based on the algorithm used. Also based on the algorithm they may be obtained by standard MLE or they might be MIRA-computed.
transitions->{label_prev}->{label_this} = prob
or
transitions->{feature}->{label_prev}->{label_this} = score
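The nested-hash layouts described above can be illustrated with a small standalone sketch. It does not use the module itself: the feature strings, labels, scores and the lookup helper are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical toy data; feature strings, labels and scores are invented.
my %emissions = (
    'POS|NOUN' => { Subj => 0.6, Obj => 0.4 },
    'POS|ADJ'  => { Atr  => 1.0 },
);

# Variant 1: transitions keyed by the previous label only.
my %transitions = (
    Subj => { Obj => 0.7, Atr => 0.3 },
);

# Variant 2: transitions keyed by a feature, then by the previous label.
my %feature_transitions = (
    'POS|NOUN' => { Subj => { Obj => 1.5, Atr => -0.5 } },
);

# Hypothetical helper: walks the nested hash and returns 0 for unseen keys
# without autovivifying them.
sub lookup {
    my ( $hash, @keys ) = @_;
    for my $key (@keys) {
        return 0 unless ref($hash) eq 'HASH' && exists $hash->{$key};
        $hash = $hash->{$key};
    }
    return $hash;
}

print lookup( \%emissions,           'POS|NOUN', 'Subj' ), "\n";
print lookup( \%transitions,         'Subj',     'Obj' ),  "\n";
print lookup( \%feature_transitions, 'POS|NOUN', 'Subj', 'Atr' ), "\n";
print lookup( \%emissions,           'POS|VERB', 'Subj' ), "\n";
```

Returning 0 for unseen keys is a choice made for this sketch only; the module may treat unseen features or labels differently.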
In some algorithms linear combination smoothing is used for transition probabilities. The resulting transition probability is then obtained as:
PROB(label|prev_label) = smooth_bigrams * transitions->{prev_label}->{label} + smooth_unigrams * unigrams->{label} + smooth_uniform * uniform_prob
The actual smoothing parameters, computed by the EM algorithm. Each of them is between 0 and 1 and together they sum up to 1.
Uniform probability of a label, computed as 1 / ( keys %{ $self->unigrams } ).
Set in compute_smoothing_params.
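As a rough illustration of linear combination smoothing, the following standalone sketch mixes bigram, unigram and uniform probabilities. It assumes that the uniform term is weighted as smooth_uniform * uniform_prob; the toy numbers and the function name smoothed_transition_prob are invented here and are not part of the module.

```perl
use strict;
use warnings;

# Toy model; all numbers are invented for illustration.
my %model = (
    unigrams        => { Subj => 0.5, Obj => 0.25, Atr => 0.25 },
    transitions     => { Subj => { Obj => 0.75, Atr => 0.25 } },
    smooth_bigrams  => 0.5,
    smooth_unigrams => 0.25,
    smooth_uniform  => 0.25,
);

# uniform_prob = 1 / number of known labels
$model{uniform_prob} = 1 / scalar keys %{ $model{unigrams} };

# Hypothetical helper: linear interpolation of the three estimates.
sub smoothed_transition_prob {
    my ( $m, $label_prev, $label ) = @_;
    my $from_prev = $m->{transitions}{$label_prev} || {};
    my $bigram    = $from_prev->{$label}     // 0;
    my $unigram   = $m->{unigrams}{$label}   // 0;
    return $m->{smooth_bigrams} * $bigram
        + $m->{smooth_unigrams} * $unigram
        + $m->{smooth_uniform} * $m->{uniform_prob};
}

for my $label ( sort keys %{ $model{unigrams} } ) {
    printf "P(%s|Subj) = %.4f\n", $label,
        smoothed_transition_prob( \%model, 'Subj', $label );
}
```

Because the three weights sum to 1 and each component is a probability distribution over labels, the smoothed values for a seen previous label again sum to 1 in this sketch.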
Basic MLE from data, the structure is
unigrams->{label} = prob
To be used for transition smoothing and/or backoff (can be used both for emissions and transitions). It also contains the SEQUENCE_BOUNDARY_LABEL probability (the SEQUENCE_BOUNDARY_LABEL is counted once for each sequence), which might be inappropriate in some cases (e.g. for emission probabilities).
Just an array ref with the sentences that represent the heldout data to be able to run the EM algorithm in prepare_for_mira(). Used only in training.
Subroutines inherited from Treex::Tool::Parser::MSTperl::ModelBase.
See "store" in Treex::Tool::Parser::MSTperl::ModelBase.
See "store_tsv" in Treex::Tool::Parser::MSTperl::ModelBase.
See "load" in Treex::Tool::Parser::MSTperl::ModelBase.
See "load_tsv" in Treex::Tool::Parser::MSTperl::ModelBase.
Subroutines overriding stubs in Treex::Tool::Parser::MSTperl::ModelBase.
Returns the model data, containing the following fields: unigrams, transitions, emissions, smooth_uniform, smooth_unigrams, smooth_bigrams, uniform_prob
Tries to get all necessary data from $data (see get_data_to_store to see what data are stored). Also does basic checks on the data, e.g. for non-emptiness, but nothing sophisticated. Is algorithm-sensitive, i.e. if some data are not needed for the algorithm used, they do not have to be contained in the hash.
Called after preprocessing training data, before entering the MIRA phase.
Function varies depending on the algorithm used. Usually converts the counts stored in emissions, transitions and unigrams (accumulated by add_emission, add_transition and add_unigram) into probabilities. Also calls compute_smoothing_params to estimate the smoothing parameters for transition probabilities.
Only to provide information about the model. Returns number of features in the model (where a "feature" can stand for various things depending on the algorithm used).
my $model = Treex::Tool::Parser::MSTperl::ModelLabelling->new( config => $config, );
Creates an empty model. If you are training the model, this is probably what you want, otherwise you can use load or load_tsv to load an existing labelling model from a file.
However, most often you would probably use a model for a labeller (Treex::Tool::Parser::MSTperl::Labeller) or a labelling trainer (Treex::Tool::Parser::MSTperl::TrainerLabelling) which both automatically create the model on build. The labeller also provides wrapping methods "load_model" in Treex::Tool::Parser::MSTperl::Labeller and "load_model_tsv" in Treex::Tool::Parser::MSTperl::Labeller which you can call to load the model from a file. (Btw. as you might expect, the trainer provides methods "store_model" in Treex::Tool::Parser::MSTperl::TrainerLabelling and "store_model_tsv" in Treex::Tool::Parser::MSTperl::TrainerLabelling.)
emissions and transitions can be either MIRA-trained or estimated directly from training data using MLE (Maximum Likelihood Estimate). unigrams are always estimated by MLE.
Increment count for the label in unigrams.
Increment count for the transition in transitions, possibly including a feature on "this" edge if the algorithm uses features with transitions.
Increment count for this label on an edge with this feature in emissions.
Takes a hash reference with label counts and changes the counts to probabilities (this is the actual MLE). May be called in prepare_for_mira on emissions, transitions and unigrams.
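The count-then-normalize idea behind the MLE can be sketched in a few standalone lines. This is an illustration only: the helper name probs_from_counts and the toy label sequences are invented, and the sketch ignores the SEQUENCE_BOUNDARY_LABEL and any features on transitions.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces the counts in a { label => count } hash
# by relative frequencies, in place.
sub probs_from_counts {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    return $counts if $total == 0;    # nothing to normalize
    $_ /= $total for values %$counts; # values() aliases the hash values
    return $counts;
}

# Collect unigram and label-to-label transition counts from toy sequences.
my ( %unigrams, %transitions );
for my $sequence ( [qw(Subj Obj Atr)], [qw(Subj Obj)] ) {
    my $label_prev;
    for my $label (@$sequence) {
        $unigrams{$label}++;
        $transitions{$label_prev}{$label}++ if defined $label_prev;
        $label_prev = $label;
    }
}

# Turn the counts into probabilities.
probs_from_counts( \%unigrams );
probs_from_counts($_) for values %transitions;

printf "P(%s) = %.2f\n", $_, $unigrams{$_} for sort keys %unigrams;
```

Each inner transitions hash is normalized separately, so the probabilities of all labels following a given previous label sum to 1.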
The main method containing an implementation of the Expectation Maximization algorithm to compute the smoothing parameters (smooth_bigrams, smooth_unigrams, smooth_uniform) used to smooth transition probabilities by a linear combination of bigram, unigram and uniform probability. Iteratively tries to find parameters such that the probabilities from the training data (transitions, unigrams and uniform_prob), combined using the smoothing parameters, model the heldout data (EM_heldout_data) well enough, i.e. tries to maximize the probability of the heldout data given the training data probabilities by adjusting the values of the smoothing parameters.
Uses EM_EPSILON as a stopping criterion, i.e. stops when the sum of the absolute values of the changes to all smoothing parameters is lower than the value of EM_EPSILON.
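The following standalone sketch shows the textbook EM procedure for estimating linear interpolation weights, with the stopping criterion described above. It is not the module's code: the function name em_smoothing_weights, the input format (one [uniform, unigram, bigram] triple per heldout transition), the starting weights and the iteration cap are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical helper. $heldout_probs is an array ref of triples
# [ uniform_prob, unigram_prob, bigram_prob ], one per heldout transition.
# Returns the weights in the same order (uniform, unigram, bigram).
sub em_smoothing_weights {
    my ( $heldout_probs, $epsilon, $max_iterations ) = @_;
    $max_iterations //= 1000;             # safety cap, an assumption
    my @lambda = ( 1 / 3, 1 / 3, 1 / 3 ); # start from equal weights

    for ( 1 .. $max_iterations ) {
        # E-step: expected share of each component on the heldout data.
        my @expected = ( 0, 0, 0 );
        for my $probs (@$heldout_probs) {
            my $mix = 0;
            $mix += $lambda[$_] * $probs->[$_] for 0 .. 2;
            next if $mix == 0;
            $expected[$_] += $lambda[$_] * $probs->[$_] / $mix for 0 .. 2;
        }
        my $total = 0;
        $total += $_ for @expected;
        last if $total == 0;

        # M-step: renormalize, tracking the summed absolute change.
        my $change = 0;
        for my $k ( 0 .. 2 ) {
            my $new = $expected[$k] / $total;
            $change += abs( $new - $lambda[$k] );
            $lambda[$k] = $new;
        }
        last if $change < $epsilon;
    }
    return @lambda;
}

my @heldout = ( [ 1 / 3, 0.5, 0.75 ], [ 1 / 3, 0.25, 0.25 ], [ 1 / 3, 0.25, 0 ] );
my @weights = em_smoothing_weights( \@heldout, 1e-6 );
printf "uniform=%.3f unigrams=%.3f bigrams=%.3f\n", @weights;
```

The weights are renormalized in every iteration, so they stay between 0 and 1 and sum to 1 regardless of when the loop stops.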
Support methods to compute_smoothing_params, in the order in which they call each other.
A bunch of methods to score the likelihood of a label being assigned to an edge based on the edge's features and the label assigned to the previous edge.
Returns (a reference to) an array of all labels found in the training data (e.g. ['Subj', 'Obj', 'Atr']).
Computes a score of assigning the given label to an edge, given the features of the edge and the label assigned to the previous edge.
A higher score always means a more likely label for the edge. Some algorithms may give a negative score.
Is semantically equivalent to calling get_emission_score and get_transition_score and then combining the two scores in an algorithm-dependent way.
Computes the "emission score" of assigning the given label to an edge, given one of the features of the edge and disregarding the label assigned to the previous edge.
Computes the "transition score" of assigning the given label to an edge, given the label assigned to the previous edge and possibly also one of the features of the edge but NOT including the emission score returned by get_emission_score.
Returns (a reference to) an array of the probabilities of the transition from label_prev to label_this (to be smoothed together), having the following structure:
$result->[0] = uniform prob
$result->[1] = unigram prob
$result->[2] = bigram prob
Get scores of assigning each of the possible labels to an edge based on all the features of the edge. Is semantically equivalent to doing:
foreach label
    foreach feature
        get_emission_score(label, feature)
$result->{label} = score
Actually only serves as a switch for several implementations of the method (get_emission_scores_basic_MIRA and get_emission_scores_no_MIRA); the method to be used is selected based on the algorithm being used.
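One simple way to picture the per-label aggregation is to sum the score of each feature for each label, as in the standalone sketch below. Summation is an assumption made for illustration; the documentation does not say how the individual implementations combine the per-feature scores. The function name emission_scores_sum and the toy data are invented.

```perl
use strict;
use warnings;

# Toy emission scores, emissions->{feature}->{label} = score (invented values).
my %emissions = (
    'POS|NOUN'    => { Subj => 1.5, Obj => 0.5 },
    'PARENT|VERB' => { Subj => 0.5, Obj => 1.0, Atr => -1.0 },
);

# Hypothetical helper: for each label, add up the scores of all given
# features; features not present in the model are skipped.
sub emission_scores_sum {
    my ( $emissions, $features ) = @_;
    my %result;
    for my $feature (@$features) {
        my $by_label = $emissions->{$feature} or next;    # unseen feature
        $result{$_} += $by_label->{$_} for keys %$by_label;
    }
    return \%result;
}

my $scores = emission_scores_sum( \%emissions,
    [ 'POS|NOUN', 'PARENT|VERB', 'UNSEEN|X' ] );
printf "%s => %.1f\n", $_, $scores->{$_} for sort keys %$scores;
```

The result has the shape $result->{label} = score, with one entry per label that occurs with at least one of the given features.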
A get_emission_scores implementation used with algorithms where the emission scores are computed by MIRA (this is currently the most successful implementation).
A get_emission_scores implementation using only MLE. Probably obsolete now.
Methods used by the trainer (Treex::Tool::Parser::MSTperl::TrainerLabelling) to adjust the scores to whatever seems to be the best idea at the moment. Used only in MIRA training (MLE uses add_unigram, add_emission, add_transition and compute_probs_from_counts instead).
Sets the specified emission score (if label_prev is not set) or transition score (if it is) to the given value ($score).
Updates the specified emission score (if label_prev is not set) or transition score (if it is) by the given value ($update), i.e. adds that value to the current value.
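The difference between setting and updating a score, and the switch between emission and transition scores based on label_prev, can be sketched as follows. The function names, the argument order and the use of the feature-keyed transitions layout are assumptions for this example; the real method signatures are not given in this section.

```perl
use strict;
use warnings;

my %model = ( emissions => {}, transitions => {} );

# Hypothetical helper: overwrite a score. With $label_prev defined it is a
# transition score, otherwise an emission score.
sub set_score_sketch {
    my ( $m, $feature, $label, $label_prev, $score ) = @_;
    if ( defined $label_prev ) {
        $m->{transitions}{$feature}{$label_prev}{$label} = $score;
    }
    else {
        $m->{emissions}{$feature}{$label} = $score;
    }
    return;
}

# Hypothetical helper: add $update to the current score (unset counts as 0).
sub update_score_sketch {
    my ( $m, $feature, $label, $label_prev, $update ) = @_;
    if ( defined $label_prev ) {
        $m->{transitions}{$feature}{$label_prev}{$label} += $update;
    }
    else {
        $m->{emissions}{$feature}{$label} += $update;
    }
    return;
}

update_score_sketch( \%model, 'POS|NOUN', 'Subj', undef, 0.5 );   # emission
update_score_sketch( \%model, 'POS|NOUN', 'Subj', undef, -0.2 );  # emission
set_score_sketch( \%model, 'POS|NOUN', 'Obj', 'Subj', 2 );        # transition
update_score_sketch( \%model, 'POS|NOUN', 'Obj', 'Subj', 1 );     # transition

print "emission:   $model{emissions}{'POS|NOUN'}{Subj}\n";
print "transition: $model{transitions}{'POS|NOUN'}{Subj}{Obj}\n";
```

Additive updates of this kind are what an online trainer such as MIRA needs: each training step nudges the affected scores up or down rather than recomputing them from counts.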
Rudolf Rosa <rosa@ufal.mff.cuni.cz>
Copyright © 2011 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.