NAME

Cluster::Similarity - compute the similarity of two classifications.

VERSION

Version 0.02

SYNOPSIS

Compute similarity of two classifications following various cluster similarity evaluation schemes based on contingency tables.

    use Cluster::Similarity;

    # $classification_1 and $classification_2 are array or hash references
    my $sim_calculator = Cluster::Similarity->new( $classification_1, $classification_2 );

    # Pair-wise measures
    my $pair_wise_recall    = $sim_calculator->pair_wise_recall();
    my $pair_wise_precision = $sim_calculator->pair_wise_precision();
    my $pair_wise_f_score   = $sim_calculator->pair_wise_fscore();

    # Other similarity measures
    my $mutual_information = $sim_calculator->mutual_information();
    my $rand_index         = $sim_calculator->rand_index();
    my $rand_adj           = $sim_calculator->rand_adjusted($max_index);
    my $matching           = $sim_calculator->matching_index();

    # Underlying tables
    my $contingency_table = $sim_calculator->contingency();
    my $pairs_matrix      = $sim_calculator->pairs_matrix();
    my $pair_of_cell_12   = $sim_calculator->pairs(1,2);

DESCRIPTION

Computes the similarity of two word clusterings using several clustering similarity measures.

Consider, for example, the following groupings:

    clustering_1: { {a, b, c}, {d, e, f} }
    clustering_2: { {a, b}, {c, d, e}, {f} }

Cluster similarity measures provide a numerical value that helps assess how alike two such groupings are.

All cluster similarity measures implemented in this module are based on the so-called contingency table of the two classifications (clusterings). The contingency table is a matrix with a cell for each pair of classes (one from each classification), containing the number of objects present in both classes.

The similarity measures (and also the examples and tests) are taken from Chapter 4 of Sabine Schulte im Walde's PhD thesis:

Sabine Schulte im Walde. Experiments on the Automatic Induction of German Semantic Verb Classes. PhD thesis, Institut für Maschinelle Sprachverarbeitung, Universität Stuttgart, 2003. Published as AIMS Report 9(2). http://www.schulteimwalde.de/phd-thesis.html

Please see the thesis for a more in-depth description of the similarity measures and further details.

INTERFACE

Constructor

new()

Builds a new Cluster::Similarity object.

FUNCTIONS

Providing the Data

load_data(\@classification_1, \@classification_2)
load_data(\%classification_1, \%classification_2)
set_classification_1(\@classification_1), set_classification_2(\@classification_2)
set_classification_1(\%classification_1), set_classification_2(\%classification_2)

When calling these methods, the contingency tables and all previously computed similarity values are reset.

objects, object_number

Return the objects, or the number of objects, in either classification.

contingency

Compute the contingency table for two classifications. The contingency table is a matrix with a cell for each pair of classes (one class from each classification). Each cell contains the number of objects present in both classes.

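E.g., for the two example classifications from the DESCRIPTION,

    clustering_1: { {a, b, c}, {d, e, f} }
    clustering_2: { {a, b}, {c, d, e}, {f} }

the returned contingency table is: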

 {
   'c_1' => {
             'c_1' => 2,
             'c_2' => 0
            },
   'c_2' => {
             'c_1' => 1,
             'c_2' => 2
            },
   'c_3' => {
             'c_1' => 0,
             'c_2' => 1
            }
 }

This is a hash representation of the matrix:

      2  0
      1  2
      0  1

with the columns indexed by the classes of the first classification and the rows by the classes of the second classification.
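As an illustration only, the counting behind such a table can be sketched in plain Perl. The helper name `contingency_table` and the array-of-arrays layout are assumptions made for this sketch; they are not the module's API, and the module itself returns the hash form shown above.

```perl
use strict;
use warnings;

# Example clusterings from the DESCRIPTION, each a list of clusters.
my @clustering_1 = ( [qw(a b c)], [qw(d e f)] );
my @clustering_2 = ( [qw(a b)], [qw(c d e)], [qw(f)] );

# Hypothetical helper (not part of Cluster::Similarity): one row per
# cluster of the second clustering, one column per cluster of the first.
sub contingency_table {
    my ( $first, $second ) = @_;
    my @table;
    for my $row_cluster (@$second) {
        my %in_row = map { $_ => 1 } @$row_cluster;
        my @row;
        for my $col_cluster (@$first) {
            # Count the objects that the two clusters share.
            push @row, scalar( grep { $in_row{$_} } @$col_cluster );
        }
        push @table, \@row;
    }
    return \@table;
}

my $table = contingency_table( \@clustering_1, \@clustering_2 );
print join( ' ', @$_ ), "\n" for @$table;
```

Run on the example clusterings, this prints the three rows `2 0`, `1 2` and `0 1`.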

pairs_contingency

Compute the contingency table for the number of common element pairs in the two classifications.

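For the example above, each cell holds the number of object pairs that can be formed from the objects counted in the corresponding cell of the contingency table:

   1 0
   0 1
   0 0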

true_positives

True positives are the number of object pairs which occur together in both classifications.

pairs_classification_1, pairs_classification_2

Number of object pairs that occur together in a class of classification 1 (respectively classification 2).
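A minimal sketch of these pair counts, assuming the contingency table of the two example clusterings from the DESCRIPTION is available as an array of rows. The variable names and the helper `choose_2` are hypothetical and only serve this illustration.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Contingency table of the two example clusterings from the DESCRIPTION
# (rows: second classification, columns: first classification).
my @contingency = ( [ 2, 0 ], [ 1, 2 ], [ 0, 1 ] );

# Number of unordered pairs that can be formed from $n objects.
sub choose_2 {
    my ($n) = @_;
    return $n < 2 ? 0 : $n * ( $n - 1 ) / 2;
}

# True positives: pairs whose two members share a cell of the table.
my $true_positives = sum( map { choose_2($_) } map { @$_ } @contingency );

# Class sizes are the column sums (classification 1) and the row sums
# (classification 2); a class of size n contributes choose_2(n) pairs.
my @row_sums = map { sum(@$_) } @contingency;
my @col_sums = map { my $j = $_; sum( map { $_->[$j] } @contingency ) } 0 .. 1;

my $pairs_1 = sum( map { choose_2($_) } @col_sums );
my $pairs_2 = sum( map { choose_2($_) } @row_sums );

print "true positives:            $true_positives\n";
print "pairs in classification 1: $pairs_1\n";
print "pairs in classification 2: $pairs_2\n";
```

For the example this gives 2 true positives (the pairs a-b and d-e), 6 pairs in classification 1 and 4 pairs in classification 2.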

pair_wise_precision, pair_wise_recall, pair_wise_fscore

Pair-wise recall is the number of true positives divided by the number of pairs in classification 1.

Pair-wise precision is the number of true positives divided by the number of pairs in classification 2.

Pair-wise F-score is the harmonic mean of precision and recall, i.e. 2*precision*recall / (precision + recall).
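The three definitions can be checked with a few lines of Perl. This is a sketch, not the module's code: the function name `pair_wise_scores` is hypothetical, and the input counts (2 true positives, 6 and 4 pairs) were worked out by hand for the two example clusterings from the DESCRIPTION.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (recall, precision, F-score) from the
# number of true positives and the pair counts of both classifications.
sub pair_wise_scores {
    my ( $true_positives, $pairs_1, $pairs_2 ) = @_;
    my $recall    = $pairs_1 ? $true_positives / $pairs_1 : 0;
    my $precision = $pairs_2 ? $true_positives / $pairs_2 : 0;
    my $sum       = $precision + $recall;
    # Harmonic mean; defined as 0 when both scores are 0.
    my $f_score   = $sum ? 2 * $precision * $recall / $sum : 0;
    return ( $recall, $precision, $f_score );
}

# Counts for the example: 2 true positives, 6 pairs in
# classification 1, 4 pairs in classification 2.
my ( $recall, $precision, $f_score ) = pair_wise_scores( 2, 6, 4 );
printf "recall=%.3f precision=%.3f f-score=%.3f\n",
    $recall, $precision, $f_score;
```

With these counts, recall is 2/6, precision is 2/4 and the F-score is 0.4.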

mutual_information

Mutual information is a symmetric measure for the degree of dependency between two classifications, used here as introduced by Strehl et al. (2000).
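The sketch below shows one common way to write this measure: the mutual information of the contingency table, normalised by using the logarithm to base k*l, where k and l are the numbers of classes in the two classifications. Whether the module uses exactly this normalisation is an assumption, and the function name `mutual_information` is used here only for a standalone helper.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Standalone sketch (assumed formula, not the module's code):
# MI = 1/n * sum_ij t_ij * log_{k*l}( t_ij * n / (row_i * col_j) )
sub mutual_information {
    my ($table) = @_;
    my $cols     = scalar @{ $table->[0] };
    my @row_sums = map { sum(@$_) } @$table;
    my @col_sums = map { my $j = $_; sum( map { $_->[$j] } @$table ) } 0 .. $cols - 1;
    my $n        = sum(@row_sums);
    my $cells    = scalar(@row_sums) * $cols;    # k * l
    return 0 if $cells < 2 or $n == 0;           # nothing to compare

    my $mi = 0;
    for my $i ( 0 .. $#row_sums ) {
        for my $j ( 0 .. $cols - 1 ) {
            my $t = $table->[$i][$j];
            next if $t == 0;                     # 0 * log(0) counts as 0
            $mi += $t * log( $t * $n / ( $row_sums[$i] * $col_sums[$j] ) );
        }
    }
    return $mi / ( $n * log($cells) );
}

# Contingency table of the two example clusterings from the DESCRIPTION.
printf "%.4f\n", mutual_information( [ [ 2, 0 ], [ 1, 2 ], [ 0, 1 ] ] );
```

The measure is symmetric: transposing the table leaves the result unchanged.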

rand_index

The Rand index (defined by Rand, 1971) is based on the agreement vs. disagreement between object pairs in the two clusterings.
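To make the notion of agreement concrete, the following sketch computes the standard Rand index by brute force: a pair of objects counts as an agreement if both clusterings put it in the same class, or both put it in different classes. The label hashes and the function name are hypothetical; the module works from the contingency table instead.

```perl
use strict;
use warnings;

# Class label of every object in the two example clusterings
# from the DESCRIPTION.
my %label_1 = ( a => 0, b => 0, c => 0, d => 1, e => 1, f => 1 );
my %label_2 = ( a => 0, b => 0, c => 1, d => 1, e => 1, f => 2 );

# Hypothetical helper: fraction of object pairs on which the two
# clusterings agree (together in both, or apart in both).
sub rand_index {
    my ( $l1, $l2 ) = @_;
    my @objects = sort keys %$l1;
    my ( $agree, $total ) = ( 0, 0 );
    for my $i ( 0 .. $#objects - 1 ) {
        for my $j ( $i + 1 .. $#objects ) {
            my ( $x, $y ) = @objects[ $i, $j ];
            my $same_1 = $l1->{$x} == $l1->{$y};
            my $same_2 = $l2->{$x} == $l2->{$y};
            $agree++ unless ( $same_1 xor $same_2 );
            $total++;
        }
    }
    return $total ? $agree / $total : 0;
}

printf "%.3f\n", rand_index( \%label_1, \%label_2 );
```

For the example, 9 of the 15 object pairs agree (2 together in both clusterings, 7 apart in both), giving 0.6.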

rand_adjusted

Rand index adjusted for chance (Hubert and Arabie, 1985). The adopted model for randomness assumes that the two classifications are picked at random, given the original number of classes and objects; the contingency table is then constructed from the hypergeometric distribution. The general form of an index corrected for chance is:

  Index_adj = (Index - Expected Index) / (Maximum Index - Expected Index)

As the maximum index, I use the minimum of the numbers of possible pairs in the two classifications.
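A sketch of this correction under stated assumptions: the index is taken to be the number of true positives, and the expected index is the usual hypergeometric expectation from Hubert and Arabie, i.e. the product of the two pair counts divided by the number of all object pairs. The function below is hypothetical and standalone; its optional last argument mirrors the `$max_index` parameter from the SYNOPSIS and defaults to the minimum described above.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Hypothetical helper (not the module's code):
# (Index - Expected Index) / (Maximum Index - Expected Index)
sub rand_adjusted {
    my ( $index, $pairs_1, $pairs_2, $n, $max_index ) = @_;
    my $all_pairs = $n * ( $n - 1 ) / 2;
    # Assumed expectation under the hypergeometric model.
    my $expected = $pairs_1 * $pairs_2 / $all_pairs;
    $max_index = min( $pairs_1, $pairs_2 ) unless defined $max_index;
    my $denominator = $max_index - $expected;
    return $denominator ? ( $index - $expected ) / $denominator : 0;
}

# Example from the DESCRIPTION: 2 true positives, 6 and 4 pairs,
# 6 objects (counts worked out by hand).
printf "%.4f\n", rand_adjusted( 2, 6, 4, 6 );
```

Hubert and Arabie use the mean of the two pair counts as the maximum index; passing it explicitly, e.g. `rand_adjusted( 2, 6, 4, 6, 5 )`, shows how the choice of maximum changes the result.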

matching_index

Matching index (Fowlkes and Mallows, 1983).
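The Fowlkes and Mallows index is usually written as the number of true positives divided by the geometric mean of the two pair counts. A minimal sketch, assuming that this is the form meant here; the helper below is standalone and hypothetical, and the counts are those of the example from the DESCRIPTION, worked out by hand.

```perl
use strict;
use warnings;

# Hypothetical helper: true positives over the geometric mean of the
# pair counts of the two classifications.
sub matching_index {
    my ( $true_positives, $pairs_1, $pairs_2 ) = @_;
    return 0 if $pairs_1 == 0 or $pairs_2 == 0;
    return $true_positives / sqrt( $pairs_1 * $pairs_2 );
}

# Example: 2 true positives, 6 and 4 pairs.
printf "%.4f\n", matching_index( 2, 6, 4 );
```

Written this way, the index equals the geometric mean of pair-wise precision and pair-wise recall.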

DIAGNOSTICS

<Need reference to classification>

When a "Providing the data" method is called without enough arguments.

<Classifications must be passed as array or hash references>

Argument of wrong type.

<Please set/load classifications before calling ... method>

Method was called without providing classification data first, i.e. without calling one of the "Providing the data" methods.

<Need data for classification 1/2>

Data for classification 1 (2 resp.) is missing.

CONFIGURATION AND ENVIRONMENT

Cluster::Similarity requires no configuration files or environment variables.

DEPENDENCIES

Carp
Class::Std
List::Util qw(sum min)
Math::Combinatorics

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to bug-cluster-similarity@rt.cpan.org, or through the web interface at http://rt.cpan.org.

AUTHOR

Ingrid Falk, <ingrid dot falk at loria dot fr>

BUGS

Please report any bugs or feature requests to bug-cluster-similarity at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Cluster-Similarity. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Cluster::Similarity

COPYRIGHT & LICENSE

Copyright 2008 Ingrid Falk, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
