NAME

Set::Partitions::Similarity - Routines to measure similarity of partitions.

SYNOPSIS

  use Set::Partitions::Similarity qw(getAccuracyAndPrecision);
  use Data::Dump qw(dump);

  # set elements are Perl strings, sets are array references
  # partitions are nested arrays.
  dump getAccuracyAndPrecision ([[qw(a b)],[1,2]], [[qw(a b 1)],[2]]);
  # dumps:
  # ("0.5", "0.25")

  # a partition is equivalent to itself, even the empty partition.
  dump getAccuracyAndPrecision ([[1,2], [3,4]], [[2,1], [4,3]]);
  dump getAccuracyAndPrecision ([], []);
  # dumps:
  # (1, 1)
  # (1, 1)

  # accuracy and precision are symmetric functions.
  my ($p, $q) = ([[1,2,3], [4]], [[1], [2,3,4]]);
  dump getAccuracyAndPrecision ($p, $q);
  dump getAccuracyAndPrecision ($q, $p);
  # dumps:
  # ("0.333333333333333", "0.2")
  # ("0.333333333333333", "0.2")

  # checks partitions and throws an exception.
  eval { getAccuracyAndPrecision ([[1]], [[1,2]], 1); };
  warn $@ if $@;
  # dumps:
  # partitions are invalid, they have different set elements.

DESCRIPTION

A partition of a set is a collection of mutually disjoint subsets of the set whose union is the set. Set::Partitions::Similarity provides routines that measure the accuracy and precision between two partitions of a set. The measures can assess the performance of a binary clustering algorithm by comparing the clusters the algorithm creates against the correct clusters of test data.

Accuracy and Precision

Let S be a set of n elements and let P be a partition of S. Let T(S) be the set of all sets of two distinct elements of S; so T(S) has n*(n-1)/2 sets. The partition P uniquely defines a partitioning of T(S) into two sets, C(P) and D(P) where C(P) is the set of all pairs in T(S) such that both elements of a pair occur in the same set in P, and define D(P) as T(S)-C(P), the complement.

Given two partitions P and Q of the set S, the accuracy is defined as (|C(P) ^ C(Q)| + |D(P) ^ D(Q)|) / (n*(n-1)/2), where | | gives the size of a set and ^ represents the intersection operator. The precision is defined as |C(P) ^ C(Q)| / (|C(P) ^ C(Q)| + |C(P) ^ D(Q)| + |D(P) ^ C(Q)|). The accuracy and precision return values ranging from zero (no similarity) to one (equivalent partitions). The distance between two partitions is defined as 1-accuracy, and in mathematics is a metric. The distance returns values ranging from zero (equivalent partitions) to one (no similarity).

All the methods implemented that compute the accuracy, precision, and distance run in time linear in the number of elements of the set partitioned.

ROUTINES

`areSubsetsDisjoint ($Partition)`

The routine areSubsetsDisjoint returns true if the subsets of the partition are disjoint, false otherwise. It can be used to check the validity of a partition.

$Partition: The partition is stored as a nested array reference of the form [[],...[]]. For example, the set partition {{a,b}, {1,2}} of the set {a,b,1,2} should be stored as the nested array reference [['a','b']],[1,2]]. Note the elements of a set are represented as Perl strings.

An example:

  use Set::Partitions::Similarity qw(areSubsetsDisjoint);
  use Data::Dump qw(dump);
  dump areSubsetsDisjoint ([[1,2,3], [4]]);
  dump areSubsetsDisjoint ([[1,2,3], [4,1]]);
  # dumps:
  # "1"
  # "0"

`getAccuracy ($PartitionP, $PartitionQ, $CheckValidity)`

The routine getAccuracy returns the accuracy of the two partitions.

$PartitionP, $PartitionQ: The partitions are stored as nested array references of the form [[],...[]]. For example, the set partition {{a,b}, {1,2}} of the set {a,b,1,2} should be stored as the nested array references [['a','b']],[1,2]]. Note the elements of a set are represented as Perl strings.
$CheckValidity: If $CheckValidity evaluates to true, then checks are performed to ensure both partitions are valid and an exception is thrown if they are not. The default is false.

An example:

  use Set::Partitions::Similarity qw(getAccuracy);
  use Data::Dump qw(dump);
  dump getAccuracy ([[qw(a b)], [qw(c d)]], [[qw(a b c d)]]);
  dump getAccuracy ([[qw(a b c d)]], [[qw(a b)], [qw(c d)]]);
  # dumps:
  # "0.333333333333333"
  # "0.333333333333333"

`getAccuracyAndPrecision ($PartitionP, $PartitionQ, $CheckValidity)`

The routine getAccuracyAndPrecision returns the accuracy and precision of the two partitions as an array (accuracy, precision).

$PartitionP, $PartitionQ: The partitions are stored as nested array references of the form [[],...[]]. For example, the set partition {{a,b}, {1,2}} of the set {a,b,1,2} should be stored as the nested array references [['a','b']],[1,2]]. Note the elements of a set are represented as Perl strings.
$CheckValidity: If $CheckValidity evaluates to true, then checks are performed to ensure both partitions are valid and an exception is thrown if they are not. The default is false.

An example:

  use Set::Partitions::Similarity qw(getAccuracyAndPrecision);
  use Data::Dump qw(dump);
  dump getAccuracyAndPrecision ([[1,2], [3,4]], [[1], [2], [3], [4]]);
  dump getAccuracyAndPrecision ([[1], [2], [3], [4]], [[1,2], [3,4]]);
  # dumps:
  # ("0.666666666666667", 0)
  # ("0.666666666666667", 0)

`getDistance ($PartitionP, $PartitionQ, $CheckValidity)`

The routine getDistance returns 1-accuracy of the two partitions, or 1-getAccuracy($PartitionP, $PartitionQ, $CheckValidity).

$PartitionP, $PartitionQ: The partitions are stored as nested array references of the form [[],...[]]. For example, the set partition {{a,b}, {1,2}} of the set {a,b,1,2} should be stored as the nested array references [['a','b']],[1,2]]. Note the elements of a set are represented as Perl strings.
$CheckValidity: If $CheckValidity evaluates to true, then checks are performed to ensure both partitions are valid and an exception is thrown if they are not. The default is false.

An example:

  use Set::Partitions::Similarity qw(getDistance);
  use Data::Dump qw(dump);
  dump getDistance ([[1,2,3], [4]], [[1], [2,3,4]]);
  # dumps:
  # "0.666666666666667"

`getPrecision ($PartitionP, $PartitionQ, $CheckValidity)`

The routine getPrecision returns the precision of the two partitions.

$PartitionP, $PartitionQ: The partitions are stored as nested array references of the form [[],...[]]. For example, the set partition {{a,b}, {1,2}} of the set {a,b,1,2} should be stored as the nested array references [['a','b']],[1,2]]. Note the elements of a set are represented as Perl strings.
$CheckValidity: If $CheckValidity evaluates to true, then checks are performed to ensure both partitions are valid and an exception is thrown if they are not. The default is false.

An example:

  use Set::Partitions::Similarity qw(getPrecision);
  use Data::Dump qw(dump);
  dump getPrecision ([[1,2,3], [4]], [[1], [2,3,4]]);
  # dumps:
  # "0.2"

EXAMPLE

The code following measures the distance of a set of 512 elements partitioned equally into subsets of size $s to the entire set.

  use Set::Partitions::Similarity qw(getDistance);
  my @p = ([0..511]);
  for (my $s = 1; $s <= 512; $s += $s)
  {
    my @q = map { [$s*$_..($s*$_+$s-1)] } (0..(512/$s-1));
    print join (', ', $s, getDistance (\@p, \@q, 1)) . "\n";
  }
  # dumps:
  # 1, 1
  # 2, 0.998043052837573
  # 4, 0.99412915851272
  # 8, 0.986301369863014
  # 16, 0.970645792563601
  # 32, 0.939334637964775
  # 64, 0.876712328767123
  # 128, 0.75146771037182
  # 256, 0.500978473581213
  # 512, 0

INSTALLATION

To install the module run the following commands:

  perl Makefile.PL
  make
  make test
  make install

If you are on a windows box you should use 'nmake' rather than 'make'.

BUGS

Please email bugs reports or feature requests to bug-set-partitions-similarities@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Set-Partitions-Similarity. The author will be notified and you can be automatically notified of progress on the bug fix or feature request.

AUTHOR

 Jeff Kubina<jeff.kubina@gmail.com>

COPYRIGHT

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

accuracy, clustering, measure, metric, partitions, precision, set, similarity

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)

NAME

SYNOPSIS

DESCRIPTION

Accuracy and Precision

ROUTINES

areSubsetsDisjoint ($Partition)

getAccuracy ($PartitionP, $PartitionQ, $CheckValidity)

getAccuracyAndPrecision ($PartitionP, $PartitionQ, $CheckValidity)

getDistance ($PartitionP, $PartitionQ, $CheckValidity)

getPrecision ($PartitionP, $PartitionQ, $CheckValidity)

EXAMPLE

INSTALLATION

BUGS

AUTHOR

COPYRIGHT

KEYWORDS

SEE ALSO

Module Install Instructions

`areSubsetsDisjoint ($Partition)`

`getAccuracy ($PartitionP, $PartitionQ, $CheckValidity)`

`getAccuracyAndPrecision ($PartitionP, $PartitionQ, $CheckValidity)`

`getDistance ($PartitionP, $PartitionQ, $CheckValidity)`

`getPrecision ($PartitionP, $PartitionQ, $CheckValidity)`