lib/Set/Partitions/Similarity.pm

package Set::Partitions::Similarity;

use strict;
use warnings;
#use Data::Dump qw(dump);

BEGIN
{
  use Exporter ();
  use vars qw($VERSION @ISA @EXPORT @EXPORT_OK %EXPORT_TAGS);
  $VERSION     = '0.54';
  @ISA         = qw(Exporter);
  @EXPORT      = qw();
  @EXPORT_OK   = qw(getAccuracy getAccuracyAndPrecision getDistance getPrecision areSubsetsDisjoint);
  %EXPORT_TAGS = ();
}

#Routines to measure similarity of partitions.
#012345678901234567890123456789012345678901234

=head1 NAME

C<Set::Partitions::Similarity> - Routines to measure similarity of partitions.

=head1 SYNOPSIS

  use Set::Partitions::Similarity qw(getAccuracyAndPrecision);
  use Data::Dump qw(dump);

  # set elements are Perl strings, sets are array references
  # partitions are nested arrays.
  dump getAccuracyAndPrecision ([[qw(a b)],[1,2]], [[qw(a b 1)],[2]]);
  # dumps:
  # ("0.5", "0.25")

  # a partition is equivalent to itself, even the empty partition.
  dump getAccuracyAndPrecision ([[1,2], [3,4]], [[2,1], [4,3]]);
  dump getAccuracyAndPrecision ([], []);
  # dumps:
  # (1, 1)
  # (1, 1)

  # accuracy and precision are symmetric functions.
  my ($p, $q) = ([[1,2,3], [4]], [[1], [2,3,4]]);
  dump getAccuracyAndPrecision ($p, $q);
  dump getAccuracyAndPrecision ($q, $p);
  # dumps:
  # ("0.333333333333333", "0.2")
  # ("0.333333333333333", "0.2")

  # checks partitions and throws an exception.
  eval { getAccuracyAndPrecision ([[1]], [[1,2]], 1); };
  warn $@ if $@;
  # dumps:
  # partitions are invalid, they have different set elements.

=head1 DESCRIPTION

A partition of a set is a collection of mutually disjoint subsets of the set
whose union is the set. C<Set::Partitions::Similarity> provides routines
that measure the I<accuracy> and I<precision> between two partitions of a set. The measures can
assess the performance of a binary clustering algorithm by comparing
the clusters the algorithm creates against the correct clusters of test data.

=head2 Accuracy and Precision

Let C<S> be a set of C<n> elements and let C<P> be a partition of C<S>. Let C<T(S)>
be the set of all sets of two distinct elements of C<S>; so C<T(S)> has C<n*(n-1)/2> sets.
The partition C<P> uniquely defines a partitioning of C<T(S)> into two sets, C<C(P)> and C<D(P)> where
C<C(P)> is the set of all pairs in C<T(S)> such that both elements of a pair
occur in the same set in C<P>, and define C<D(P)> as C<T(S)-C(P)>, the complement.

Given two partitions C<P> and C<Q> of the set C<S>, the I<accuracy> is defined as
C<(|C(P) ^ C(Q)| + |D(P) ^ D(Q)|) / (n*(n-1)/2)>, where | | gives the size of a set and
^ represents the intersection operator. The I<precision> is defined as
C<|C(P) ^ C(Q)| / (|C(P) ^ C(Q)| + |C(P) ^ D(Q)| + |D(P) ^ C(Q)|)>. The I<accuracy> and
I<precision> return values ranging from zero (no similarity) to one (equivalent partitions).
The I<distance> between two partitions is defined as I<1-accuracy>, and in mathematics is a metric.
The I<distance> returns values ranging from zero (equivalent partitions) to one (no similarity).

All the methods implemented that compute the I<accuracy>, I<precision>, and I<distance> run in time linear in the
number of elements of the set partitioned.

=head1 ROUTINES

=head2 C<areSubsetsDisjoint ($Partition)>

The routine C<areSubsetsDisjoint> returns true if the subsets of the partition are disjoint,
false otherwise. It can be used to check the validity of a partition.

=over

=item C<$Partition>

The partition is stored as a nested array reference of the form
C<[[],...[]]>. For example, the set partition C<{{a,b}, {1,2}}> of the set
C<{a,b,1,2}> should be stored as the nested array reference
C<[['a','b']],[1,2]]>. Note the elements
of a set are represented as Perl strings.

=back

An example:

  use Set::Partitions::Similarity qw(areSubsetsDisjoint);
  use Data::Dump qw(dump);
  dump areSubsetsDisjoint ([[1,2,3], [4]]);
  dump areSubsetsDisjoint ([[1,2,3], [4,1]]);
  # dumps:
  # "1"
  # "0"

=cut

# a valid partition has all the subsets mutually disjoint. this routine
# returns 0 if it finds two distinct subsets have an element in common.
# this is done in time linear in the number of elements by computing the
# prefix union of the sets in the partition using a hash.
sub areSubsetsDisjoint
{
  # the hash %prefixUnionOfSubsets holds the union of elements in each subset
  # as they are checked for elements that occur in more than one subset.
  my %prefixUnionOfSubsets;
  foreach my $subset (@{$_[0]})
  {
    # since it is possible that a subset could contain a repeating element,
    # first each element is checked without adding it to the hash.
    foreach my $element (@$subset)
    {
      if (exists $prefixUnionOfSubsets{$element})
      {
        # if the second parameter is true, throw and exception, otherwise return false.
        if (defined ($_[1]) && $_[1])
        {
          die "element '$element' occurs in two of the subsets.\n";
        }
        else
        {
          return 0;
        }
      }
    }

    # now we can add all the elements to the hash.
    foreach my $element (@$subset)
    {
      $prefixUnionOfSubsets{$element} = 1;
    }
  }

  # the subsets are disjoint, return true.
  return 1;
}

=head2 C<getAccuracy ($PartitionP, $PartitionQ, $CheckValidity)>

The routine C<getAccuracy> returns the I<accuracy> of the
two partitions.

=over

=item C<$PartitionP, $PartitionQ>

The partitions are stored as nested array references of the form
C<[[],...[]]>. For example, the set partition C<{{a,b}, {1,2}}> of the set
C<{a,b,1,2}> should be stored as the nested array references
C<[['a','b']],[1,2]]>. Note the elements
of a set are represented as Perl strings.

=item C<$CheckValidity>

If C<$CheckValidity> evaluates to true, then checks are performed to
ensure both partitions are valid and an exception is thrown if they
are not. The default is false.

=back

An example:

  use Set::Partitions::Similarity qw(getAccuracy);
  use Data::Dump qw(dump);
  dump getAccuracy ([[qw(a b)], [qw(c d)]], [[qw(a b c d)]]);
  dump getAccuracy ([[qw(a b c d)]], [[qw(a b)], [qw(c d)]]);
  # dumps:
  # "0.333333333333333"
  # "0.333333333333333"

=cut

sub getAccuracy
{
  my ($ReferencePartition, $ModelPartition, $CheckValidity) = @_;

  # get both similarities.
  my ($accuracy, $precision) = getAccuracyAndPrecision ($ReferencePartition, $ModelPartition, $CheckValidity);

  # return just the accuracy.
  return $accuracy;
}


=head2 C<getAccuracyAndPrecision ($PartitionP, $PartitionQ, $CheckValidity)>

The routine C<getAccuracyAndPrecision> returns the I<accuracy> and I<precision> of the
two partitions as an array C<(accuracy, precision)>.

=over

=item C<$PartitionP, $PartitionQ>

The partitions are stored as nested array references of the form
C<[[],...[]]>. For example, the set partition C<{{a,b}, {1,2}}> of the set
C<{a,b,1,2}> should be stored as the nested array references
C<[['a','b']],[1,2]]>. Note the elements
of a set are represented as Perl strings.

=item C<$CheckValidity>

If C<$CheckValidity> evaluates to true, then checks are performed to
ensure both partitions are valid and an exception is thrown if they
are not. The default is false.

=back

An example:

  use Set::Partitions::Similarity qw(getAccuracyAndPrecision);
  use Data::Dump qw(dump);
  dump getAccuracyAndPrecision ([[1,2], [3,4]], [[1], [2], [3], [4]]);
  dump getAccuracyAndPrecision ([[1], [2], [3], [4]], [[1,2], [3,4]]);
  # dumps:
  # ("0.666666666666667", 0)
  # ("0.666666666666667", 0)

=cut

sub getAccuracyAndPrecision
{
  my ($ReferencePartition, $ModelPartition, $CheckValidity) = @_;

  # get the base count of edge types.
  my ($sameRefSameModel, $sameRefDiffModel, $diffRefSameModel, $diffRefDiffModel) = getBaseEdgeCounts ($ReferencePartition, $ModelPartition, $CheckValidity);

  # get the total number of bases edges.
  my $baseEdges = $sameRefSameModel + $sameRefDiffModel + $diffRefSameModel;

  # if there are no base edges, the precision is one.
  my $precision = 1;

  # get the precision.
  $precision = $sameRefSameModel / $baseEdges if $baseEdges;

  # get the total number of edges.
  my $totalEdges = $sameRefSameModel + $sameRefDiffModel + $diffRefSameModel + $diffRefDiffModel;

  # if there are no edges, the accuracy is one.
  my $accuracy = 1;

  # get the accuracy.
  $accuracy = ($sameRefSameModel + $diffRefDiffModel) / $totalEdges if $totalEdges;

  return ($accuracy, $precision);
}


=head2 C<getDistance ($PartitionP, $PartitionQ, $CheckValidity)>

The routine C<getDistance> returns I<1-accuracy> of the
two partitions, or C<1-getAccuracy($PartitionP, $PartitionQ, $CheckValidity)>.

=over

=item C<$PartitionP, $PartitionQ>

The partitions are stored as nested array references of the form
C<[[],...[]]>. For example, the set partition C<{{a,b}, {1,2}}> of the set
C<{a,b,1,2}> should be stored as the nested array references
C<[['a','b']],[1,2]]>. Note the elements
of a set are represented as Perl strings.

=item C<$CheckValidity>

If C<$CheckValidity> evaluates to true, then checks are performed to
ensure both partitions are valid and an exception is thrown if they
are not. The default is false.

=back

An example:

  use Set::Partitions::Similarity qw(getDistance);
  use Data::Dump qw(dump);
  dump getDistance ([[1,2,3], [4]], [[1], [2,3,4]]);
  # dumps:
  # "0.666666666666667"

=cut

sub getDistance
{
  my $accuracy = getAccuracy (@_);
  return 1 - $accuracy if defined $accuracy;
  return undef;
}


=head2 C<getPrecision ($PartitionP, $PartitionQ, $CheckValidity)>

The routine C<getPrecision> returns the I<precision> of the
two partitions.

=over

=item C<$PartitionP, $PartitionQ>

The partitions are stored as nested array references of the form
C<[[],...[]]>. For example, the set partition C<{{a,b}, {1,2}}> of the set
C<{a,b,1,2}> should be stored as the nested array references
C<[['a','b']],[1,2]]>. Note the elements
of a set are represented as Perl strings.

=item C<$CheckValidity>

If C<$CheckValidity> evaluates to true, then checks are performed to
ensure both partitions are valid and an exception is thrown if they
are not. The default is false.

=back

An example:

  use Set::Partitions::Similarity qw(getPrecision);
  use Data::Dump qw(dump);
  dump getPrecision ([[1,2,3], [4]], [[1], [2,3,4]]);
  # dumps:
  # "0.2"

=cut

sub getPrecision
{
  my ($ReferencePartition, $ModelPartition, $CheckValidity) = @_;

  # get both Similarity.
  my ($accuracy, $precision) = getAccuracyAndPrecision ($ReferencePartition, $ModelPartition, $CheckValidity);

  # return just the precision.
  return $precision;
}


sub getBaseEdgeCounts
{
  my ($ReferencePartition, $ModelPartition, $CheckValidity) = @_;

  # validates the partitions or throws an exception.
  validatePartitionsOrDie ($ReferencePartition, $ModelPartition) if $CheckValidity;

  my ($sameRefSameModel, $sameRefDiffModel) = getPartitionsEdgeCounts ($ReferencePartition, $ModelPartition);
  my ($sameModelSameRef, $diffRefSameModel) = getPartitionsEdgeCounts ($ModelPartition, $ReferencePartition);

  # make sure the number of edges for
  if ($sameRefSameModel != $sameModelSameRef)
  {
    die "programming error: computed different values for number of edges in same partitions.\n";
  }

  # get the number of elements in the universe of the sets.
  my $totalElements;
  {
    my %universe = map { ($_, 1) } map { @$_ } @$ReferencePartition;
    $totalElements = scalar keys %universe;
  }

  # get the total edges.
  my $totalEdges = ($totalElements * ($totalElements - 1)) / 2;

  return ($sameRefSameModel, $sameRefDiffModel, $diffRefSameModel, $totalEdges - $sameRefSameModel - $sameRefDiffModel - $diffRefSameModel);
}


sub getPartitionsEdgeCounts
{
  my ($ReferencePartition, $ModelPartition) = @_;

  my %modelId;
  for (my $id = 0; $id < @$ModelPartition; $id++)
  {
    my $subset = $ModelPartition->[$id];
    foreach my $element (@$subset)
    {
      $modelId{$element} = $id;
    }
  }

  # count the number of edges in the same partitions and the number in the
  # same reference partitions but difference model partitions.
  my $samePartitions = 0;
  my $sameRefDiffModel = 0;
  foreach my $subset (@$ReferencePartition)
  {
    my %subsetModelPartitionCounts;

    # need to ensure the subset elements are unique.
    {
      my %elements;
      for (my $i = 0; $i < @$subset; $i++)
      {
        unless (exists ($elements{$subset->[$i]}))
        {
          $elements{$subset->[$i]} = 1;
          ++$subsetModelPartitionCounts{$modelId{$subset->[$i]}};
        }
      }
    }

    # get the sizes of the model partitions of the subset.
    my @subsetPartitionSizes = values %subsetModelPartitionCounts;

    # count the number of edges having nodes in the same partitions.
    foreach my $size (@subsetPartitionSizes)
    {
      $samePartitions += ($size * ($size - 1)) / 2;
    }

    # count the number of edges having nodes in the same reference partitions
    # but different model partitions.
    my $prefixSumOfSizes;
    $prefixSumOfSizes = $subsetPartitionSizes[0] if @subsetPartitionSizes;
    for (my $i = 1; $i < @subsetPartitionSizes; $i++)
    {
      $sameRefDiffModel += $prefixSumOfSizes * $subsetPartitionSizes[$i];
      $prefixSumOfSizes += $subsetPartitionSizes[$i];
    }
  }

  return ($samePartitions, $sameRefDiffModel);
}


# for the set partitions to be valid, the union of sets of each partition
# must be equal. the routine returns true of they are, false if not.
sub doPartitionsHaveSameUnion
{
  my ($ReferencePartition, $ModelPartition) = @_;

  # add all the reference elements to the hash.
  my %universe = map { ($_, 1) } map { @$_ } @$ReferencePartition;

  # now check each subset of the model partition.
  foreach my $subset (@$ModelPartition)
  {
    # return 0 if an element from the subset is missing.
    foreach my $element (@$subset)
    {
      return 0 unless exists $universe{$element};
    }

    # delete all the elements in the hash from the subset.
    foreach my $element (@$subset)
    {
      delete $universe{$element};
    }
  }

  # if there are any elements remaining return 0.
  return 0 if %universe;

  return 1;
}


# this routine checks that the two partitions have the same union and
# each partition is composed for sets that a mutually disjoint. the
# routine throws and exception is the partitions are invalid.
sub validatePartitionsOrDie
{
  my ($ReferencePartition, $ModelPartition) = @_;

  # make sure the reference partition is valid.
  unless (areSubsetsDisjoint ($ReferencePartition))
  {
    die "first partition is an invalid partition, an element occurs in two or more subsets.\n";
  }

  # make sure the model partition is valid.
  unless (areSubsetsDisjoint ($ModelPartition))
  {
    die "second partition is an invalid partition, an element occurs in two or more subsets.\n";
  }

  # make sure the partitions have the same universe.
  unless (doPartitionsHaveSameUnion ($ReferencePartition, $ModelPartition))
  {
    die "partitions are invalid, they have different set elements.\n";
  }

  return 1;
}

=head1 EXAMPLE

The code following measures the I<distance> of a set of 512 elements partitioned
equally into subsets of size C<$s> to the entire set.

  use Set::Partitions::Similarity qw(getDistance);
  my @p = ([0..511]);
  for (my $s = 1; $s <= 512; $s += $s)
  {
    my @q = map { [$s*$_..($s*$_+$s-1)] } (0..(512/$s-1));
    print join (', ', $s, getDistance (\@p, \@q, 1)) . "\n";
  }
  # dumps:
  # 1, 1
  # 2, 0.998043052837573
  # 4, 0.99412915851272
  # 8, 0.986301369863014
  # 16, 0.970645792563601
  # 32, 0.939334637964775
  # 64, 0.876712328767123
  # 128, 0.75146771037182
  # 256, 0.500978473581213
  # 512, 0

=head1 INSTALLATION

To install the module run the following commands:

  perl Makefile.PL
  make
  make test
  make install

If you are on a windows box you should use 'nmake' rather than 'make'.

=head1 BUGS

Please email bugs reports or feature requests to C<bug-set-partitions-similarities@rt.cpan.org>, or through
the web interface at L<http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Set-Partitions-Similarity>.  The author
will be notified and you can be automatically notified of progress on the bug fix or feature request.

=head1 AUTHOR

 Jeff Kubina<jeff.kubina@gmail.com>

=head1 COPYRIGHT

Copyright (c) 2009 Jeff Kubina. All rights reserved.
This program is free software; you can redistribute
it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the
LICENSE file included with this module.

=head1 KEYWORDS

accuracy, clustering, measure, metric, partitions, precision, set, similarity

=head1 SEE ALSO

=begin html

<p>Concise explainations of many cluster validity measures (including set partition measures) are available on
the <a href="http://machaon.karanagai.com/validation_algorithms.html">Cluster validity algorithms</a> page
of the <a href="http://machaon.karanagai.com/">Machaon Clustering and Validation Environment</a> web site
 by Nadia Bolshakova.</p>

<p>The Wikipedia article <a href="http://en.wikipedia.org/wiki/Accuracy_and_precision">Accuracy and precision</a> has a good explaination
of the <em>accuracy</em> and <em>precision</em> measures when applied to
<a href="http://en.wikipedia.org/wiki/Accuracy_and_precision#Accuracy_and_precision_in_binary_classification">binary classifications</a>.</p>

<p>The report <a href="http://bit.ly/fXWIgb">Objective Criteria for the Evaluation of Clustering Methods</a> (1971)
by W.M. Rand in the Journal of the American Statistical Association provides an excellent analysis of the <em>accuracy</em>
measure of partitions.</p>

=end html

=cut

1;
# The preceding line will help the module return a true value
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)