The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Algorithm::ClusterPoints - find clusters inside a set of points

SYNOPSIS

  use Algorithm::ClusterPoints;
  my $clp = Algorithm::ClusterPoints->new( dimension => 3,
                                           radius => 1.0,
                                           minimum_size => 2,
                                           ordered => 1 );
  for my $p (@points) {
      $clp->add_point($p->{x}, $p->{y}, $p->{z});
  }
  my @clusters = $clp->clusters_ix;
  for my $i (0..$#clusters) {
      print( join( ' ',
                   "cluster $i:",
                   map {
                       my ($x, $y, $z) = $clp->point_coords($_);
                       "($_: $x, $y, $z)"
                   } @{$clusters[$i]}
                 ), "\n"
           );
  }

DESCRIPTION

This module implements an algorithm to find clusters of points inside a set.

Clusters are defined as sets of points where it is possible to stablish a way between any pair of points moving from point to point inside the cluster in steps smaller than a given radius.

Points can have any dimension from one to infinitum, though the algorithm performance degrades quickly as the dimension increases (it has O((2*D)^D) complexity).

The algorithm input parameters are:

$dimension

Dimension of the problem space. For instance, for finding clusters on a geometric plane, dimension will be 2.

$radius

A point is part of a cluster when there is at least another point from the cluster that is at a distance smaller than $radius from it.

$minimum_size

Minimum_number of points required to form a cluster, the default is one.

@points

The coordinates of the points

$ordered

Order the points inside the clusters by their indexes and also order the clusters by the index of the contained points.

Ordering the output data is optional because it can be an computational expensive operation.

API

This module has an object oriented interface with the following methods:

Algorithm::ClusterPoints->new(%args)

returns a new object.

The accepted arguments are:

dimension => $dimension

number of dimensions of the points (defaul is 2).

radius => $radius

maximum aceptable distance between two points to form a cluster (default is 1.0).

minimum_size => $minimum_size

minimun cluster size (default is 1).

ordered => $ordered

sort the returned data structures (default is false).

scales => [$x_scale, $y_scale, ...]

point coordinates are scaled by the coefficients given.

dimensional_groups => \@dimension_groups

See the "Using hypercylindrical distances" chapter below.

$clp->add_point($x, $y, $z, ...)
$clp->add_points($x0, $y0, $z0..., $x1, $y1, $z1..., ...);

These methods register points into the algorithm.

They return the index of the (first) point added.

$clp->radius
$clp->radius($radius)
$clp->minimum_size
$clp->minimum_size($minimum_size)
$clp->ordered
$clp->ordered($ordered)

These methods get or set the algorithm parameters.

@scales = $clp->scales;
@scales = $clp->scales($xs, $ys, $zs, ...);

gets/sets the scales for all the dimensions.

@coords = $clp->point_coords($index)

returns the coordinates of the point at index $index.

@clusters_ix = $clp->clusters_ix

returns a list of clusters defined by the indexes of the points inside

The data estructure returned is a list of arrays. Every array represents a cluster and contains the indexes of the points inside.

For instance:

  @clusters_ix = ( [ 0, 1, 5, 10, 13, 15, 17, 31, 32, 38 ],
                   [ 2, 12, 20, 26, 27, 29, 33 ],
                   [ 3, 22, 39 ],
                   [ 4, 11, 16, 30, 36 ],
                   [ 6, 14 ],
                   [ 7, 23, 24 ],
                   [ 18, 25 ],
                   [ 21, 40 ] );

You can get back the coordinates of the points using the method point_coords, as for instance:

   for my $c (@clusters_ix) {
     for my $index (@$c) {
       my ($x, $y, $z) = $clp->point_coords($index);
       ...

Or you can use the method clusters described below that already returns point coordinates.

@clusters = $clp->clusters

returns a list of clusters defined by the coordinates of the points inside.

This method is similar to clusters_ix but instead of the point indexes, it includes the point coordinates inside the cluster arrays.

This is a sample of the returned structure:

  @clusters = ( [ 0.49, 0.32, 0.55, 0.32, 0.66, 0.33 ],
                [ 0.95, 0.20, 0.83, 0.27, 0.90, 0.20 ],
                [ 0.09, 0.09, 0.01, 0.08, 0.12, 0.15 ],
                [ 0.72, 0.42, 0.67, 0.47 ],
                [ 0.83, 0.11, 0.77, 0.13, 0.73, 0.07 ],
                [ 0.37, 0.38, 0.36, 0.44 ],
                [ 0.16, 0.79, 0.14, 0.74 ] );

Note that the coordinate values for all the dimensions are interleaved inside the arrays.

Using hypercylindrical distances

By default distances between points are meassured as euclidean distances. That means that two points A and B form a cluster when B is inside the hypersphere of radius $radius and center A. We will call this hypersphere the clustering limit surface for point A.

Sometimes, specially when the dimensions represent unrelated entities, it is desirable to use hypercylinders as the clustering limit surfaces.

For instance, suppose we have a set of three dimensional points ($x, $y, $t) where the first two dimensions represent coordinates over a geometrical plane and the third coordinate represents time.

It doesn't make sense to mix space and time to calculate a unique distance, and so to have a spherical clustering limit surface. What we need is to set independent limits for geometrical and temporal dimensions, for instance $geo_distance < $geo_radius and $temp_distance < $temp_radius and these pair of constraints define a cylinder on our three-dimensional problem space.

In the general multidimensional case, instead of cylinders, we talk about hypercylinders but the logic behind is the same, dimensions are divided in several groups (d-groups) following some problem defined relation and two points form a cluster when all the subdistances are smaller than the radius (where subdistance is the euclidean distance considering only the dimensions in a d-group). Note that every d-group defines a hypercylinder base.

The method that allows to define the hypercylindrical shape is as follows:

$clp->dimensional_groups(\@group0, \@group1, ...)

where @group0, @group1, ... are lists of dimension indexes.

For instance, for a three dimensional problem with dimensions X, Y and T (in that order), to form a group with the dimensions X and Y and another with the dimension T, the following call has to be used:

  $clp->dimensional_groups([0, 1], [2]);

The dimensional groups can also be set when the constructor is called:

  my $clp = Algoritm::ClusterPoints->new(
                       dimensional_groups => [[0, 1], [2]],
                       ...);

Usually, when using dimensional groups, you would also want to use the scales method to set different scales for every dimension group.

Following with the previous example, supposing X and Y are given in meters and T in seconds, to find the clusters with radius between points of 1Km and 2 days, the following scales should be used:

  my $spc_scl = 1/1000;
  my $tmp_scl = 1/(2 * 24 * 60 * 60);

  $clp = Algorithm::ClusterPoints->new(
                     dimensional_groups => [[0, 1], [2]],
                     scales => [$spc_scl, $spc_scl, $tmp_scl],
                     ...);

SEE ALSO

All began on this PerlMonks discussion: http://perlmonks.org/?node_id=694892.

Algorithm::Cluster is a Perl wrapper for the C Clustering Library.

COPYRIGHT AND LICENSE

Copyright (C) 2008 by Salvador Fandiño (sfandino@yahoo.com)

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.