The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
<?xml version="1.0" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Algorithm::KMeans - Clustering multi-dimensional data with a pure-Perl implementation</title>
<meta http-equiv="content-type" content="text/html; charset=utf-8" />
<link rev="made" href="mailto:root@localhost" />
</head>

<body style="background-color: white">


<!-- INDEX BEGIN -->
<div name="index">
<p><a name="__index__"></a></p>

<ul>

	<li><a href="#name">NAME</a></li>
	<li><a href="#synopsis">SYNOPSIS</a></li>
	<li><a href="#changes">CHANGES</a></li>
	<li><a href="#description">DESCRIPTION</a></li>
	<li><a href="#methods">METHODS</a></li>
	<li><a href="#how_are_the_clusters_output">HOW ARE THE CLUSTERS OUTPUT?</a></li>
	<li><a href="#required">REQUIRED</a></li>
	<li><a href="#examples">EXAMPLES</a></li>
	<li><a href="#export">EXPORT</a></li>
	<li><a href="#caveats">CAVEATS</a></li>
	<li><a href="#bugs">BUGS</a></li>
	<li><a href="#installation">INSTALLATION</a></li>
	<li><a href="#thanks">THANKS</a></li>
	<li><a href="#author">AUTHOR</a></li>
	<li><a href="#copyright">COPYRIGHT</a></li>
</ul>

<hr name="index" />
</div>
<!-- INDEX END -->

<p>
</p>
<h1><a name="name">NAME</a></h1>
<p>Algorithm::KMeans - Clustering multi-dimensional data with a pure-Perl implementation</p>
<p>
</p>
<hr />
<h1><a name="synopsis">SYNOPSIS</a></h1>
<pre>
  use Algorithm::KMeans;</pre>
<pre>
  #  First name the data file:</pre>
<pre>
  my $datafile = &quot;mydatafile.dat&quot;;</pre>
<pre>
  #  Next, set the mask to indicate which columns of the datafile to use for 
  #  clustering and which column contains a symbolic ID for each data record. For
  #  example, if the symbolic name is in the first column, you want the second column
  #  to be ignored, and you want the next three columns to be used for 3D clustering:</pre>
<pre>
  my $mask = &quot;N0111&quot;;</pre>
<pre>
  #  Now construct an instance of the clusterer.  The parameter K controls the number 
  #  of clusters.  If you know how many clusters you want (let's say 3), call</pre>
<pre>
  my $clusterer = Algorithm::KMeans-&gt;new( datafile        =&gt; $datafile,
                                          mask            =&gt; $mask,
                                          K               =&gt; 3,
                                          cluster_seeding =&gt; 'smart',
                                          terminal_output =&gt; 1,
                                          debug           =&gt; 0,
                                        );
 
  #  Note the choice for cluster_seeding. The choice 'smart' means that the clusterer
  #  will (1) subject the data to principal components analysis to determine the maximum
  #  variance direction; (2) project the data onto this direction; (3) find peaks in
  #  a smoothed histogram of the projected points; and (4) use the locations of
  #  the highest peaks as seeds for cluster centers.  The other value for the
  #  &quot;cluster_seeding&quot; option is 'random'.  If the 'smart' option produces bizarre
  #  results, try 'random'.  The default is 'smart'.</pre>
<pre>
  #  If you believe that the individual clusters in your data are not isotropic 
  #  (that is, you believe the variances within each cluster are significantly different 
  #  along the different dimensions), you may wish for the clusterer to first normalize 
  #  the data along each dimension with an estimate for the standard-deviations along 
  #  that dimension and then carry out clustering.  What estimate to use for such standard 
  #  deviations obviously becomes an issue unto itself.  In the current implementation, 
  #  we use overall data standard-deviation along each dimension as the estimate.  
  #  BUT BEWARE THAT IF THE DATA VARIANCE IS CAUSED MORE BY THE SEPARATION BETWEEN THE 
  #  MEANS THAN BY THE INTRA-CLUSTER VARIABILITY, THE DATA NORMALIZATION BY THE STANDARD 
  #  DEVIATIONS COULD ACTUALLY DECREASE THE PERFORMANCE OF THE CLUSTERER.  Here is an 
  #  example call to the constructor for turning on the data normalization:</pre>
<pre>
  my $clusterer = Algorithm::KMeans-&gt;new( datafile =&gt; $datafile,
                                          mask     =&gt; $mask,
                                          K        =&gt; 3,
                                          terminal_output =&gt; 1,
                                          do_variance_normalization =&gt; 1,
                                        );</pre>
<pre>
  #  Set K to 0 if you want the module to figure out the optimum number of clusters 
  #  from the data. (It is best to run this option with the terminal_output set to 
  #  1 so that you can see the different value of QoC for the different K):</pre>
<pre>
  my $clusterer = Algorithm::KMeans-&gt;new( datafile =&gt; $datafile,
                                          mask     =&gt; $mask,
                                          K        =&gt; 0,
                                          terminal_output =&gt; 1,
                                        );</pre>
<pre>
  #  Although not shown above, you can obviously set the 'do_variance_normalization' 
  #  flag here also if you wish.</pre>
<pre>
  #  For very large data files, setting K to 0 will result in searching through 
  #  too many values for K.  For such cases, you can range limit the values of K to 
  #  search through by</pre>
<pre>
  my $clusterer = Algorithm::KMeans-&gt;new( datafile =&gt; $datafile,
                                          mask     =&gt; &quot;N111&quot;,
                                          Kmin     =&gt; 3,
                                          Kmax     =&gt; 10,
                                          terminal_output =&gt; 1,
                                        );</pre>
<pre>
  #  Use the following call if you wish for the clusters to be written out to files. 
  #  Each cluster will be deposited in a file named 'ClusterX.dat' with X starting 
  #  from 0:</pre>
<pre>
  my $clusterer = Algorithm::KMeans-&gt;new( datafile =&gt; $datafile,
                                          mask     =&gt; $mask,
                                          K        =&gt; $K,
                                          write_clusters_to_files =&gt; 1,
                                        );</pre>
<pre>
  #  FOR ALL CASES ABOVE, YOU'D NEED TO MAKE THE FOLLOWING CALLS ON THE CLUSTERER 
  #  INSTANCE TO ACTUALLY CLUSTER THE DATA:</pre>
<pre>
  $clusterer-&gt;read_data_from_file();
  $clusterer-&gt;kmeans();</pre>
<pre>
  #  If you want to directly access the clusters and the cluster centers in your 
  #  top-level script:</pre>
<pre>
  my ($clusters, $cluster_centers) = $clusterer-&gt;kmeans();</pre>
<pre>
  #  You can now access the symbolic data names in the clusters directly, as in:</pre>
<pre>
  foreach my $cluster (@$clusters) {
      print &quot;Cluster:   @$cluster\n\n&quot;
  }</pre>
<pre>
  # CLUSTER VISUALIZATION:</pre>
<pre>
  #  You must first set the mask for cluster visualization. This mask tells the 
  #  module which 2D or 3D subspace of the original data space you wish to visualize 
  #  the clusters in:</pre>
<pre>
  my $visualization_mask = &quot;111&quot;;
  $clusterer-&gt;visualize_clusters($visualization_mask);</pre>
<pre>
  # SYNTHETIC DATA GENERATION:</pre>
<pre>
  #  The module has been provided with a class method for generating multivariate 
  #  data for experimenting with clustering.  The data generation is controlled 
  #  by the contents of the parameter file that is supplied as an argument to the 
  #  data generator method.  The mean and covariance matrix entries in the parameter 
  #  file must be according to the syntax shown in the param.txt file in the examples 
  #  directory. It is best to edit this file as needed:</pre>
<pre>
  my $parameter_file = &quot;param.txt&quot;;
  my $out_datafile = &quot;mydatafile.dat&quot;;
  Algorithm::KMeans-&gt;cluster_data_generator(
                          input_parameter_file =&gt; $parameter_file,
                          output_datafile =&gt; $out_datafile,
                          number_data_points_per_cluster =&gt; $N );</pre>
<p>
</p>
<hr />
<h1><a name="changes">CHANGES</a></h1>
<p>Version 1.40 includes a <code>smart</code> option for seeding the
clusters.  This option, supplied through the constructor
parameter <code>cluster_seeding</code>, means that the clusterer will
(1) Subject the data to principal components analysis in
order to determine the maximum variance direction; (2)
Project the data onto this direction; (3) Find peaks in a
smoothed histogram of the projected points; and (4) Use the
locations of the highest peaks as initial guesses for the
cluster centers.  If you don't want to use this option, set
<code>cluster_seeding</code> to <code>random</code>. That should work as in the
previous version of the module.</p>
<p>Version 1.30 includes a bug fix for the case when the
datafile contains empty lines, that is, lines with no data
records.  Another bug fix in Version 1.30 deals with the
case when you want the module to figure out how many
clusters to form (this is the <code>K=0</code> option in the
constructor call) and the number of data records is close to
the minimum.</p>
<p>Version 1.21 includes fixes to handle the possibility that,
when clustering the data for a fixed number of clusters, a
cluster may become empty during iterative calculation of
cluster assignments of the data elements and the updating of
the cluster centers.  The code changes are in the
<code>assign_data_to_clusters()</code> and <code>update_cluster_centers()</code>
subroutines.</p>
<p>Version 1.20 includes an option to normalize the data with
respect to its variability along the different coordinates
before clustering is carried out.  This can be a useful
option for highly non-isotropic data, that is, the data in
which the different coordinate values along the different
dimensions vary differently.  (BUT BEWARE THAT IF THE
OVERALL DATA VARIANCE ALONG A DIMENSION IS CAUSED MORE BY
THE SEPARATION BETWEEN THE MEANS THAN BY THE INTRA-CLUSTER
VARIABILITY, THE DATA NORMALIZATION OF THE SORT IN VERSION
1.20 COULD ACTUALLY DECREASE THE PERFORMANCE OF THE
CLUSTERER.)  With version 1.20, you can also visualize the
raw data and the normed data to see the effects of data
normalization.  Another reason for Version 1.20 is to get
away from multi-part version numbers like 1.x.x.  As I
discovered (thanks to an email from Steffen Mueller), it is
never a good idea to mix version numbers like 1.1, which
look like regular floating-point numbers to Perl, and
multi-part version numbers like 1.1.1 (which Perl interprets
as 1.001001).</p>
<p>Version 1.1.1 allows for range limiting the values of <code>K</code>
to search through.  <code>K</code> stands for the number of clusters
to form.  This version also declares the module dependencies
in the <code>Makefile.PL</code> file.</p>
<p>Version 1.1 is a an object-oriented version of the
implementation presented in version 1.0.  The current
version should lend itself more easily to code extension.
You could, for example, create your own class by subclassing
from the class presented here and, in your subclass, use
your own criteria for the similarity distance between the
data points and for the QoC (Quality of Clustering) metric,
and, possibly a different rule to stop the iterations.
Version 1.1 also allows you to directly access the clusters
formed and the cluster centers in your calling script.</p>
<p>
</p>
<hr />
<h1><a name="description">DESCRIPTION</a></h1>
<p><strong>Algorithm::KMeans</strong> is a <em>perl5</em> module for the clustering
of numerical data in multidimensional spaces.  Since the
module is entirely in Perl (in the sense that it is not a
Perl wrapper around a C library that actually does the
clustering), the code in the module can easily be modified
to experiment with several aspects of automatic clustering.
For example, one can change the criterion used to measure
the &quot;distance&quot; between two data points, the stopping
condition for accepting final clusters, the criterion used
for measuring the quality of the clustering achieved, etc.</p>
<p>A K-Means clusterer is a poor man's implementation of the EM
algorithm.  EM stands for Expectation Maximization. For the
case of isotropic Gaussian data, the results obtained with a
good K-Means implementation should match those obtained with
the EM algorithm.  (When the data is non-isotropic but the
nature of anisotropy is the same for all the clusters, the
results you obtain with a K-Means clusterer may be improved
--- but only under certain circumstances --- by first
normalizing the data appropriately, as can done with the
implementation shown here when you set the
<code>do_variance_normalization</code> option in the KMeans
constructor.  But, as pointed out elsewhere in this
documentation, such normalization may actually decrease the
performance of the clusterer if the overall data variability
along any dimension is more a result of the separation
between the means than a consequence of intra-cluster
variability.)  Clustering with K-Means takes place
iteratively and involves two steps: 1) assignment of data
samples to clusters; and 2) Recalculation of the cluster
centers.  The assignment step can be shown to be akin to the
Expectation step of the EM algorithm, and the calculation of
the cluster centers akin to the Maximization step of the EM
algorithm.</p>
<p>Of the two key steps of the K-Means algorithm, the
assignment step consists of assigning each data point to
that cluster from whose center the data point is the
closest.  That is, during assignment, you compute the
distance between the data point and each of the current
cluster centers.  You assign the data sample on the basis of
the minimum value of the computed distance.  The second step
consists of re-computing the cluster centers for the newly
modified clusters.</p>
<p>Obviously, before the two-step approach can proceed, we need
to initialize the both the cluster center values and the
clusters that can then be iteratively modified by the
two-step algorithm.  How this initialization is carried out
is very important.  Starting with Version 1.40, you now have
two very different ways for carrying out this
initialization.  The default option, called the <code>smart</code>
option, consists of subjecting the data to principal
components analysis to discover the direction of maximum
variance in the data space.  The data points are then
projected on to this direction and a histogram constructed
from the projections.  Centers of the smoothed histogram are
used to seed the clustering operation.  The other option,
which is the older option, is to choose the cluster centers
purely randomly.  You get the first option if you set
<code>cluster_seeding</code> to <code>smart</code> in the constructor, and you get
the second option if you set it to <code>random</code>.</p>
<p>How to specify K is one of the most vexing issues in any
approach to clustering.  In some case, we can set K on the
basis of prior knowledge.  But, more often than not, no such
prior knowledge is available.  When the programmer does not
explicitly specify a value for K, the approach taken in the
current implementation is to try all possible values between
2 and some largest possible value that makes statistical
sense.  We then choose that value for K which yields the
best value for the QoC (Quality of Clustering) metric.  It
is generally believed that the largest value for K should
not exceed sqrt(N/2) where N is the number of data point to
be clustered.</p>
<p>How to set the QoC metric is obviously a critical issue unto
itself.  In the current implementation, the value of QoC is
a ratio of the average radius of the clusters and the
average distance between the cluster centers.  But note that
this is a good criterion only when the data exhibits the
same variance in all directions.  When the data variance is
different directions, but still remains the same for all
clusters, a more appropriate QoC can be formulated using
other distance metrics such as the Mahalanobis distance.</p>
<p>Every iterative algorithm requires a stopping criterion.
The criterion implemented here is that we stop iterations
when there is no re-assignment of the data points during the
assignment step.</p>
<p>Ordinarily, the output produced by a K-Means clusterer will
correspond to a local minimum for the QoC values, as opposed
to a global minimum.  The current implementation protects
against that when the clusterer constructor is called with
the <code>random</code> option for <code>cluster_seeding</code>, but only in a
very small way, by trying different randomly selected
initial cluster centers and then selecting the one that
gives the best overall QoC value.</p>
<p>
</p>
<hr />
<h1><a name="methods">METHODS</a></h1>
<p>The module provides the following methods for clustering,
for cluster visualization, for data visualization, and for
the generation of data for testing a clustering algorithm:</p>
<dl>
<dt><strong><a name="new" class="item"><strong>new()</strong></a></strong></dt>

<dd>
<pre>
    my $clusterer = Algorithm::KMeans-&gt;new(datafile        =&gt; $datafile,
                                           mask            =&gt; $mask,
                                           K               =&gt; $K,
                                           cluster_seeding =&gt; 'smart',
                                           terminal_output =&gt; 1,     
                                           write_clusters_to_files =&gt; 1,
                                           debug           =&gt; 0,
                                          );</pre>
<p>A call to <a href="#new"><code>new()</code></a> constructs a new instance of the
<code>Algorithm::KMeans</code> class.  When <code>$K</code> is a non-zero
positive integer, the module will construct exactly that
many clusters.  However, when <code>$K</code> is 0, the module will
find the best number of clusters to partition the data into.
As explained in the Description, setting <code>cluster_seeding</code> to
<code>smart</code> causes PCA (principal components analysis) to be
used for discovering the best choices for the initial
cluster centers.  If you want purely random decisions to be
made for the initial choices for the cluster centers, set
<code>cluster_seeding</code> to <code>random</code>.</p>
<p>The data file is expected to contain entries in the
following format</p>
<pre>
   c20  0  10.7087017086940  9.63528386251712  10.9512155258108  ...
   c7   0  12.8025925026787  10.6126270065785  10.5228482095349  ...
   b9   0  7.60118206283120  5.05889245193079  5.82841781759102  ...
   ....
   ....</pre>
<p>where the first column contains the symbolic ID tag for each
data record and the rest of the columns the numerical
information.  As to which columns are actually used for
clustering is decided by the string value of the mask.  For
example, if we wanted to cluster on the basis of the entries
in just the 3rd, the 4th, and the 5th columns above, the
mask value would be <code>N0111</code> where the character <code>N</code>
indicates that the ID tag is in the first column, the
character <code>0</code> that the second column is to be ignored, and
the <code>1</code>'s that follow that the 3rd, the 4th, and the 5th
columns are to be used for clustering.</p>
<p>The parameter <code>terminal_output</code> is boolean; when not
supplied in the call to <a href="#new"><code>new()</code></a> it defaults to 0.  When set,
this parameter determines what you will see on the terminal
screen of the window in which you make these method calls.
When set to 1, you will see on the terminal screen the
different clusters as lists of the symbolic IDs and their
cluster centers. You will also see the QoC (Quality of
Clustering) value for the clusters displayed.</p>
<p>The parameter <code>write_clusters_to_files</code> is boolean; when
not supplied in the call to <a href="#new"><code>new()</code></a>, it defaults to 0.  When
set to 1, the clusters are written out to files named</p>
<pre>
     Cluster0.dat 
     Cluster1.dat 
     Cluster2.dat
     ...
     ...</pre>
<p>Before the clusters are written to these files, the module
destroys all files with such names in the directory in which
you call the module.</p>
<p>If you wish for the clusterer to search through a
<code>(Kmin,Kmax)</code> range of values for <code>K</code>, the constructor
should be called in the following fashion:</p>
<pre>
    my $clusterer = Algorithm::KMeans-&gt;new(datafile =&gt; $datafile,
                                           mask     =&gt; $mask,
                                           Kmin     =&gt; 3,
                                           Kmax     =&gt; 10,
                                           cluster_seeding =&gt; 'smart',
                                           terminal_output =&gt; 1,     
                                           debug    =&gt; 0,
                                          );</pre>
<p>where obviously you can choose any reasonable values for
<code>Kmin</code> and <code>Kmax</code>.  If you choose a value for <code>Kmax</code> that
is statistically too large, the module will let you
know. Again, you may choose <code>random</code> for
<code>cluster_seeding</code>, the default value being <code>smart</code>.</p>
<p>If you believe that the individual clusters in your data are
very anisotropic (that is, you believe that intra-cluster
variability in your data is different along the different
dimensions), you might get better clustering by first
normalizing the data coordinates by the standard-deviations
along those directions.  But how to use a reasonable value
for such a standard-deviation becomes a big issue unto
itself.  (The implementation shown here uses the overall
data standard-deviation along a direction for the
normalization in that direction.  As mentioned elsewhere in
the documentation, such a normalization could backfire on
you if the data variability along a dimension is more a
result of the separation between the means than a
consequence of the intra-cluster variability.)  You can turn
on the data normalization by turning on the
<code>do_variance_normalization</code> option in the constructor, as
in</p>
<pre>
    my $clusterer = Algorithm::KMeans-&gt;new( datafile =&gt; $datafile,
                                            mask     =&gt; &quot;N111&quot;,   
                                            K        =&gt; 2,        
                                            terminal_output =&gt; 1,
                                            do_variance_normalization =&gt; 1,
    );</pre>
</dd>
<dt><strong><a name="read_data_from_file" class="item"><strong>read_data_from_file()</strong></a></strong></dt>

<dd>
<pre>
    $clusterer-&gt;read_data_from_file()</pre>
</dd>
<dt><strong><a name="kmeans" class="item"><strong>kmeans()</strong></a></strong></dt>

<dd>
<pre>
    $clusterer-&gt;kmeans();</pre>
<pre>
    or</pre>
<pre>
    my ($clusters, $cluster_centers) = $clusterer-&gt;kmeans();</pre>
<p>The first call above works solely by side-effect.  The
second call also returns the clusters and the cluster
centers.</p>
</dd>
<dt><strong><a name="get_k_best" class="item"><strong>get_K_best()</strong></a></strong></dt>

<dd>
<pre>
    $clusterer-&gt;get_K_best();</pre>
<p>This call makes sense only if you supply either the <code>K=0</code>
option to the constructor, or you specify values for the
<code>Kmin</code> and <code>Kmax</code> options. The <code>K=0</code> and the
<code>(Kmin,Kmax)</code> options cause the KMeans algorithm to figure
out on its own the best value for <code>K</code>.  Remember, <code>K</code> is the
number of clusters the data is partitioned into.</p>
</dd>
<dt><strong><a name="show_qoc_values" class="item"><strong>show_QoC_values()</strong></a></strong></dt>

<dd>
<pre>
    $clusterer-&gt;show_QoC_values();</pre>
<p>presents a table with <code>K</code> values in the left column and the
corresponding QoC (Quality-of-Clustering) values in the
right column.  Note that this call makes sense only if you
either supply the <code>K=0</code> option to the constructor, or you
specify values for the <code>Kmin</code> and <code>Kmax</code> options.</p>
</dd>
<dt><strong><a name="visualize_clusters" class="item"><strong>visualize_clusters()</strong></a></strong></dt>

<dd>
<pre>
    $clusterer-&gt;visualize_clusters( $visualization_mask )</pre>
<p>The visualization mask here does not have to be identical to
the one used for clustering, but must be a subset of that
mask.  This is convenient for visualizing the clusters in
two- or three-dimensional subspaces of the original space.</p>
</dd>
<dt><strong><a name="visualize_data" class="item"><strong>visualize_data()</strong></a></strong></dt>

<dd>
<pre>
    $clusterer-&gt;visualize_data($visualization_mask, 'original');</pre>
<pre>
    $clusterer-&gt;visualize_data($visualization_mask, 'normed');</pre>
<p>This method requires a second argument and, as shown, it
must be either the string <code>original</code> or the string
<code>normed</code>, the former for the visualization of the raw
data and the latter for the visualization of the data after
its different dimensions are normalized by the
standard-deviations along those directions.  If you call the
method with the second argument set to <code>normed</code>, but do
so without turning on the <code>do_variance_normalization</code>
option in the KMeans constructor, it will let you know.</p>
</dd>
<dt><strong><a name="cluster_data_generator" class="item"><strong>cluster_data_generator()</strong></a></strong></dt>

<dd>
<pre>
    Algorithm::KMeans-&gt;cluster_data_generator(
                            input_parameter_file =&gt; $parameter_file,
                            output_datafile =&gt; $out_datafile,
                            number_data_points_per_cluster =&gt; 20 );</pre>
<p>for generating multivariate data for clustering if you wish
to play with synthetic data for clustering.  The input
parameter file contains the means and the variances for the
different Gaussians you wish to use for the synthetic data.
See the file <code>param.txt</code> provided in the examples
directory.  It will be easiest for you to just edit this
file for your data generation needs.  In addition to the
format of the parameter file, the main constraint you need
to observe in specifying the parameters is that the
dimensionality of the covariance matrix must correspond to
the dimensionality of the mean vectors.  The multivariate
random numbers are generated by calling the <code>Math::Random</code>
module.  As you would expect, this module requires that the
covariance matrices you specify in your parameter file be
symmetric and positive definite.  Should the covariances in
your parameter file not obey this condition, the
<code>Math::Random</code> module will let you know.</p>
</dd>
</dl>
<p>
</p>
<hr />
<h1><a name="how_are_the_clusters_output">HOW ARE THE CLUSTERS OUTPUT?</a></h1>
<p>When the option <code>terminal_output</code> is set in the call to the
constructor, the clusters are displayed on the terminal
screen.</p>
<p>When the option <code>write_clusters_to_files</code> is set in the
call to the constructor, the module dumps the clusters in
files named</p>
<pre>
    Cluster0.dat
    Cluster1.dat
    Cluster2.dat
    ...
    ...</pre>
<p>in the directory in which you execute the module.  The
number of such files will equal the number of clusters
formed.  All such existing files in the directory are
destroyed before any fresh ones are created.  Each cluster
file contains the symbolic ID tags of the data points in
that cluster.</p>
<p>
</p>
<hr />
<h1><a name="required">REQUIRED</a></h1>
<p>This module requires the following three modules:</p>
<pre>
   Math::Random
   Graphics::GnuplotIF
   Math::GSL</pre>
<p>the first for generating the multivariate random numbers,
the second for the visualization of the clusters, and the
last for access to the Perl wrappers for the GNU Scientific
Library.  The <code>Matrix</code> module of this library is used for
the PCA of the data when clustering is done with the
<code>smart</code> mode for cluster seeding.</p>
<p>
</p>
<hr />
<h1><a name="examples">EXAMPLES</a></h1>
<p>See the examples directory in the distribution for how to
make calls to the clustering and the visualization methods.
The examples directory also includes a parameter file,
param.txt, for generating synthetic data for clustering.
Just edit this file if you would like to generate your own
multivariate data for clustering.  The parameter file is for
the 3D case, but you can generate data with any
dimensionality through appropriate entries in the parameter
file.</p>
<p>
</p>
<hr />
<h1><a name="export">EXPORT</a></h1>
<p>None by design.</p>
<p>
</p>
<hr />
<h1><a name="caveats">CAVEATS</a></h1>
<p>Please note that this clustering module is not meant for
very large datafiles.  Being an all-Perl implementation, the
goal here is not the speed of execution.  On the contrary,
the goal is to make it easy to experiment with the different
facets of K-Means clustering.  If you need to process a
large data file, you'd be better off with a module like
Algorithm::Cluster.  However note that when you use a
wrapper module in which it is a C library that is actually
doing the job of clustering for you, it is more difficult to
experiment with the various aspects of clustering.  At the
least, you have to recompile the code for every change you
make to the source code of a low-level library.  You are
spared that frustration with an all-Perl implementation.</p>
<p>Clustering usually does not work well when the data is
highly anisotropic, that is, when the data has very
different variances along its different dimensions.  This
problem becomes particularly severe when the different
clusters you expect to see in the data have <em>non-uniform</em>
anisotropies.  When the anisotropies are uniform, one can
try to improve the performance of a clusterer by first
normalizing the data coordinates along a direction by an
average of the intra-cluster standard-deviations along that
direction.  But how to obtain even a rough estimate of such
standard deviations leads you to chicken-and-egg sort of
problems.  The current implementation takes the low road
and, when you turn on the data normalization in the KMeans
constructor, normalizes each data coordinate value by the
overall data standard deviation along that direction.
However, as described elsewhere, this may actually reduce
the performance of the clusterer if the data variability
along a direction is more a result of the separation between
the means than because of intra-cluster variability.  For
better clustering, one could also try to cluster the data in
a low-dimensional space formed by a principal components
analysis of the data.  Depending on how the current module
is received, its future versions may include that
enhancement.</p>
<p>
</p>
<hr />
<h1><a name="bugs">BUGS</a></h1>
<p>Please notify the author if you encounter any bugs.  When
sending email, please place the string 'KMeans' in the
subject line.</p>
<p>
</p>
<hr />
<h1><a name="installation">INSTALLATION</a></h1>
<p>The usual</p>
<pre>
    perl Makefile.PL
    make
    make test
    make install</pre>
<p>if you have root access.  If not,</p>
<pre>
    perl Makefile.PL prefix=/some/other/directory/
    make
    make test
    make install</pre>
<p>
</p>
<hr />
<h1><a name="thanks">THANKS</a></h1>
<p>It was an email from Nadeem Bulsara that prompted me to
create Version 1.40 of this module.  Working with Version
1.30, Nadeem noticed that occasionally the module would
produce variable clustering results on the same dataset.  I
believe that this variability was caused (at least partly)
by the purely random mode that was used in Version 1.30 for
the seeding of the cluster centers.  Version 1.40 now
includes a <code>smart</code> mode. With the new mode the clusterer
uses a PCA (Principal Components Analysis) of the data to
make good guesses for the cluster centers.  However,
depending on how the data is jumbled up, it is possible that
the new mode will not produce uniformly good results in all
cases.  So you can still use the old mode by setting
<code>cluster_seeding</code> to <code>random</code> in the constructor.
Thanks Nadeem for your feedback!</p>
<p>Version 1.30 resulted from Martin Kalin reporting problems
with a very small data set. Thanks Martin!</p>
<p>Version 1.21 came about in response to the problems
encountered by Luis Fernando D'Haro with version 1.20.
Although the module would yield the clusters for some of its
runs, more frequently than not the module would abort with
an &quot;empty cluster&quot; message for his data. Luis Fernando has
also suggested other improvements (such as clustering
directly from the contents of a hash) that I intend to make
in future versions of this module.  Thanks Luis Fernando.</p>
<p>Chad Aeschliman was kind enough to test out the interface of
this module and to give suggestions for its improvement.  His
key slogan: &quot;If you cannot figure out how to use a module in
under 10 minutes, it's not going to be used.&quot;  That should
explain the longish Synopsis included here.</p>
<p>
</p>
<hr />
<h1><a name="author">AUTHOR</a></h1>
<p>Avinash Kak, <a href="mailto:kak@purdue.edu">kak@purdue.edu</a></p>
<p>If you send email, please place the string &quot;KMeans&quot; in your
subject line to get past my spam filter.</p>
<p>
</p>
<hr />
<h1><a name="copyright">COPYRIGHT</a></h1>
<p>This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.</p>
<pre>
 Copyright 2012 Avinash Kak</pre>

</body>

</html>