Math::SimpleHisto::XS - Simple histogramming, but kinda fast
    use Math::SimpleHisto::XS;
    my $hist = Math::SimpleHisto::XS->new(
      min => 10, max => 20, nbins => 1000,
    );

    $hist->fill($x);
    $hist->fill($x, $weight);
    $hist->fill(\@xs);
    $hist->fill(\@xs, \@ws);

    my $data_bins   = $hist->all_bin_contents; # get bin contents as array ref
    my $bin_centers = $hist->bin_centers;      # ditto for the bin centers
This module implements simple 1D histograms with fixed or variable bin size. The implementation is mostly in C with a thin Perl layer on top.
If this module isn't powerful enough for your histogramming needs, have a look at the powerful-but-experimental SOOT module or submit a patch.
The lower bin boundary is considered part of the bin. The upper bin boundary is considered part of the next bin or overflow.
Bin numbering starts at 0.
Nothing is exported by this module into the calling namespace by default. You can choose to export the following constants:
INTEGRAL_CONSTANT
Or you can use the import tag ':all' to import all.
This module implements histograms with both fixed and variable bin sizes. Fixed bin size means that all bins in the histogram have the same size. Implementation-wise, this means that finding a bin in the histogram, for example for filling, takes constant time (O(1)).
For variable width histograms, each bin can have a different size. Finding a bin is implemented with a binary search, which has logarithmic run-time complexity in the number of bins (O(log n)).
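The two lookup strategies can be sketched in plain Perl. This is only an illustration of the idea, not the module's C implementation; the helper names find_bin_fixed and find_bin_variable are made up for this example.

```perl
use strict;
use warnings;
use POSIX qw(floor);

# Fixed-width lookup: one subtraction and one division, O(1).
# Returns undef for coordinates outside [min, max).
sub find_bin_fixed {
    my ($x, $min, $max, $nbins) = @_;
    return undef if $x < $min or $x >= $max;
    my $width = ($max - $min) / $nbins;
    my $ibin  = floor(($x - $min) / $width);
    $ibin = $nbins - 1 if $ibin >= $nbins;    # guard against rounding
    return $ibin;
}

# Variable-width lookup: binary search over the nbins + 1 boundaries, O(log n).
sub find_bin_variable {
    my ($x, $bounds) = @_;
    return undef if $x < $bounds->[0] or $x >= $bounds->[-1];
    my ($lo, $hi) = (0, $#$bounds);    # invariant: bounds[lo] <= x < bounds[hi]
    while ($hi - $lo > 1) {
        my $mid = int(($lo + $hi) / 2);
        if ($x >= $bounds->[$mid]) { $lo = $mid } else { $hi = $mid }
    }
    return $lo;
}

print find_bin_fixed(12.5, 10, 20, 10), "\n";
print find_bin_variable(2.5, [1.5, 2.5, 4.0, 6.0, 8.5]), "\n";
```

Both helpers treat the lower boundary as part of the bin and the upper boundary as part of the next bin, matching the convention described above.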
new
Constructor, takes named arguments. In order to create a fixed bin size histogram, the following parameters are mandatory:
min
The lower boundary of the histogram.
max
The upper boundary of the histogram.
nbins
The number of bins in the histogram.
On the other hand, to create a histogram with variable bin widths, you must provide only the bins parameter with a reference to an array of nbins + 1 bin boundaries. For example,
my $hist = Math::SimpleHisto::XS->new( bins => [1.5, 2.5, 4.0, 6.0, 8.5] );
creates a histogram with four bins:
[1.5, 2.5) [2.5, 4.0) [4.0, 6.0) [6.0, 8.5)
fill
Fill data into the histogram. Takes one or two arguments. The first must be the coordinate that determines where data is to be added to the histogram. The second is optional and can be a weight for the data to be added. It defaults to 1.
If the coordinate is a reference to an array, it is assumed to contain many data points that are to be filled into the histogram. In this case, if the weight is used, it must also be a reference to an array of weights.
fill_by_bin
Fills data into the histogram and works like fill(), but the first argument (the value(s)) must be bin numbers instead of coordinates.
min, max, nbins, width, highest_bin
Return static histogram attributes: minimum coordinate, maximum coordinate, number of bins, total width of the histogram, and the index of the highest bin in the histogram (which is just nbins - 1).
underflow, overflow
Return the accumulated contents of the under- and overflow bins (which cover the ranges (-inf, min) and [max, inf) respectively).
total
The total sum of weights that have been filled into the histogram, excluding under- and overflow.
nfills
The total number of fill operations (currently including fills that fill into under- and overflow, but this is subject to change).
binsize
Returns the size of a bin. For histograms with variable width bin sizes, the size of the bin with the provided index is returned (defaults to the first bin). Example:
$hist->binsize(12);
Returns the size of the 13th bin.
all_bin_contents, bin_content
$hist->all_bin_contents() returns the contents of all histogram bins as a reference to an array. This is not the internal storage but a copy.
$hist->bin_content($ibin) returns the content of a single bin.
bin_centers, bin_center
$hist->bin_centers() returns a reference to an array containing the coordinates of all bin centers.
$hist->bin_center($ibin) returns the coordinate of the center of a single bin.
bin_lower_boundaries, bin_lower_boundary
Same as bin_centers and bin_center respectively, but for the lower boundary coordinate(s) of the bin(s). Note that this lower boundary is considered part of the bin.
bin_upper_boundaries, bin_upper_boundary
Same as bin_centers and bin_center respectively, but for the upper boundary coordinate(s) of the bin(s). Note that this upper boundary is not considered part of the bin.
find_bin
$hist->find_bin($x) returns the bin number of the bin in which the given coordinate falls. Returns undef if the coordinate is outside the histogram range.
set_bin_content
$hist->set_bin_content($ibin, $content) sets the content of a single bin.
set_underflow, set_overflow
$hist->set_underflow($content) sets the content of the underflow bin. set_overflow does the same for the overflow bin.
set_nfills
$hist->set_nfills($n) sets the number of fills.
set_all_bin_contents
Given a reference to an array containing numbers, sets the contents of each bin in the histogram to the number in the respective array element. The number of elements must match the number of bins in the histogram.
clone, new_alike
$hist->clone() clones the object entirely.
$hist->new_alike() clones the parameters of the object, but resets the contents of the clone.
new_from_bin_range, new_alike_from_bin_range
$hist->new_from_bin_range($first_bin, $last_bin) creates a copy of the histogram including all bins from $first_bin to $last_bin. For example, $hist->new_from_bin_range(50, 199) would create a new histogram with 150 bins (the range is inclusive!) and copy the respective data from the original histogram. All bin contents outside the range will be added to the under- or overflow respectively. Specifying a last bin above the highest bin number of the source histogram yields a new histogram running up to the highest bin of the source.
$hist->new_alike_from_bin_range($first_bin, $last_bin) does the same, but resets all contents (like new_alike).
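The bookkeeping of the bin range cut can be illustrated on a plain array of bin contents. This is a rough sketch only; cut_bin_range is a hypothetical helper and not part of the module's API.

```perl
use strict;
use warnings;
use List::Util qw(sum min);

# Hypothetical helper: keep bins $first..$last (inclusive) of a plain
# array of bin contents and fold everything outside that range into the
# under- and overflow of the result.
sub cut_bin_range {
    my ($contents, $underflow, $overflow, $first, $last) = @_;
    $last = min($last, $#$contents);    # clamp to the highest source bin
    my @kept      = @{$contents}[$first .. $last];
    my $new_under = $underflow + sum(0, @{$contents}[0 .. $first - 1]);
    my $new_over  = $overflow  + sum(0, @{$contents}[$last + 1 .. $#$contents]);
    return (\@kept, $new_under, $new_over);
}

my ($kept, $under, $over) = cut_bin_range([1, 2, 3, 4, 5], 10, 20, 1, 3);
print "kept: @$kept, underflow: $under, overflow: $over\n";
```

Note that the sum of bins, underflow and overflow is the same before and after the cut.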
rebin
Given a rebinning factor, clones the current histogram and modifies it to have $rebin_factor times fewer bins. You can only rebin by factors that divide the number of bins of the input histogram.
For example, you can rebin a histogram with 200 bins by a factor of 10. This results in a histogram with 20 bins. You cannot rebin the same histogram by a factor of 7 because 7 does not divide 200 without remainder.
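As a rough illustration of the rule above, rebinning a plain array of bin contents amounts to summing consecutive groups of bins. The helper name rebin_contents is made up for this sketch, which only mirrors the documented behaviour and not the module's internals.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: merge every $factor consecutive bins into one.
# Dies if $factor does not divide the number of bins without remainder.
sub rebin_contents {
    my ($contents, $factor) = @_;
    my $nbins = @$contents;
    die "Rebin factor $factor does not divide $nbins bins\n"
        if $factor < 1 or $nbins % $factor;
    my @rebinned;
    for (my $i = 0; $i < $nbins; $i += $factor) {
        push @rebinned, sum(@{$contents}[$i .. $i + $factor - 1]);
    }
    return \@rebinned;
}

my $coarse = rebin_contents([1, 2, 3, 4, 5, 6], 2);
print "@$coarse\n";
```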
add_histogram
Given another histogram object, this method will add the content of that object to the invocant's content. This works only if the binning of the histograms is exactly the same. Throws an exception if that is not the case.
subtract_histogram
Given another histogram object, this method will subtract the content of that object from the invocant's content. This works only if the binning of the histograms is exactly the same. Throws an exception if that is not the case.
integral
Returns the integral over the histogram. Very limited at this point. Usage:
my $integral = $hist->integral($from, $to, TYPE);
Where $from and $to are the integration limits and the optional TYPE is a constant indicating the method to use for integration. Currently, only INTEGRAL_CONSTANT is implemented (and assumed as the default). This means that the bins will be treated as rectangles, but fractional bins are treated correctly.
If the integration limits are outside the histogram boundaries, there is no warning. The integration is silently performed within the range of the histogram.
mean
Calculates the mean of the histogram contents.
Note that the result is not usually the same as if you calculated the mean of the input data directly due to the effect of the binning.
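The effect of the binning can be seen in a small pure-Perl sketch. It assumes that each bin's content is treated as if it sat at the bin center; that is an assumption made for illustration and not a statement about the module's exact implementation. The helper binned_mean is hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Assumption: every entry in a bin is counted at the bin center.
sub binned_mean {
    my ($contents, $centers) = @_;
    my $total = sum(@$contents);
    return undef unless $total;
    my $weighted = sum(map { $contents->[$_] * $centers->[$_] } 0 .. $#$contents);
    return $weighted / $total;
}

my @data     = (1.1, 1.2, 1.3, 2.9);
my $raw_mean = sum(@data) / @data;    # mean of the raw values

# The same data in three unit-width bins covering [1, 4)
my $hist_mean = binned_mean([3, 1, 0], [1.5, 2.5, 3.5]);

printf "raw: %.3f  binned: %.3f\n", $raw_mean, $hist_mean;
```

The two values differ because the histogram no longer knows where within a bin each data point was located.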
standard_deviation
Calculates the standard deviation of the histogram contents.
Note that the result is not usually the same as if you calculated the std. dev. of the input data directly due to the effect of the binning.
The first parameter may be the previously calculated mean to avoid recalculating it. If not provided, it will be calculated on the fly.
median
Calculates and returns the estimated median of the data in the histogram. Achieves sub-bin-size resolution by estimating the median position within the bin from the sum of data below and above the median bin.
The estimation is necessary since the true median requires the original data.
median_absolute_deviation
WARNING: this is apparently still prone to crashing when facing unusual data!
Calculates and returns an estimate of the median absolute deviation (MAD) of the histogram. This is a fairly expensive operation.
Optionally, as an optimization, you can pass in the previously calculated median estimate of the histogram to prevent it from having to be recalculated. Make sure you pass in the correct value or the behaviour of this method is undefined and might even crash your perl!
normalize
Normalizes the histogram to the parameter of the $hist->normalize($total)
call. Normalization defaults to 1
.
cumulative
Calculates the cumulative histogram of the invocant histogram and returns it as a new histogram object.
The cumulative (if done in Perl) is:

    for my $i (0..$n) {
      $content[$i] = sum(map $original_content[$_], 0..$i);
    }

As a convenience, if a numeric argument is passed to the method, the OUTPUT histogram will be normalized using that number BEFORE calculating the cumulation. This means that

    my $cumu = $histo->cumulative(1.);

gives a cumulative histogram where the last bin contains exactly 1.
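The same result can be obtained with a single running total instead of re-summing for every bin. The following sketch works on a plain array of bin contents; cumulative_contents is a hypothetical helper, and the optional second argument imitates the normalize-first behaviour described above.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: running sum over bin contents, optionally
# normalizing the input to $norm before accumulating.
sub cumulative_contents {
    my ($contents, $norm) = @_;
    my @in = @$contents;
    if (defined $norm) {
        my $total = sum(@in);
        @in = map { $_ * $norm / $total } @in if $total;
    }
    my $running = 0;
    my @cumulative;
    for my $value (@in) {
        $running += $value;
        push @cumulative, $running;
    }
    return \@cumulative;
}

my $plain = cumulative_contents([1, 2, 3]);
print "@$plain\n";
```

In this sketch, floating point rounding may leave the last bin of a normalized result very slightly off the requested value.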
multiply_constant
Scales all bin contents, as well as over- and underflow by the given constant.
This module comes with a Mersenne twister-based Random Number Generator that follows that in the Math::Random::MT module. It is available in the Math::SimpleHisto::XS::RNG class. You can create a new RNG by passing one or more integers to the Math::SimpleHisto::XS::RNG->new(...) method. The object's rand() method works like the normal Perl rand($x) function.
You can use a histogram as a source for random numbers that follow the distribution of the histogram.
push @random_like_hist, $hist->rand() for 1..100000;
If you pass a Math::SimpleHisto::XS::RNG object to the call to rand(), that random number generator will be used.
rand
Optionally given a Math::SimpleHisto::XS::RNG object (a random number generator), this returns a random number that is drawn from the distribution of the histogram.
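One common way to draw numbers that follow a histogram is to pick a bin with probability proportional to its content and then pick a position uniformly within that bin. The sketch below shows that general technique on plain arrays. Whether the module uses exactly this approach is not stated here, and sample_from_bins as well as its optional code reference argument are made up for this example.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: $contents holds the bin contents, $bounds the
# nbins + 1 bin boundaries, and $uniform is an optional code ref that
# returns uniform random numbers in [0, 1). It defaults to Perl's rand().
sub sample_from_bins {
    my ($contents, $bounds, $uniform) = @_;
    $uniform ||= sub { rand() };
    my $target = $uniform->() * sum(@$contents);
    my $acc    = 0;
    for my $i (0 .. $#$contents) {
        $acc += $contents->[$i];
        next if $target >= $acc;    # target lies in a later bin
        my ($lo, $hi) = @{$bounds}[$i, $i + 1];
        return $lo + $uniform->() * ($hi - $lo);
    }
    return undef;    # empty histogram (or rounding at the very top edge)
}

my @samples = map { sample_from_bins([1, 1, 2], [0, 1, 2, 4]) } 1 .. 5;
print "@samples\n";
```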
This class defines serialization hooks for the Storable module. Therefore, you can simply serialize objects using the usual
    use Storable;
    my $string = Storable::nfreeze($histogram);
    # ... later ...
    my $histo_object = Storable::thaw($string);
Currently, this mechanism hardcodes the use of the simple dump format. This is subject to change!
If at all possible, the de-serialization routine new_from_dump will be maintained in such a way that it will be able to deserialize dumps of histograms that were made with earlier versions of this module. If a new version of this module cannot achieve this at all, that will be mentioned prominently in the change log.
The other way around, serialized histograms are not generally backwards-compatible across major versions. That means you cannot deserialize a dump made with version 1.01 of this module using version 0.05. Such backwards-incompatible changes will always be accompanied with major version number changes (0.X => 1.X, 1.X => 2.X...).
The various serialization formats that this module supports (see the dump documentation below) all have various pros and cons. For example, the native_pack format is by far the fastest, but is not portable. The simple format is a very simple-minded text format, but it is portable and performs well (comparable to the JSON format when using JSON::XS; other JSON modules will be MUCH slower). Of all formats, the YAML format is the slowest. See xt/bench_dumping.pl for a simple benchmark script.
None of the serialization formats currently supports compression, but the native_pack format produces the smallest output at about half the size of the JSON output. The simple format is close to JSON for all but the smallest histograms, where it produces slightly smaller dumps. The YAML produced is a bit bigger than the JSON.
dump
This module has fairly simple serialization methods. Just call the dump method on an object of this class and provide the type of serialization desired. Currently valid serializations are simple, JSON, YAML, and native_pack. Case doesn't matter.
For YAML support, you need to have the YAML::Tiny module available. For JSON support, you need any of JSON::XS, JSON::PP, or JSON. The three modules are tried in order at compile time. The chosen implementation can be polled by looking at the $Math::SimpleHisto::XS::JSON_Implementation variable. It contains the module name. Setting this variable has no effect.
The simple serialization format is a home-grown text format that is subject to change, but in all likelihood, there will be some form of version migration code in the deserializer for backwards compatibility.
All of the serialization formats except for native_pack are text-based and thus portable and endianness-neutral.
native_pack should not be used when the serialized data is transferred to another machine.
new_from_dump
Given the type of the dump (simple, JSON, YAML, native_pack) and the actual dump string, creates a new histogram object from the contained data and returns it.
Deserializing JSON and YAML dumps requires the respective support modules to be available. See above.
SOOT is a dynamic wrapper around the ROOT C++ library which does histogramming and much more. Beware, it is experimental software.
Serialization can make use of the JSON::XS, JSON::PP, JSON or YAML::Tiny modules. You may want to use the convenient Storable module for transparent serialization of nested data structures containing objects of this class.
This module contains some code written by Abhijit Menon-Sen, who wrote Math::Random::MT.
Steffen Mueller, <smueller@cpan.org>
Copyright (C) 2011, 2012, 2013, 2014 by Steffen Mueller
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.