The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

PDL::Fit::ExpRate - fitting noisy exponential decay

VERSION

This document describes PDL::Fit::ExpRate version 0.03

SYNOPSIS

 use PDL;
 use PDL::Fit::ExpRate;
 
 # Load x/y data, or generate some noisy stuff:
 my $xs = sequence(100)/10;
 my ($A, $B, $tau) = (5, 4, -10);
 my $ys = $A + $B * exp($xs / $tau) + $xs->grandom;
 
 # Extract the parameters
 my ($fit_A, $fit_B, $fit_tau) = fit_exp_rate($xs, $ys);
 
 print "A was $A; fit as $fit_A\n";
 print "B was $B; fit as $fit_B\n";
 print "tau was $tau; fit as $fit_tau\n";
 
 # Other useful functions
 
 # Solve for $coefs in $A * $coefs = $y
 # where $A is 3x3
 my $A = pdl q[1 0 0 ; 0 2 0 ; 0 0 3];
 my $y = pdl q[5     ,   -4  ,     6];
 my $coefs = three_by_three_Householder($A, $y);
 
 # Perform a quadratic fit
 # y = coefs[0] + coefs[1] * x + coefs[2] * x**2
 my $coefs = fit_quadratic($xs, $ys);
 
 # Use a callback to get more information from the fit
 ($fit_A, $fit_B, $fit_tau) = fit_exp_rate($xs, $ys,
     run_each_fit => sub {
         my %info = @_;
         print "Full details at end of fit:\n";
         for my $k (sort keys %info) {
             print " $k => $info{$k}\n";
         }
     }
 );

DESCRIPTION

This module provides a handy function for fitting time series data to a single exponential growth or decay. A number of methods used by this method are also made available, including a method for fitting data to a second-order polynomial, and performing Householder diagonalization on 3x3 matrix systems, and various parameter estimation and single-step functions.

The naive approach to fitting data to an exponential decay involves taking the logarithm of the y-values and fitting the resulting transformed data to a line. This has a number of drawbacks. First, it leads to a bias in the errors: errors in the data associated with smaller y-values receive a stronger weight than those associated with larger y-values. Second, there is no simple way to handle an additive offset such as A in y = A + B * e**(x/tau). Third and most important, the transformation technique breaks down if the noise leads to any data values that are negative. The only simple way to handle such data is to discard it. An approach involving a logarithmic transformation is biased and can only be used under limited circumstances.

The fitting method supplied by this module, "fit_exp_rate", takes a different and much more robust approach that does not suffer any of those errors. It performs a least-squares fit to an exponential directly to the data using an iterative guarded Newton method. The technique gives accurate results, is highly configurable, and converges rapidly. As an added bonus, the underlying C functions can be accessed by other PDL modules using PDL::PP::Include::Fit::ExpRate.

FUNCTIONS

three_by_three_Householder

Given A and y, solves A * coef = y

This function performs Householder diagonalization on a 3x3 matrix A with the given "answer" y. It effectively inverts the matrix, leading to a solution for coef that can be effectively written as:

 coef = A**-1 * y

where A**-1 is the matrix inverse of A and multiplication is matrix multiplication.

Here's an exampe of how to use it. I've deliberately made the A matrix simple so that you can perform the inversion in your head, but A need not be sparse.

 my $A = pdl q[1 0 0 ; 0 2 0 ; 0 0 3];
 my $y = pdl q[5     ,   -4  ,     6];
 my $coefs = three_by_three_Householder($A, $y);
 my $expected = pdl(5, -2, 2);
 print "Got $coefs; expected $expected\n";

fit_quadratic

Given x and y data, determines A, B, and C for y = A + B x + C x**2

This function determines the least-squares coefficients for a quadratic fit to a given set of x-y data. This is a simple PDL wrapper for the internal C function exprate_fit_quadratic. This usage is pretty simple:

 # Make some noisy data
 my $xs = sequence(50);
 my $ys = 5 + 3 * $x + 4 * $x**2 + grandom($xs);
 
 # should be (5, 3, 4)
 my $coefs = fit_quadratic($xs, $ys);

exp_fit_estimate_parameters

Given x and y data, determines A, B, and lambda for y = A + B exp(x*lambda)

This function obtains an initial estimate for the parameters of an exponential decay. It also returns the sum of the squared errors for the fit. It obtains parameter estimates by fitting the data to a quadratic function, then mapping the coefficients of the quadratic to the coefficients of an exponential function using a Taylor Series. For example:

 # Make some noisy data
 my $xs = sequence(50);
 my $ys = 5 + 3 * exp($xs * 0.04) + grandom($xs) * 0.1;
 print "        A      B   lambda    err\n";
 
 # Print the original parameters
 printf "gave: %1.3f  %1.3f  %1.3f\n", 5, 3, 0.04;
 
 # Get estimates for parameters
 my ($A, $B, $lambda, $sum_sq_err) = exp_fit_estimate_parameters($xs, $ys);
 printf "est:  %1.3f  %1.3f  %1.3f  %1.3f\n", $A, $B, $lambda, $sum_sq_err;

This is a simple PDL wrapper for the internal C function exprate_estimate_parameters provided in PDL::PP::Include::Fit::ExpRate.

exp_fit_newton_method_step

Takes a single gaurded Newton-method step for the iterative fitting process

While the above method, exp_fit_estimate_parameters, can give reasonable estimates for an exponential fit, a proper fit involves an iterative Newton method. Such a method evaluates the local gradient of the error and takes a step in the direction toward the estimated minimum of the error. The method can only estimate where the minimum is located, so it limits the size of its steps so as not to go too far. While the full fittinig routine performs many steps, this function performs only a single step.

This function takes a number of parameters, and returns quite a few as well. The input values include the data as well as the previous best guess at the parameters for A, B, and lambda in the equation

 y = A + B * exp(x * lambda)

The function returns the result of a single gaurded Newton method iteration. Such a step is guaranteed to fall within the specified trust radius, which I explain more in a moment. The function also returns the sum of the squared errors for the new parameters.

 # Make some noisy data
 my $xs = sequence(50);
 my $ys = 5 + 3 * exp($xs * 0.04) + grandom($xs) * 0.1;
 print "          A      B   lambda  err\n";
 
 # Get initial estimates for parameters
 my ($A, $B, $lambda, $sum_sq_err) = exp_fit_estimate_parameters($xs, $ys);
 printf "step 0: %1.3f  %1.3f  %1.3f  %1.3f\n", $A, $B, $lambda, $sum_sq_err;
 
 # Perform many single steps to see how it goes
 my $trust_radius = 0.1;  # trust radius of 10%
 for (1 .. 9) {
   ($A, $B, $lambda, $sum_sq_err)
     = exp_fit_newton_method_step($xs, $ys, $trust_radius, $A, $B, $lambda);
   printf "step $_: %1.3f  %1.3f  %1.3f  %1.3f\n", $A, $B, $lambda, $sum_sq_err;
 }
 
 # Print the original parameters
 print '=' x 35, "\n";
 printf "gave:   %1.3f  %1.3f  %1.3f\n", 5, 3, 0.04;

The trust radius ensures that each of the parameters does not change by more than the specified amount in a given step. The basic idea is that we want to find the minimum in the surface representing the sum of the squared errors, and we do this by locally approximating the surface by a quadratic. This local approximation is only accurate within a small parameter radius, so if the minimum of the local quadratic is too far away, it would be unwise to go all the way to that location. A trust radius of 0.1 (i.e. 10%) means you do not want your estimates to change by more than 10%.

You should avoid a trust radius larger than 1. Such a large trust radius would make it possible to change the sign of your parameters. If you find that you are getting the wrong sign for your parameters, you should find a better initial estimate rather than increase your trust radius.

fit_exp_rate

Fits your data to an exponential decay or growth

This is the primary method of PDL::Fit::ExpRate. This method takes your x and y data and computes a best-fit for the parameters A, B, and tau in the equation

                  x[i] / tau
 y[i] = A + B * e

If a fit fails to converge, the values of A, B, and tau for that dataset will be set to BAD (see PDL::Bad for details).

In additon to the x/y data, you can optionally include A, B, and tau. For most PDL methods, this provides a means to pre-allocate (or use previously allocated) memory for the output, which can be important for large data operations. That's not likely to be the case for this method, but the mechanism is provided for consistency. This can also be used if you want to provide your own initial estimates for A, B, and tau, which you may find useful if the quadratic estimate method is not working well for your data. The method assumes the latter, but if you are simply want to pass along pre-allocated piddles, see the pre_allocated flag below. Either way, the method will overwrite the data in A, B, and tau, as they are considered output piddles.

After piddle arguments x and y (and optionally A, B, and tau), you can specify key/value pairs that govern how the fit is performed. Options include:

pre_allocated

If you pre-allocated piddles for A, B, and tau, and are not providing them as initial estimates, then you should include this key with a boolean true value, i.e. 1.

trust_radius

The trust radius for the fit(s), which governs how much the parameters will evolve with a single iteration. See the documentation for exp_fit_newton_method_step for a more complete explanation of this parameter. The default value is 0.1, i.e. 10%.

iterations

The maximum number of iterations, i.e. the fit must converge before this number of iterations or it will be considered a failed fit. The default value is 50.

threshold

The sum of the squared errors will vary as the fit iterates. As the fit converges, the deviations will get very small. The threshold indicates the deviations that are acceptable to declare convergence. This is not the percent difference in the parameters, but in the sum of the squared errors. The default is 0.001.

Note that for noiseless data, the fitting routine may run into issues with bit noise in its calculations for the sum of the squared errors. If it detects that the sum of the squared errors has fallen into bitnoise, it will deem the system converged and disregard the threshold.

min_lambda, max_lambda

You may have physical reasons for expecting your exponential rates to fall within certain ranges. If so, you can provide the min and max values of lambda that are allowed. (Lambda is the inverse of tau.) The magnitudes of these numbers will be compared with the magnitudes of the lambda value with each iteration of the fit, and if the fit value falls outside of the given range, the fit will be considered a failure. If you do not want to impose any limitations, set both of these to zero. The default value for max_lambda is zero (no limit) but the default value for min_lambda is 1e-8.

run_each_fit

A callback that gets called at the completion of each fit. This was implemented primarily for use in progress indicators and GUI applications. However, it is also useful for getting more information about each fit than is returned by the method itself. The items passed to the callback are key/value pairs (i.e. something that you can assign to a hash) that include:

  N_fits     - number of fits we're doing
  fit_count  - number of fits we've completed
  A          - final value for A for this fit
  B          - final value for B for this fit
  tau        - final value for tau for this fit
  converged  - boolean indicating if the routine will declare
               that this fit has converged
  sum_sq_err - last sum of the squared deviations for the
               just-completed fit
  old_sum_sq_err
             - previous sum of the squared deviations for the
               just-completed fit
  threshold  - threshold, see option above
  N_rounds   - number of rounds required for convergence

Note that if converged is false, the method will eventually return values for A, B, and tau that are BAD. This callback provides a way to get the last (but unsuccessful) values of A, B, and tau, which may be helpful in certain circumstances. Usually, converged will be true if (sum_sq_err - old_sum_sq_err) / old_sum_sq_err < threshold. It might also be true if the data is noise-free and sum_sq_err has fallen to the level of bit noise.

The return value of this callback is meaningful. If it is false, the PDL function will abort without threading over any more datasets. If it is true, the PDL function will continue. This has seen use in GUI programs that initiate a large number of fits. The callback was used to keep the GUI responsive, and to make it possible for the user to cancel the fits mid-calculation.

A simple progress indicator might look like this:

 my ($As, $Bs, $taus) = fit_exp_rate($xs, $ys,
     run_each_fit => sub {
         my %info = @_;
         print "Completed $info{fit_count} of $info{N_fits}\n";
         return 1;
     },
 );
run_each_iteration

A callback that gets called at the completion of each iteration of each fit. As this will get called with each round of the iteration, an iteration callback can have a significant performance impact. As such, this is primarily useful as a debugging aid when your fit fails to converge. In such circumstances, you can add an iteration callback to print how the parameters are evolving, which may help figure out how to tweak the parameters that govern the iterative process.

The items passed to the callback are key/value pairs (i.e. something that you can assign to a hash) that include:

  A           - current value for A for this fit
  B           - current value for B for this fit
  lambda      - current value for 1/tau for this fit
  round       - iteration count
  sum_sq_err  - last sum of the squared deviations for the
                just-completed fit
  old_sum_sq_err
              - previous sum of the squared deviations for
                the just-completed fit
  threshold   - threshold, see option above
  force_round - boolean indicating whether the next round
                is forced, which happens if the Newton
                method step was confined by the trust radius

Of these, force_next_round probably needs a bit of clarification. If the Newton method step was truncated by the trust_radius, then the change in the sum of the squared errors may not have changed much. In that case, the threshold limit could erroneously indicate convergence. For this reason, the method will always perform another iteration when the step gets truncated by the trust radius. The value of force_next_round will indicate if this is the case.

All of these values are either scalars or function references, but are not piddles. So, unlike the trust radius in exp_fit_newton_method_step, if you utilize PDL's threading to perform exponential fits over multiple datasets in one call, they will all use this same value of the trust_radius and the other parameters.

Notice that, unlike exp_fit_newton_method_step, this method does not return the sum of the squared errors. If you want that information, you should supply a per-fit callback which collects the result.

Here are some examples of how to use the function:

 # Make some noisy data
 my $xs = sequence(500);
 my $ys = 5 + 3 * exp($xs * -0.05) + grandom($xs) * 0.5;
 print "   A      B      tau\n";
 
 # Use default parameters
 my ($A, $B, $tau) = fit_exp_rate($xs, $ys);
 printf " %1.3f  %1.3f  %1.3f  -> defaults\n", $A->sclr, $B->sclr, $tau->sclr;
 
 # See how things as we tighten the threshold a bit
 for my $threshold (0.1, 0.01, 0.001, 0.0001, 0.00001) {
     ($A, $B, $tau) = fit_exp_rate($xs, $ys,
       threshold => $threshold
     );
     printf " %1.3f  %1.3f  %1.3f  -> threshold = $threshold\n", $A->sclr, $B->sclr, $tau->sclr;
 }
 
 print '=' x 35, "\n";
 printf " %1.3f  %1.3f  %1.3f  -> values for generating data\n", 5, 3, -1/0.05;
 
 # Use callback to track the convergence of A
 my @As;
 ($A, $B, $tau) = fit_exp_rate($xs, $ys,
     run_each_iteration => sub {
         my %info = @_;
         push @As, $info{A};
     }
 );
 print "\nA:   ", join ("\n  -> ", @As), "\n";
 
 # Use callback to print diagnostics
 print '-' x 30, "\n";
 ($A, $B, $tau) = fit_exp_rate($xs, $ys,
    iterations => 100,
        run_each_iteration => sub {
                my %info = @_;
                return if $info{round} % 5 != 0;
                print "Round $info{round}:\n";
                delete $info{round};
                for my $k (sort keys %info) {
                        print "  $k: $info{$k}\n";
                }
        },
 );

SEE ALSO

There are a number of PDL modules for fitting data, including PDL::Fit::Linfit for linear fitting and PDL::Fit::Polynomial for fitting data to polynomials. Both of those modules depend on Slatec (though they may be installed on your system even if you don't have Slatec available). PDL::Fit::LM provides Levenberg-Marquardt fitting, and PDL::Fit::Gaussian provides methods for fitting a set of data to a normal (Gaussian) distribution.

AUTHOR

David Mertens <dcmertens.perl@gmail.com>

LICENCE AND COPYRIGHT

Documentation is copyright (c) 2012, David Mertens, all rights reserved.

Code is copyright (c) 2012, Northwestern University, all rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.