John Heidemann >
Fsdb >
Fsdb::Filter::dbcolstats

Module Version: 2
dbcolstats - compute statistics on a fsdb column

dbcolstats [-amS] [-c ConfidenceFraction] [-q NumberOfQuantiles] column

Compute statistics over a COLUMN of data.
Records containing non-numeric data are considered null do not contribute to the stats (with the `-a`

option they are treated as zeros).

Confidence intervals are a t-test (+/- (t_{a/2})*s/sqrt(n)) and assume the population takes a normal distribution with a small number of samples (< 100).

By default,
all statistics are computed for as a population *sample* (with an ``n-1'' term),
not as representing the whole population (using ``n'').
Select between them with **--sample** or **--nosample**.
When you measure the entire population,
use the latter option.

The output of this program is probably best looked at after reformatting with dblistize.

Dbcolstats runs in O(1) memory. Median or quantile requires sorting the data and invokes dbsort. Sorting will run in constant RAM but O(number of records) disk space. If median or quantile is required and the data is already sorted, dbcolstats will run more efficiently with the -S option.

**-a**or**--include-non-numeric**-
Compute stats over all records (treat non-numeric records as zero rather than just ignoring them).

**-c FRACTION**or**--confidence FRACTION**-
Specify FRACTION for the confidence interval. Defaults to 0.95 for a 95% confidence factor.

**-f FORMAT**or**--format FORMAT**-
Specify a printf(3)-style format for output statistics. Defaults to

`%.5g`

. **-m**or**--median**-
Compute median value. (Will sort data if necessary.) (Median is the quantitle for N=2.)

**-q N**or**--quantile N**-
Compute quantile (quartile when N is 4), or an artbitrary quantile for other values of N, where the scores that are 1 Nth of the way across the population.

**--sample**-
Compute

*sample*population statistics (e.g., the sample standard devation), assuming*n-1*degress of freedom. **--nosample**-
Compute

*whole*population statistics (e.g., the population standard devation). **-S**or**--pre-sorted**-
Assume data is already sorted. With one -S, we check and confirm this precondition. When repeated, we skip the check.

**-F**or**--fs**or**--fieldseparator**S-
Specify the field (column) separator as

`S`

. See dbfilealter for valid field separators. **-T TmpDir**-
where to put temporary data. Only used if median or quantiles are requested. Also uses environment variable TMPDIR, if -T is not specified. Default is /tmp.

This module also supports the standard fsdb options:

**-d**-
Enable debugging output.

**-i**or**--input**InputSource-
Read from InputSource, typically a file name, or

`-`

for standard input, or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue objects. **-o**or**--output**OutputDestination-
Write to OutputDestination, typically a file name, or

`-`

for standard output, or (if in Perl) a IO::Handle, Fsdb::IO or Fsdb::BoundedQueue objects. **--autorun**or**--noautorun**-
By default, programs process automatically, but Fsdb::Filter objects in Perl do not run until you invoke the run() method. The

`--(no)autorun`

option controls that behavior within Perl. **--help**-
Show help.

**--man**-
Show full manual.

#fsdb absdiff 0 0.046953 0.072074 0.075413 0.094088 0.096602 # | /home/johnh/BIN/DB/dbrow # | /home/johnh/BIN/DB/dbcol event clock # | dbrowdiff clock # | /home/johnh/BIN/DB/dbcol absdiff

cat data.fsdb | dbcolstats absdiff

#fsdb mean stddev pct_rsd conf_range conf_low conf_high conf_pct sum sum_squared min max n 0.064188 0.036194 56.387 0.037989 0.026199 0.102180.95 0.38513 0.031271 0 0.096602 6 # | /home/johnh/BIN/DB/dbrow # | /home/johnh/BIN/DB/dbcol event clock # | dbrowdiff clock # | /home/johnh/BIN/DB/dbcol absdiff # | dbcolstats absdiff # 0.95 confidence intervals assume normal distribution and small n.

dbmultistats(1), handles multiple experiments in a single file.

dblistize(1), to pretty-print the output of dbcolstats.

dbcolpercentile(1), to compute an even more general version of median/quantiles.

dbcolstatscores(1), to compute z-scores or t-scores for each row

dbrvstatdiff(1), to see if two sample populations are statistically different.

Fsdb.

The algorithms used to compute variance have not been audited to check for numerical stability. (See *http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance*).) Variance may be incorrect when standard deviation is small relative to the mean.

The field `conf_pct`

implies percentage, but it's actually reported as a fraction (0.95 means 95%).

Because of limits of floating point, statistics on numbers of widely different scales may be incorrect. See the test cases *dbcolstats_extrema* for examples.

$filter = new Fsdb::Filter::dbcolstats(@arguments);

Create a new dbcolstats object, taking command-line arugments.

$filter->set_defaults();

Internal: set up defaults.

$filter->parse_options(@ARGV);

Internal: parse command-line arguments.

$filter->setup();

Internal: setup, parse headers.

$i = _round_up($x);

Internal: Round up to the next integer.

($median, $quantile_aref) = _compute_quantile($n, $mean);

Internal: Compute quantile from the saved data. Not generalizable. We assume the saved output is closed before we enter.

$filter->run();

Internal: run over each rows.

Copyright (C) 1991-2007 by John Heidemann <johnh@isi.edu>

This program is distributed under terms of the GNU general public license, version 2. See the file COPYING with the distribution for details.

syntax highlighting: