Lincoln D. Stein > Bio-BigFile > Bio::DB::BigWig

Download:
Bio-BigFile-1.07.tar.gz

Dependencies

Annotate this POD

CPAN RT

New  4
Open  0
View/Report Bugs
Source  

SYNOPSIS ^

   use Bio::DB::BigWig 'binMean';

   my $wig  = Bio::DB::BigWig->new(-bigwig=>'ExampleData/dpy-27-variable.bw',
                                    -fasta=>'/tmp/elegans.fa');

   # Fetch individual intervals
   # fetch the individual data points in the wig file over a region of interest
   my @points = $wig->features(-seq_id=>'I',-start=>100,-end=>1000);
   for my $p (@points) {
      my $start = $p->start;
      my $end   = $p->end;
      my $val   = $p->score;
      print "$start..$end : $val\n";
   }

   # same thing but using a "segment" object
   my $segment = $wig->segment('I',100=>1000);
   for my $p ($segment->features) {
      my $start = $p->start;
      # etc.
   }

   # Same thing, but using an iterator.
   my $iterator = $wig->get_seq_stream(-seq_id=>'I',-start=>100,-end=>1000);
   while (my $p = $iterator->next_seq) {
      my $start = $p->start;
      # etc.
   }

   # dump whole thing out as a BEDGraph file
   my $iterator = $wig->get_seq_stream();
   while (my $p = $iterator->next_seq) {
      my $seqid = $p->seq_id;
      my $start = $p->start;
      my $end   = $p->end;
      my $val   = $p->score;
      print join("\t",$seqid,$start,$end,$val),"\n";
   }

   # Statistical summaries using "bin" feature type
   # Fetch 10 intervals across region 5M=>6M on chromosome I
   my @bins = $wig->features(-seq_id=>'I',
                             -start  => 5_000_000,
                             -end    => 6_000_000,
                             -type=>'bin:10');
   for my $b (@bins) {
      my $start = $b->start;
      my $end   = $b->end;
      print "$start..$end, mean = ",$b->mean,"\n";
   }

   # same thing, but get 100 intervals across all of chromosome I
   my @bins = $wig->features(-seq_id=>'I',
                             -type=>'bin:100');
   for my $b (@bins) {
      my $start = $b->start;
      # etc.
   }

   # same thing, but get summaries across entirety of each chromosome
   my @bins = $wig->features(-type=>'bin'); # same as "bin:1"
   for my $b (@bins) {
       my $chrom = $b->seq_id;
       print "$chrom mean val: ",$b->mean,"\n";
   }
   
   # alternative interface using the memory-efficient "summary" feature type

   # get statistical summaries across all chromosomes
   my @summary = $wig->features(-type=>'summary');# one for each chromosome
   for my $s (@summary) {
      print "chromosome ",$s->seq_id,"\n";
      my $stats = $s->statistical_summary(10);   # 10 evenly-spaced bins as an array ref
      print "\tmax  = ",$stats->[0]{maxVal},"\n";
      print "\tmean = ",binMean($stats->[0]),"\n";
   }

   # get statistical summary across just chromosome I
   my ($summary) = $wig->features(-seq_id=>'I',-type=>'summary'); 
   my $stats = $summary->statistical_summary(10);   # 10 evenly-spaced bins as an array ref
   print "\tmax  = ",$stats->[0]{maxVal},"\n";
   print "\tmean = ",binMean($stats->[0]),"\n";
   
   # get statistical summary across a subregion
   ($summary) = $wig->features(-seq_id => 'I',
                               -start  => 5_000_000,
                               -end    => 6_000_000,
                               -type   => 'summary');
   $stats = $summary->statistical_summary(10); # 10 bins across region

   # get an iterator across the intervals covered by a summary
   my $i = $summary->get_seq_stream();
   while (my $interval = $i->next_seq) {
      print $interval->start,'..',$interval->end,': ',$interval->score,'\n";
   }

DESCRIPTION ^

This module provides a high-level interface to Jim Kent's BigWig files, a type of indexed genome feature database that can be randomly accessed across the network. Please see http://genome.ucsc.edu/FAQ/FAQformat.html for information about creating these files.

For the low-level interface, please see Bio::DB::BigFile. BigBed files are supported by the module Bio::DB::BigBed.

Installation

Installation requires a compiled version of Jim Kent's source tree, including the main library, jkweb.a. Please see the README in the Bio-BigFile distribution directory for instructions.

BioPerl SeqFeature APi

This high-level interface places a BioPerl-compatible API on top of the native Bio::DB::BigFile interface. This API will be famiiar to users of the Bio::DB::SeqFeature and Bio::DB::GFF modules. You use the features() and get_seq_stream() method to query the database for features of various types. The features returned obey the Bio::SeqFeatureI interface, and provide methods for accessing the feature's type, coordinates, score, and subfeatures.

Please note that all genomic coordinates consumed or returned by this module are one-based closed intervals, identical to the BioPerl standard. This is not true of the low level interfaces.

CLASS METHODS ^

The new() method allows you to create new instances of Bio::DB::BigWig.

$bw = Bio::DB::BigWig->new(-bigwig=>$bw_path,-fasta=>$fa_path) =item $bw = Bio::DB::BigWig->new($bw_path)

Create a new Bio::DB::BigWig object. The -bigwig argument (required) points to the path to the indexed .bw file you wish to open. Alternatively, you may pass an http: or ftp: URL in order to open a remote BigWig file. A shorter version of new() allows you to pass a single argument consisting of the BigWig file path.

The optional -fasta argument provides a path to a FASTA file containing the genome sequence corresponding to the original WIG file. All DNA sequences come from this file, so annoying and confusing things will happen if use the wrong genome build. The file must use chromosome/contig identifiers that match those in the WIG file from which the BigWig was built.

This module uses the BioPerl Bio::DB::Fasta libary to build an index of the FASTA file, which means that the directory in which the FASTA file resides must be writable by the current process the first time you use it. Alternately, you can pass the -fasta option a previously-opened Perl object that supports a seq() method. This method takes three arguments consisting of the chromosome/contig name and the start and end coordinates of the region of interest in 1-based coordinates. It returns the DNA as a plain string.

    my $dna_string = $object->seq($seqid,$start,$end);

Suitable implementations include Bio::DB::SeqFeature::Store (part of BioPerl) and Bio::DB::Sam::Fai, part of the Bio::SamTools package. You are of course welcome to implement your own Fasta object.

When opening a remote file on an FTP or HTTP server, the directory returned by Bio::DB::BigFile->udcGetDefaultDir must be writable (usually '/tmp/udcCache'). The new() method will attempt to catch the case in which this directory is not writable and instead set the cache to /tmp/udcCache_###, where ### is the current username. For better control over this behavior, you may set the environment variable UDC_CACHEDIR before creating the BigWig file.

OBJECT METHODS ^

The following are public methods available to Bio::DB::BigWig objects.

Accessors

Here are read-only accessors that give you limited access to the contents of the BigWig object. In the method synopses given below, $bigwig is a Bio::DB::BigWig object.

$bigfile = $bigwig->bf

Return the low-level Bio::DB::BigFile underlying the object.

$bigfile = $bigwig->bigwig

An alias for bf().

$fasta = $bigwig->fa

Return the DNA accessor (usually a Bio::DB::Fasta object) which the object uses to fetch the sequence of the reference genome.

Retrieving intervals and summary statistics

This section describes methods that return lists of intervals and summary statistics from the BigWig object. Most of the methods are oriented towards retrieving information about the distribution of values in the WIG file. Summary information typically consists of the following fields:

  Key                Value
  ---            ---------

  validCount     Number of intervals in the bin

  maxVal         Maximum value in the bin

  minVal         Minimum value in the bin

  sumData        Sum of the intervals in the bin

  sumSquares     Sum of the squares of the intervals in the bin

From these values one can determine the mean, variance and standard deviation across one or more genomic intervals. The formulas are as follows:

 sub mean {
    my ($sumData,$validCount) = @_;
    return $sumData/$validCount;
 }

 sub variance {
    my ($sumData,$sumSquares,$validCount) = @_;
    my $var = $sumSquares - $sumData*$sumData/$validCount;
    if ($validCount > 1) {
        $var /= $validCount-1;
    }
    return 0 if $var < 0;
    return $var;
 }

 sub stdev {
     my ($sumData,$sumSquares,$validCount) = @_;
     my $variance = variance($sumData,$sumSquares,$validCount);
     return sqrt($variance);
 }

Note that in the calculation of variance, there is a chance of getting very small negative numbers in a tight distribution due to floating point rounding errors. Hence the check for variance < 0. To pool bins, simply sum the individual values.

For your convenience, Bio::DB::BigWig optionally exports functions that perform these calculations for you. Please see "EXPORTED FUNCTIONS" below.

The following methods allow you to query the BigWig file:

@features = $bigwig->features(@args)

This method is the workhorse for retrieving various types of intervals and summary statistics from the BigWig database. It takes a series of named arguments in the format (-argument1 => value1, -argument2 => value2, ...) and returns a list of zero or more BioPerl Bio::SeqFeatureI objects.

The following arguments are recognized:

   Argument     Description                         Default
   --------     -----------                         -------

   -seq_id      Chromosome or contig name defining  All chromosomes/contigs.
                the range of interest.

   -start       Start of the range of interest.     1

   -end         End of the range of interest        Chromosome/contig end

   -type        Type of feature to retrieve         'region'
                    (see below). 

   -iterator    Boolean, which if true, returns     undef (false)
                an iterator across the list rather
                than the list itself.

Each returned feature supports the standard Bio::SeqFeatureI methods, including seq_id(), start(), end(), strand(), display_name(), primary_id(), primary_tag() and score(). Typically, the seq_id, start, end and score are of the greatest interest.

Three different feature types are accepted, each one with slightly different properties:

region,interval

Features retrieved using either of these types will represent the raw intervals and values present in the original WIG file (the two are equivalent, "region" is preferred because it is a Sequence Ontology term, but "interval" is probably more natural). Note that this operation may consume a lot of memory and be processor-intensive. It is almost always better to create an iterator using get_seq_stream().

These features have the following useful methods:

  $feature->seq_id()     The chromosome/contig name
  $feature->start()      Start of the interval
  $feature->end()        End of the interval
  $feature->score()      Score for the interval
  $feature->primary_id() Primary ID for the interval

Due to floating point precision issues at the BigWig C level, the scores may be very slightly different from the originals.

Here is a simple script that dumps out the contents of chromosome I in BedGraph format:

 use Bio::DB::BigWig;

  my $wig  = Bio::DB::BigWig->new(-bigwig=>$path);
  my @features = $wig->features(-seq_id=>'I',-type=>'region');
  for my $p (@features) {
     my $seqid = $p->seq_id;
     my $start = $p->start;
     my $end   = $p->end;
     my $val   = $p->score;
     print join("\t",$seqid,$start,$end,$val),"\n";
 }

bin:#count

A type named "bin:" followed by an integer will divide each chromosome/contig into the indicated number of summary bins, and return one feature for each bin. For example, type "bin:100" will return 100 evenly-spaced bins across each chromosome/contig.

The returned bin features have the same methods as those returned by the "region"/"interval" types, except that the start() and end() methods return the boundaries of the bin rather than any individual interval reported in the WIG file. Instead of returning a single value, the score() method returns a hash of statistical summary information containing the keys validCount, maxVal, minVal, sumData and sumSquares as described earlier.

In addition, the bin objects add the following convenience methods:

 $bin->count()    Same as $bin->score->{validCount}
 $bin->minVal()   Same as $bin->score->{minVal}
 $bin->maxVal()   Same as $bin->score->{maxVal}
 $bin->mean()     The mean of values in the bin (from the formula above)
 $bin->variance() The variance of values in the bin (ditto)
 $bin->stdev()    The standard deviation of values in the bin (ditto)

If no number is specified (i.e. you search for type "bin"), then an interval of "1" is assumed. You will receive one bin spanning each chromosome/contig.

summary

This feature type is similar to "bin" except that instead of returning one feature for each binned interval on the genome, it returns a single object from which you can retrieve summary statistics across fixed-width bins in a more memory-efficient manner. Call the object's statistical_summary() method with the number of bins you need to get an array ref of bins length. Each element of the array will be a hashref containing the minVal, maxVal, sumData, sumSquares and validCount keys. The following code illustrates how this works:

 use Bio::DB::BigWig 'binMean','binStdev';
 my $wig = Bio::DB::BigWig->new(-bigwig=>$path);
 my @chroms = $wig->features(-type=>'summary');

 for my $c (@chroms) {
    my $seqid   = $c->seq_id;
    my $c_start = $c->start;

    my $stats     = $c->statistical_summary(10);
    my $bin_width = $c->length/@$stats;
    my $start     = $c_start;

    for my $s (@$stats) {
        my $mean  = binMean($s);
        my $stdev = binStdev($s);
        my $end   = $start + $bin_width-1;
        print "$seqid:",int($start),'..',int($end),": $mean +/- $stdev\n";
    } continue {
       $start += $c_start;
    }
 }

"summary" features have a score() method which returns a statistical summary hash across the entire region. They also have a get_seq_stream() method which returns a feature iterator across the region they cover.

 my $summary = $c->score;

To get the mean and stdev across the entire chromosome containing the summary feature, call its chr_mean() and chr_stdev() methods:

 my $mean  = $c->chr_mean;
 my $stdev = $c->chr_stdev;

To get the global mean and stdev across the entire genome containing the summary feature, call its global_mean() and global_stdev() methods:

 my $mean  = $c->global_mean;
 my $stdev = $c->global_stdev;

The summary object's chr_stats() and global_stats() methods return a hashref containing the statistical summary for the chromosome and genome respectively.

 my $stats  = $c->global_stats;
 my $stats = $c->chr_stats;

To receive the underlying bigwig database from a summary feature, you may call its bigwig() method:

  my $bigfile = $c->bigwig;

If the argument -iterator=>1 is present, then instead of returning an array of features, the method call will return a single object that has a next_seq() method. Each time next_seq() is called it will return a retrieved feature in a memory-efficient manner. Here is a script that will dump the entire BigWig file out in BedGraph format:

 use Bio::DB::BigWig;

  my $wig  = Bio::DB::BigWig->new(-bigwig=>$path);
  my $iterator = $wig->features(-iterator=>1,-type=>'region');
  while (my $p $iterator->next_seq) {
     my $seqid = $p->seq_id;
     my $start = $p->start;
     my $end   = $p->end;
     my $val   = $p->score;
     print join("\t",$seqid,$start,$end,$val),"\n";
 }

The next few methods are basically convenience interfaces to the features() method:

@features = $bigwig->get_features_by_type($type)

This method returns all features of the specified type from the BigWig file without regard to their location

@features = $bigwig->get_features_by_location($seqid,$start,$end)
@features = $bigwig->get_features_by_location(@named_args)

get_features_by_location() retrieves features across a specified interval of the genome. In its three-argument form, it accepts the ID of the chromosome/contig, the start position and the end position of the desired range. If start or end are omitted, they default to the beginning and end of the chromosome respectively.

In the named-argument form, it behaves essentially identically to features().

Typical usage to fetch 100 bins across the region of chromosome "I" from 1Mb to 2Mb:

  my @bins = $bigwig->get_features_by_location(-seq_id => 'I',
                                               -start  => 1_000_000,
                                               -end    => 2_000_000,
                                               -type   => 'bins:100');

  foreach (@bins) {
      print $_->start,'..',$_->end,': ',$_->mean,"\n";
  }
@features = $bigwig->get_features_by_name($name)
@features = $bigwig->get_features_by_alias($name)
@features = $bigwig->get_features_by_attribute(%attributes)
$feature = $bigwig->get_feature_by_id($id)
$feature = $bigwig->get_feature_by_name($name)

These methods are supported for compatibility with like-named methods in BioPerl's Bio::DB::SeqFeature::Store class. However they do not do anything useful, as these are not properties of WIG data.

$iterator = $bigwig->get_seq_stream(@args)

This method is identical to calling $bigwig->features(-iterator=>1,@args), and returns an iterator object that will fetch the result of the query by calling next_seq repeatedly:

  my $i = $bigwig->get_seq_stream(-seq_id => 'I',
                                  -start  => 1_000_000,
                                  -end    => 2_000_000,
                                  -type   => 'bins:100');

  while (my $b = $i->next_seq;
      print $b->start,'..',$b->end,': ',$b->mean,"\n";
  }
@seq_ids = $bw-seq_ids>

This method returns the names of all the chromosomes/contigs known to the BigWig file.

$length = $bw-length($seqid)>

Given the ID of a chromosome or contig, this returns its size in bases. If the sequence ID is unknown, returns undef.

$dna = $bw->seq($seqid,$start,$end)

Given a sequence ID and a range, this method returns the DNA bases (a string, not a Bio::PrimarySeq object) if a DNA accessor or FASTA path was defined at object creation time. Otherwise it will return undef.

Segments

Several "segment" methods allows you to define a range on the genome and retrieve information across it.

$segment = $bigwig->segment($seqid,$start,$end) =item $segment = $bigwig->segment(-seq_id=$seqid, -start=>$start, -end=>$end)>

The segment() method returns a Bio::DB::BigFile::Segment object. It has two forms. In the positional form, pass it the sequence ID (chromosome or contig name), and start and end positions of the range. Start defaults to 1 if missing or undef, and end defaults to the full length of the chromosome/contig. The named form takes the named arguments -seq_id, -start and -end.

Once a segment is defined, the following methods are available:

@features = $segment->features([$type])

Return a list of features that overlap the segment. You may us this to retrieve statistical summary information about the segment or subranges of the segment, or to return the underlying WIG values. The optional $type argument allows you to select what type of information to retrieve. Options are bin to retrieve statistical information as a series of bin features spanning the segment, summary to retrieve a single summary object describing the entire segment, or region to retrieve the intervals defined in the original WIG file (the type named interval is an alias for 'region'). If no type is given, then region is assumed. See the discussion of the Bio::DB::WigFile->features() method for a more in-depth discussion of how this works.

$iterator = $segment->get_seq_stream([$type])

This method returns an iterator across the segment. Call the iterator's next_seq() method repeatedly to step through the features contained within the segment in a memory-efficient manner:

  my $iterator = $segment->get_seq_stream('region');
  while (my $f = $iterator->next_seq) {
      print $f->score,"\n";
  }
$seqid = $segment->seq_id
$start = $segment->start
$end = $segment->end

These methods return the chromosome/contig name, start and end of the segment range, respectively.

EXPORTED FUNCTIONS ^

These convenience functions are optionally exportable. You must explicitly request them, as in:

 use Bio::DB::BigWig qw(binMean binVariance, binStdev);
$mean = binMean($feature->score)

Return the mean of a summary statistics hash, such as those returned by the bin and summary feature score() method. This will also suitable for the elements of the array ref returned by summary features' statistical_summary() method.

$variance = binVariance($feature->score)

As above, but returns the variance of the summary statistics.

$sd = binStdev($feature->score)

As above, but returns the standard deviaton of the summary statistics.

Using BigWig objects with Bio::Graphics ^

Recent versions of the Bio::Graphics module (see Bio::Graphics) directly supports the "summary" feature type via the wiggle_whiskers glyph. This glyph uses different color intensities to summarize the mean, standard deviation, min and max of bins across the range. You do not have to specify the bin size -- the glyph will choose a bin that is most appropriate for the width of the drawing. Typical usage is like this:

 use Bio::DB::BigWig;
 use Bio::Graphics;

 my $path      = 'ExampleData/dpy-27-variable.bw';
 my $wig       = Bio::DB::BigWig->new($path) or die;
 my ($summary) = $wig->features(-seq_id=>'I',
                                -type=>'summary');

 my $panel     = Bio::Graphics::Panel->new(-width   => 800,
                                           -segment => $summary,
                                           -key_style => 'between',
                                           -pad_left => 10,
                                           -pad_right=>18,
    );
 # add the scalebar
 $panel->add_track($summary,
                  -glyph => 'arrow',
                  -tick  => 2,
                  -label => 'chrI',
    );

 # add the whisker
 $panel->add_track($summary,
                  -glyph => 'wiggle_whiskers',
                  -height => 80,
                  -key   => 'Summary statistics',
    );
 print $panel->png;

Using BigWig objects and GBrowse ^

The Generic Genome Browser version 2.0 (http://www.gmod.org/gbrowse) can treat a BigWig file as a track database. A typical configuration will look like this:

 [BigWig:database]
 db_adaptor    = Bio::DB::BigWig
 db_args       = -bigwig /var/www/data/dpy-27-variable.bw
                 -fasta  /var/www/data/elegans-ws190.fa

 [BigWigIntervals]
 feature  = summary
 database = BigWig
 glyph    = wiggle_whiskers
 min_score = -1
 max_score = +1.5
 key       = DPY-27 ChIP-chip

SEE ALSO ^

Bio::DB::BigFile, Bio::Perl, Bio::Graphics, Bio::Graphics::Browser2

AUTHOR ^

Lincoln Stein <lincoln.stein@oicr.on.ca>. <lincoln.stein@bmail.com>

Copyright (c) 2010 Ontario Institute for Cancer Research.

This package and its accompanying libraries is free software; you can redistribute it and/or modify it under the terms of the GPL (either version 1, or at your option, any later version) or the Artistic License 2.0. Refer to LICENSE for the full license text. In addition, please see DISCLAIMER.txt for disclaimers of warranty.

syntax highlighting: