The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

get_datasets.pl

A program to collect data for a list of features

SYNOPSIS

get_datasets.pl [--options...] [<filename>]

get_datasets.pl [--options...] --in <filename> <data1> <data2...>

  Options for existing files:
  --in <filename>                  (txt bed gff gtf refFlat ucsc)
  
  Options for new files:
  --db <name | filename>
  --feature <type | type:source | alias>, ...
  --win <integer>                                           (500)
  --step <integer>                                          (win)
  
  Options for data collection:
  --ddb <name | filename>
  --data <none | file | type>, ...
  --method [mean|median|stddev|min|max|range|sum|          (mean)
            count|pcount|ncount]
  --strand [all|sense|antisense]                            (all)
  --force_strand
  --exons
  --fpkm [region|genome]
  
  Adjustments to features:
  --extend <integer>
  --start=<integer>
  --stop=<integer>
  --fstart=<decimal>
  --fstop=<decimal>
  --pos [5|m|3]                                             (5)
  --limit <integer>
  
  General options:
  --out <filename>
  --gz
  --cpu <integer>                                           (2)
  --version
  --help                              show extended documentation

OPTIONS

The command line flags and descriptions:

--in <filename>

Specify an input file containing either a list of database features or genomic coordinates for which to collect data. Any tab-delimited text file with recognizable headers is supported. Gene annotation file formats are also supported, including bed, gtf, gff3, refFlat, and UCSC native formats such as gene prediction tables are all supported. Gene annotation files will be parsed as sequence features. Files may be gzipped compressed.

--out <filename>

Specify the output file name. Required for new feature tables; optional for current files. If this is argument is not specified then the input file is overwritten.

--db <name | filename>

Specify the name of a Bio::DB::SeqFeature::Store annotation database from which gene or feature annotation may be derived. A database is required for generating new data files with features. This option may skipped when using coordinate information from an input file (e.g. BED file), or when using an existing input file with the database indicated in the metadata. For more information about using annotation databases, see https://code.google.com/p/biotoolbox/wiki/WorkingWithDatabases.

--feature <type | type:source | alias>,...
--feature genome

Specify the type of feature from which to collect values. This is required only for new feature tables. Three types of values may be passed: the feature type, feature type and source expressed as 'type:source', or an alias to one or more feature types. Aliases are specified in the biotoolbox.cfg file and provide a shortcut to a list of one or more database features. More than one feature may be included as a comma-delimited list (no spaces).

To collect genomic intervals (or regions) simply specify 'genome' as the feature type.

--win <integer>

When generating a new genome interval list (feature type 'genome'), optionally specify the window size. The default size is defined in the configuration file, biotoolbox.cfg.

--step <integer>

Optionally indicate the step size when generating a new list of intervals across the genome. The default is equal to the window size.

--ddb <name | filename>

If the data to be collected is from a second database that is separate from the annotation database, provide the name of the data database here. Typically, a second Bio::DB::SeqFeature::Store or BigWigSet database is provided here.

--data <type1,type2,type3&type4,...>
--data <file1,...>
--data none

Provide the name of the dataset to collect the values. Use this argument repeatedly for each dataset to be collected. Two or more datasets may be merged into one by delimiting with an ampersand "&" (no spaces!). If no dataset is specified on the command line, then the program will interactively present a list of datasets from the database to select.

The dataset may be a feature type in a BioPerl Bio::DB::SeqFeature::Store or Bio::DB::BigWigSet database. Provide either the feature type or type:source. The feature may point to another data file whose path is stored in the feature's attribute tag (for example a binary Bio::Graphics::Wiggle .wib file, a bigWig file, or Bam file), or the features' scores may be used in data collection.

Alternatively, the dataset may be a database file, including bigWig (.bw), bigBed (.bb), useq (.useq), or Bam alignment (.bam) files. The files may be local or remote (specified with a http: or ftp: prefix).

To force the program to simply write out the list of collected features without collecting data, provide the dataset name of "none".

--method <text>

Specify the method for combining all of the dataset values within the genomic region of the feature. Accepted values include:

  • mean (default)

  • median

  • sum

  • stddev Standard deviation of the population (within the region)

  • min

  • max

  • range Returns difference of max and min

  • count

    Counts the number of overlapping items.

  • pcount (precise count)

    Counts the number of items that precisely fall within the query region. Partially overlapping are not counted.

  • ncount (name count)

    Counts unique names. Useful when spliced alignments overlap more than one exon and you want to avoid double-counting.

--strand [all | sense | antisense]

Specify whether stranded data should be collected for each of the datasets. Either sense or antisense (relative to the feature) data may be collected. Note that strand is not supported with some data files, including bigWig files (unless specified through a GFF3 feature attribute or Bio::DB::BigWigSet database) and Bam files (score coverage is not but count is). The default value is 'all', indicating all data will be collected.

--force_strand

For features that are not inherently stranded (strand value of 0) or that you want to impose a different strand, set this option when collecting stranded data. This will reassign the specified strand for each feature regardless of its original orientation. This requires the presence of a "strand" column in the input data file. This option only works with input file lists of database features, not defined genomic regions (e.g. BED files). Default is false.

--exons

Optionally indicate that data should be collected only over the exon subfeatures of a gene or transcript, rather than the entire gene. Subfeatures with a primary_tag of exon are preferentially taken. If exons are not defined, then CDS and UTR subfeatures are used, or the entire gene or transcript if no appropriate subfeatures are found. Note that the options extend, start, stop, fstart, and fstop are ignored. Default is false.

--fpkm [region|genome]

Optionally indicate that counts should be converted to Fragments Per Kilobase per Million mapped (FPKM). This is a method for normalizing sequence read depth and is used with Bam (or optionally bigBed) files. Two methods exist for normalizing:

region

Uses the sum of counts over all input regions examined and ignores non-counted reads

genome

Uses the sum of all reads across the genome, regardless of whether it was counted in an input region or not

The region method is best used with RNASeq data and a complete gene annotation table. The genome method is best used with partial annotation tables or other Seq types, such as ChIPSeq. This option can only be used with one of the count methods (count, ncount, pcount). The FPKM values are appended as additional columns in the output table.

--extend <integer>

Optionally specify the bp extension that will be added to both sides of the feature's region.

--start=<integer>
--stop=<integer>

Optionally specify adjustment values to adjust the region to collect values relative to the feature position defined by the --pos option (default is the 5' position). A negative value is shifted upstream (5' direction), and a positive value is shifted downstream. Adjustments are always made relative to the feature's strand. Both options must be applied; one is not allowed.

--fstart=<number>
--fstop=<number>

Optionally specify the fractional start and stop position of the region to collect values as a function of the feature's length and relative to the specified feature position defined by the --pos option (default is 5'). The fraction should be presented as a decimal number, e.g. 0.25. Prefix a negative sign to specify an upstream position. Both options must be applied; one is not allowed.

--pos [5|m|3]

Indicate the relative position of the feature with which the data is collected when combined with the "start" and "stop" or "fstart" and "fstop" options. Three values are accepted: "5" indicates the 5' prime end is used, "3" indicates the 3' end is used, and "m" indicates the middle of the feature is used. The default is to use the 5' end, or the start position of unstranded features.

--limit <integer>

Optionally specify the minimum size limit for subfractionating a feature's region. Used in combination with fstart and fstop to prevent taking a subregion from a region too small to support it. The default is 10 bp.

--gz

Indicate whether the output file should (not) be compressed by gzip. If compressed, the extension '.gz' is appended to the filename. If a compressed file is opened, the compression status is preserved unless specified otherwise.

--cpu <integer>

Specify the number of CPU cores to execute in parallel. This requires the installation of Parallel::ForkManager. With support enabled, the default is 2. Disable multi-threaded execution by setting to 1.

--version

Print the version number.

--help

Display the POD documentation for this program.

DESCRIPTION

This program will collect dataset values from a variety of sources, including features in a BioPerl Bio::DB::SeqFeature::Store database, binary wig files (.wib) loaded in a database using Bio::Graphics::Wiggle, bigWig files, bigBed files, Bam alignment files, or a Bio::DB::BigWigSet database.

The values are collected for a list of known database features (genes, transcripts, etc.) or genomic regions (defined by chromosome, start, and stop). The list may be provided as an input file or generated as a new list from a database. Output data files may be reloaded for additional data collection.

At each feature or interval, multiple data points within the genomic segment are combined statistically and reported as a single value for the feature. The method for combining datapoints may be specified; the default method is the mean of all datapoints.

The coordinates of the features may be adjusted in numerous ways, including specifying a specific relative start and stop, a fractional start and stop, an extension to both start and stop, and specifying the relative position (5' or 3' or midpoint).

Stranded data may be collected, if the dataset supports stranded information. Also, two or more datasets may be combined and treated as one. Note that collecting stranded data may significantly slow down data collection.

EXAMPLES

These are some examples of some common scenarios for collecting data.

Simple mean scores

You want to collect the mean score from a bigWig file for each feature in a BED file of intervals.

  get_datasets.pl --data scores.bw --in input.bed
Collect normalized counts

You want to collect normalized read counts from a Bam file of alignments for each feature in a BED file.

  get_datasets.pl --data alignments.bam --method rpm --in input.bed
Collect stranded RNASeq data

You have stranded RNASeq data, and you would like to determine the expression level for all genes in an annotation database.

  get_datasets.pl --db annotation --feature gene --data rnaseq.bam \
  --strand sense --exons --method rpkm --out expression.txt
Restrict to specific region

You have ChIPSeq enrichment scores in a bigWig file and you now want to score just the transcription start site of known transcripts in an annotation database. Here you will restrict to 500 bp flanking the TSS.

  get_datasets.pl --db annotation --feature mRNA --start=-500 \
  --stop=500 --pos 5 --data scores.bw --out tss_scores.txt
Count intervals

You have identified all possible transcription factor binding sites in the genome and put them in a bigBed file. Now you want to count how many exist in each upstream region of each gene.

  get_datasets.pl --db annotation --feature gene --start=-5000 \
  --stop=0 --data tfbs.bb --method count --out tfbs_sums.txt
Many datasets at once

You can place multiple bigWig files in a single directory and treat it as a data database, known as a BigWigSet. Each file becomes a database feature, and you can interactively choose one or more from which to collect. Each dataset is appended to the input file as new column.

  get_datasets.pl --ddb /path/to/bigwigset --in input.txt
Stranded BigWig data

You can generate stranded RNASeq coverage from a Bam file using the BioToolBox script bam2wig.pl, which yields rnaseq_f.bw and rnaseq_r.bw files. These are automatically interpreted as stranded datasets in a BigWigSet context.

  get_datasets.pl --ddb /path/to/rnaseq/bigwigset --strand sense \
  --in input.txt
Binned coverage across the genome

You are interested in sequencing depth across the genome to look for depleted regions. You count reads in 1 kb intervals across the genome.

  get_datasets.pl --db genome.fasta --feature genome --win 1000 \
  --data alignments.bam --method count --out coverage.txt
Middle of feature

You are interested in the maximum score in the central 50% of each feature.

  get_datasets.pl --fstart=0.25 --fstop=0.75 --data scores.bw --in \
  input.txt

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.