NAME

Bio::ToolBox::GeneTools - SeqFeature agnostic methods for working with gene models

SYNOPSIS

    use Bio::ToolBox::GeneTools qw(:all);
    
    my $gene; # a SeqFeatureI compliant gene object obtained elsewhere
              # for example, from Bio::DB::SeqFeature::Store database
              # or parsed from a GFF3, GTF, or UCSC-style gene table using 
              # Bio::ToolBox::parser::(gff,ucsc) parsers
    
    if (is_coding($gene)) { # boolean test
        
        # collect all exons from all transcripts in gene
        my @exons = get_exons($gene);
        
        # find just the alternate exons used only once
        my @alternate_exons = get_alt_exons($gene);
        
        # collect UTRs, which may not be defined in the original source
        my @utrs;
        foreach my $t (get_transcripts($gene)) {
                my @u = get_utrs($t);
                push @utrs, @u;
        }
    }

DESCRIPTION

This module provides numerous exportable functions for working with gene SeqFeature models. This assumes that the gene models follow the BioPerl Bio::SeqFeatureI convention with nested SeqFeature objects representing the gene, transcript, and exons. For example,

    gene
      transcript
        exon
        CDS

Depending upon how the SeqFeatures were generated or defined, subfeatures may or may not be defined or be obvious. For example, UTRs or introns may not be present. Furthermore, the primary_tag or type may not follow Sequence Ontology terms. Regular expressions are deployed to handle varying naming schemes and exceptions.

These functions should work with most or all Bio::SeqFeatureI compliant objects. They have been tested with the Bio::ToolBox::SeqFeature, Bio::SeqFeature::Lite, and Bio::DB::SeqFeature classes.

Newly generated SeqFeature objects use the same class as the object passed to the function, for simplicity and expediency.

METHODS

Function Import

None of the functions are exported by default. Specify which ones you want when you import the module. Alternatively, use one of the tags below.

:all

Import all of the methods.

:exon

Import all of the exon methods, including get_exons(), get_alt_exons(), get_common_exons(), get_uncommon_exons(), and get_alt_common_exons().

:intron

Import all of the intron methods, including get_introns(), get_alt_introns(), get_common_introns(), get_uncommon_introns(), and get_alt_common_introns().

:transcript

Import the transcript related methods, including get_transcripts(), get_transcript_length(), and collapse_transcripts().

:cds

Import the CDS pertaining methods, including is_coding(), get_cds(), get_cdsStart(), get_cdsEnd(), get_transcript_cds_length(), and get_utrs().

:export

Import all of the export methods, including gff_string(), gtf_string(), and ucsc_string().

Exon Methods

Functions to get a list of exons from a gene or transcript

get_exons($gene)
get_exons($transcript)

This will return an array or array reference of all the exon subfeatures in the SeqFeature object, either gene or transcript. No distinction is made between exons used once and those used more than once. Exons that are not explicitly defined can be assembled from CDS and/or UTR subfeatures. Exons are sorted by start coordinate.

get_alt_exons($gene)

This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in only one of the transcripts.

get_common_exons($gene)

This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in all of the transcripts.

get_uncommon_exons($gene)

This will return an array or array reference of all the exon subfeatures for a multi-transcript gene that are used in some of the transcripts, i.e. in more than one but not all.

get_alt_common_exons($gene)

This will return a hash reference with several keys, including "common", "uncommon", and each of the transcript IDs. Each key value is an array reference with the exons for that category. The "common" will be all common exons, "uncommon" will be uncommon exons (used more than once but less than all), and each transcript ID will include their specific alternate exons (used only once).
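The three categories can be illustrated without any SeqFeature objects. The following is a minimal sketch in plain Perl, not the module's implementation: the toy coordinates, the "start-end" keys, and the function name classify_exons() are all hypothetical, and exons are assumed to be identical only when both coordinates match.

```perl
use strict;
use warnings;

# Hypothetical toy data: transcript ID => list of [start, end] exon coordinates.
my %transcripts = (
    tx1 => [ [ 100, 200 ], [ 300, 400 ], [ 500, 600 ] ],
    tx2 => [ [ 100, 200 ], [ 300, 400 ], [ 700, 800 ] ],
    tx3 => [ [ 100, 200 ], [ 500, 600 ], [ 900, 950 ] ],
);

# Sort exons into "common", "uncommon", and per-transcript alternate lists
# by counting how many transcripts use each start-end pair.
sub classify_exons {
    my ($tx) = @_;
    my $total = scalar keys %{$tx};

    # Record which transcripts use each exon, keyed by "start-end".
    my %users;
    for my $id ( sort keys %{$tx} ) {
        for my $exon ( @{ $tx->{$id} } ) {
            push @{ $users{"$exon->[0]-$exon->[1]"} }, $id;
        }
    }

    my %result = ( common => [], uncommon => [] );
    $result{$_} = [] for keys %{$tx};

    # Walk the exons in start order and file each under its category.
    my @ordered = sort {
        ( split /-/, $a )[0] <=> ( split /-/, $b )[0] or $a cmp $b
    } keys %users;
    for my $key (@ordered) {
        my @ids = @{ $users{$key} };
        if    ( @ids == $total ) { push @{ $result{common} },     $key }
        elsif ( @ids == 1 )      { push @{ $result{ $ids[0] } }, $key }
        else                     { push @{ $result{uncommon} },   $key }
    }
    return \%result;
}

my $classified = classify_exons( \%transcripts );
for my $category ( sort keys %{$classified} ) {
    print "$category: @{ $classified->{$category} }\n";
}
```

In this toy gene, the first exon is shared by all three transcripts, two exons are shared by two of them, and tx2 and tx3 each have one exon of their own, while tx1 has none.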

Intron Methods

Functions to get a list of introns from a gene or transcript. Introns are not usually defined in gene annotation files, but are inferred from the exons and total gene or transcript length. In this case, new SeqFeature elements are generated for each intron.

get_introns($gene)
get_introns($transcript)

This will return an array or array reference of all the intron subfeatures in the SeqFeature object, either gene or transcript. No distinction is made between introns used once and those used more than once. Introns that are not explicitly defined can be assembled from CDS and/or UTR subfeatures. Introns are sorted by start coordinate.
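Inferring introns from exons amounts to taking the gaps between consecutive exons. A minimal sketch in plain Perl, assuming 1-based inclusive [start, end] pairs in place of SeqFeature objects; introns_from_exons() is a hypothetical name and not part of this module.

```perl
use strict;
use warnings;

# Derive introns as the gaps between sorted exons.
# Exons are [start, end] pairs with 1-based inclusive coordinates.
sub introns_from_exons {
    my @exons = sort { $a->[0] <=> $b->[0] } @_;
    my @introns;
    for my $i ( 1 .. $#exons ) {
        my $start = $exons[ $i - 1 ][1] + 1;    # first base after previous exon
        my $end   = $exons[$i][0] - 1;          # last base before this exon
        push @introns, [ $start, $end ] if $start <= $end;
    }
    return @introns;
}

my @introns = introns_from_exons( [ 500, 600 ], [ 100, 200 ], [ 300, 400 ] );
print join( ', ', map { "$_->[0]-$_->[1]" } @introns ), "\n";
```

A single-exon transcript yields an empty list, since there is no gap to report.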

get_alt_introns($gene)

This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in only one of the transcripts.

get_common_introns($gene)

This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in all of the transcripts.

get_uncommon_introns($gene)

This will return an array or array reference of all the intron subfeatures for a multi-transcript gene that are used in some of the transcripts, i.e. in more than one but not all.

get_alt_common_introns($gene)

This will return a hash reference with several keys, including "common", "uncommon", and each of the transcript IDs. Each key value is an array reference with the introns for that category. The "common" will be all common introns, "uncommon" will be uncommon introns (used more than once but less than all), and each transcript ID will include their specific alternate introns (used only once).

Transcript Methods

These methods work on transcripts, typically alternate transcripts from a gene SeqFeature.

get_transcripts($gene)

Returns an array or array reference of the transcripts associated with a gene feature.

collapse_transcripts($gene)
collapse_transcripts($transcript1, $transcript2, ...)

This method will collapse all of the transcripts associated with a gene SeqFeature into a single artificial transcript, merging exons as necessary to maximize exon length and minimize introns. This is useful when performing, for example, RNASeq analysis on genes. A single SeqFeature transcript object is returned containing the merged exon subfeatures.
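The merging step can be pictured as a standard interval merge over the pooled exons of all transcripts. The sketch below is plain Perl on [start, end] pairs and is not the module's code; merge_exons() is a hypothetical name, and treating directly adjacent exons as mergeable is an assumption.

```perl
use strict;
use warnings;

# Merge overlapping or directly adjacent [start, end] exons into the
# smallest set of non-overlapping intervals.
sub merge_exons {
    my @sorted = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] } @_;
    my @merged;
    for my $exon (@sorted) {
        if ( @merged and $exon->[0] <= $merged[-1][1] + 1 ) {
            # Overlaps or abuts the previous interval: extend it if needed.
            $merged[-1][1] = $exon->[1] if $exon->[1] > $merged[-1][1];
        }
        else {
            push @merged, [ @{$exon} ];    # copy so the input is not modified
        }
    }
    return @merged;
}

# Hypothetical exons pooled from two transcripts of one gene.
my @pooled = ( [ 100, 200 ], [ 150, 250 ], [ 400, 500 ], [ 450, 480 ], [ 700, 800 ] );
print join( ', ', map { "$_->[0]-$_->[1]" } merge_exons(@pooled) ), "\n";
```

Here five pooled exons collapse into three merged exons, which is the sense in which exon length is maximized and introns are minimized.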

get_transcript_length($transcript)

Calculates and returns the transcribed length of a transcript, i.e. the sum of its exon lengths. Warning! If you pass a gene object, you will get the length of its longest transcript (the maximum of the summed exon lengths across all transcripts), which may not be what you anticipate!
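The arithmetic is simple enough to show on bare coordinates. This is a sketch under stated assumptions: exons are 1-based inclusive [start, end] pairs, the %gene hash and summed_exon_length() are hypothetical stand-ins for SeqFeature objects, and the gene-level value is taken to be the longest transcript.

```perl
use strict;
use warnings;
use List::Util qw(sum0 max);

# Sum the lengths of [start, end] exons (1-based, inclusive coordinates).
sub summed_exon_length {
    my @exons = @_;
    return sum0( map { $_->[1] - $_->[0] + 1 } @exons );
}

# Hypothetical gene with two transcripts of different length.
my %gene = (
    tx1 => [ [ 100, 200 ], [ 300, 400 ] ],
    tx2 => [ [ 100, 200 ], [ 300, 400 ], [ 500, 600 ] ],
);

my %length;
$length{$_} = summed_exon_length( @{ $gene{$_} } ) for keys %gene;
my $longest = max( values %length );

print "$_: $length{$_}\n" for sort keys %length;
print "longest transcript length: $longest\n";
```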

CDS Methods

These methods calculate values related to the coding sequence of the mRNA transcripts.

is_coding($gene)
is_coding($transcript)

This method will return a boolean value indicating whether the passed gene or transcript object appears to be coding. GFF and GTF files are not always clear about the type of transcript; there are (unfortunately) multiple ways to encode the feature as a protein coding transcript: primary_tag, source_tag, attribute, CDS subfeatures, etc. This method tries to determine this.

get_cds($transcript)

Returns the CDS subfeatures of the given transcript, if they are defined. Returns either an array or array reference.

get_cdsStart($transcript)

Returns the start coordinate of the CDS for the given transcript. Note that this is the leftmost (smallest) coordinate of the CDS and not necessarily the coordinate of the start codon, similar to what the UCSC gene tables report. Use the transcript strand to determine the 5' end.

get_cdsEnd($transcript)

Returns the stop coordinate of the CDS for the given transcript. Note that this is the rightmost (largest) coordinate of the CDS and not necessarily the coordinate of the stop codon, similar to what the UCSC gene tables report. Use the transcript strand to determine the 3' end.
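Since both values are strand-independent, picking the 5' and 3' ends takes one extra step. A minimal sketch, assuming the BioPerl strand convention (1 forward, -1 reverse) and plain numbers as input; cds_ends_by_strand() is a hypothetical helper, not a function of this module.

```perl
use strict;
use warnings;

# Pick the 5' and 3' coordinates of a CDS from its leftmost and rightmost
# positions. Strand follows the BioPerl convention: 1 forward, -1 reverse.
# A strand of 0 (unknown) is treated as forward here, which is an assumption.
sub cds_ends_by_strand {
    my ( $cds_start, $cds_end, $strand ) = @_;
    return $strand < 0
        ? ( $cds_end, $cds_start )    # reverse: translation begins at the right
        : ( $cds_start, $cds_end );
}

my ( $five_prime, $three_prime ) = cds_ends_by_strand( 1500, 4200, -1 );
print "5' end: $five_prime, 3' end: $three_prime\n";
```

With real objects, the three inputs would come from get_cdsStart(), get_cdsEnd(), and the transcript's strand.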

get_start_codon($transcript)

Returns a SeqFeature object representing the start codon. If one is not defined in the hierarchy, then a new object is created.

get_stop_codon($transcript)

Returns a SeqFeature object representing the stop codon. If one is not defined in the hierarchy, then a new object is created. Note that this assumes that the stop codon is inclusive to the defined CDS.

get_transcript_cds_length($transcript)

Calculates and returns the length of the coding sequence for a transcript, i.e. the sum of the CDS lengths.

get_utrs($transcript)

Returns the 5' and 3' untranslated regions of the transcript. If these are not defined in the SeqFeature subfeature hierarchy, then they will be calculated from the exon and CDS subfeatures, if available. Non-coding transcripts will not return anything.
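When UTRs must be calculated, they are the exon portions that lie outside the CDS. The sketch below shows one way to do this on plain coordinates; it is not the module's implementation. utrs_from_exons() is a hypothetical name, coordinates are 1-based and inclusive, and the type labels follow Sequence Ontology naming as an assumption.

```perl
use strict;
use warnings;

# Return the exon portions lying outside the CDS as [start, end, type] triples.
# $cds_start and $cds_end are the leftmost and rightmost CDS coordinates.
sub utrs_from_exons {
    my ( $exons, $cds_start, $cds_end, $strand ) = @_;

    # On the reverse strand the 5' UTR lies to the right of the CDS.
    my ( $left_type, $right_type ) =
        $strand < 0
        ? ( 'three_prime_UTR', 'five_prime_UTR' )
        : ( 'five_prime_UTR', 'three_prime_UTR' );

    my @utrs;
    for my $exon ( sort { $a->[0] <=> $b->[0] } @{$exons} ) {
        my ( $start, $end ) = @{$exon};
        if ( $start < $cds_start ) {
            # Exon begins left of the CDS: keep the part before it.
            my $utr_end = $end < $cds_start ? $end : $cds_start - 1;
            push @utrs, [ $start, $utr_end, $left_type ];
        }
        if ( $end > $cds_end ) {
            # Exon finishes right of the CDS: keep the part after it.
            my $utr_start = $start > $cds_end ? $start : $cds_end + 1;
            push @utrs, [ $utr_start, $end, $right_type ];
        }
    }
    return @utrs;
}

my @exons = ( [ 100, 200 ], [ 300, 400 ], [ 500, 600 ] );
for my $utr ( utrs_from_exons( \@exons, 150, 550, 1 ) ) {
    print join( "\t", @{$utr} ), "\n";
}
```

A single exon that spans the whole CDS contributes two UTR pieces, one on each side.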

Export Methods

These methods are used for exporting a gene and/or transcript model into a text string based on the specified format.

gff_string($gene)

This is just a convenience method. SeqFeature objects based on Bio::SeqFeature::Lite, Bio::DB::SeqFeature, or Bio::ToolBox::SeqFeature have a gff_string() method, and this will simply call that method. SeqFeature objects that do not have this method will, of course, cause the script to terminate.

Bio::ToolBox::Data::Feature also provides a gff_string method.

gtf_string($gene)

This will export a gene or transcript model as a series of GTF formatted text lines, following the defined Gene Transfer Format (also known as GFF version 2.5). It will ensure that each feature is properly tagged with the gene_id and transcript_id attributes.

ucsc_string($gene)

This will export a gene or transcript model as a refFlat formatted Gene Prediction line (11 columns). See http://genome.ucsc.edu/FAQ/FAQformat.html#format9 for details. Genes with multiple transcripts are exported as multiple text lines concatenated together.
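For orientation, the 11 refFlat columns are geneName, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, and exonEnds, with 0-based start coordinates. The sketch below builds such a line from a plain hash; the hash layout and refflat_line() are hypothetical, and the exact text produced by ucsc_string() may differ in detail.

```perl
use strict;
use warnings;

# Format one coding transcript, described by a plain hash with 1-based
# inclusive coordinates, as a tab-delimited refFlat line.
sub refflat_line {
    my ($t) = @_;
    my @exons = sort { $a->[0] <=> $b->[0] } @{ $t->{exons} };
    return join( "\t",
        $t->{gene},
        $t->{name},
        $t->{chrom},
        $t->{strand} < 0 ? '-' : '+',
        $t->{start} - 1,        # txStart, converted to 0-based
        $t->{end},              # txEnd
        $t->{cds_start} - 1,    # cdsStart, converted to 0-based
        $t->{cds_end},          # cdsEnd
        scalar(@exons),
        join( ',', map { $_->[0] - 1 } @exons ) . ',',    # exonStarts
        join( ',', map { $_->[1] } @exons ) . ',',        # exonEnds
    ) . "\n";
}

# Hypothetical transcript on the reverse strand.
print refflat_line(
    {
        gene      => 'GENE1',
        name      => 'GENE1.t1',
        chrom     => 'chr1',
        strand    => -1,
        start     => 100,
        end       => 600,
        cds_start => 150,
        cds_end   => 550,
        exons     => [ [ 100, 200 ], [ 300, 400 ], [ 500, 600 ] ],
    }
);
```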

Filter Methods

These methods are used to filter genes.

filter_transcript_support_level($gene)
filter_transcript_support_level($gene, $level)

This will filter a gene object for transcripts that match or exceed the provided transcript support level. This assumes that the transcripts contain the attribute tag 'transcript_support_level', which is present in Ensembl provided GFF3 and GTF annotation files. The values are a digit (1-5), or 'NA', where 1 is experimentally supported and 5 is entirely predicted with no experimental evidence. See, for example, the Ensembl TSL glossary entry.

Pass a gene SeqFeature object with one or more transcript subfeatures. Alternatively, an array reference of transcripts could be passed as well.

A level may be provided as a second argument. The default is 'best'.

best

Only the transcripts with the best support level present in the gene (that is, the lowest digit found) will be retained.

best<digit>

All transcripts up to the indicated level are retained. For example, 'best3' would indicate that transcripts with support levels 1, 2, and 3 would be retained.

<digit>

Only transcripts at the given level are retained.

NA

Only transcripts with 'NA' as the value are retained. These are typically pseudogenes or single-exon transcripts.

If none of the transcripts have the attribute, then all are returned (nothing is filtered).

If a gene object was provided, a new gene object will be returned with only the retained transcripts as subfeatures. If an array reference of transcripts was provided, then an array reference of the filtered transcripts is returned.
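The level rules above can be sketched on plain hashes. This is an illustration only, not the module's code: filter_tsl() and the hash layout are hypothetical, and two behaviors are assumptions, namely that 'best' falls back to 'NA' when no digit is present, and that transcripts lacking the attribute are dropped when other transcripts carry it.

```perl
use strict;
use warnings;

# Filter hypothetical transcripts, each a hash with an id and an optional
# tsl value (1-5 or 'NA'), according to the level rules described above.
sub filter_tsl {
    my ( $transcripts, $level ) = @_;
    $level = 'best' unless defined $level;

    # With no TSL attribute anywhere, nothing is filtered.
    my @tagged = grep { defined $_->{tsl} } @{$transcripts};
    return @{$transcripts} unless @tagged;

    if ( $level eq 'best' ) {
        # Lowest digit present wins; fall back to NA (an assumption).
        my @digits = sort { $a <=> $b } grep { /^\d$/ } map { $_->{tsl} } @tagged;
        my $want = @digits ? $digits[0] : 'NA';
        return grep { $_->{tsl} eq $want } @tagged;
    }
    elsif ( $level =~ /^best([1-5])$/ ) {
        # Everything from level 1 up to and including the given digit.
        my $max = $1;
        return grep { $_->{tsl} =~ /^\d$/ and $_->{tsl} <= $max } @tagged;
    }
    else {
        # A bare digit or 'NA': exact match only.
        return grep { $_->{tsl} eq $level } @tagged;
    }
}

my @transcripts = (
    { id => 'tx1', tsl => 1 },
    { id => 'tx2', tsl => 3 },
    { id => 'tx3', tsl => 5 },
    { id => 'tx4', tsl => 'NA' },
);
for my $level (qw(best best3 5 NA)) {
    my @kept = map { $_->{id} } filter_tsl( \@transcripts, $level );
    print "$level: @kept\n";
}
```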

AUTHOR

 Timothy J. Parnell, PhD
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.