NAME

BioX::Seq::Utils - miscellaneous sequence-related functions

SYNOPSIS

    if ( is_nucleic($seq) ) {
        $seq = rev_com( $seq );
    }

    my @orfs = all_orfs(
        $seq,
        3,   # ORF mode
        200, # min length
    );

    my $re = build_ORF_regex(
        0,   # ORF mode
        300, # min length
    );

DESCRIPTION

BioX::Seq::Utils contains a number of sequence-related functions. These are general-purpose functions that are used often enough to warrant inclusion in a library but not often enough to warrant addition to the core BioX::Seq class. The module may also include commonly used functions that do not make sense as BioX::Seq methods, as well as functions that mirror BioX::Seq methods but operate on raw strings. All of them act on simple scalars and arrays rather than objects.

NOTE: Use of this module is considered deprecated. It is retained within the BioX::Seq package because a number of existing software tools rely on it, but at some point in the future these functions will likely find a new home elsewhere.

FUNCTIONS

rev_com

    my $rc = rev_com($seq);

Takes a single scalar argument and returns a scalar containing the reverse complement. Throws an exception if the input value doesn't look like a nucleic acid sequence.
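
For readers unfamiliar with the operation, the following is a minimal sketch of a reverse complement on a raw string. It is not the module's implementation: the name rev_com_sketch, the exact set of accepted characters, and the error message are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for rev_com(); NOT the module's actual code.
sub rev_com_sketch {
    my ($seq) = @_;

    # Assumption: only IUPAC nucleotide letters are accepted (either case).
    die "input does not look like a nucleic acid sequence\n"
        if $seq =~ /[^ACGTURYSWKMBDHVN]/i;

    # Reverse the string, then swap each code for its complement.
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTURYSWKMBDHVNacgturyswkmbdhvn/TGCAAYRSWMKVHDBNtgcaayrswmkvhdbn/;
    return $rc;
}

print rev_com_sketch('ATGCCC'), "\n";    # prints GGGCAT
```

Ambiguity codes are complemented as well (for example R becomes Y), and U is treated like T on input.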

is_nucleic

    if ( is_nucleic($seq) ) {
        # do something
    }

Takes a single scalar argument and returns a boolean value indicating whether the scalar "looks like" a nucleic acid string (i.e. contains no characters but valid IUPAC nucleic acid codes).
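
The check described above can be approximated with a single negated character class. This is only a sketch under assumptions: the function name looks_nucleic is hypothetical, and the module may accept a different character set (for example gap characters).

```perl
use strict;
use warnings;

# Hypothetical approximation of is_nucleic(); NOT the module's actual code.
# Returns 1 if the string contains nothing but IUPAC nucleotide codes.
sub looks_nucleic {
    my ($seq) = @_;
    return $seq !~ /[^ACGTURYSWKMBDHVN]/i ? 1 : 0;
}

for my $candidate (qw/ACGTN acgu MEEPQ MRK/) {
    printf "%-6s %s\n", $candidate, looks_nucleic($candidate) ? 'yes' : 'no';
}
```

A test of this kind is purely character-based, which is why the description says "looks like": a short peptide such as MRK consists entirely of letters that are also valid nucleotide ambiguity codes and so passes the check.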

all_orfs

    my @orfs = all_orfs(
        $seq,
        2,   # ORF mode
        100, # min length
    );
    for my $orf (@orfs) {
        my ($seq, $start, $end) = @{$orf};
    }

Takes one required argument (a sequence string) and two optional arguments (ORF mode and minimum length) and returns a list of array references representing all ORFs in all reading frames of the sequence. Each reference contains three values: the ORF sequence, the start position, and the end position. The strand can be determined by comparing the start and end positions (ORFs on the reverse strand will have start > end). See build_ORF_regex() for an explanation of the possible values for ORF mode.
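
The strand rule can be applied without calling the module at all, as long as the records follow the documented three-element layout. In this sketch the ORF records are hand-written, the coordinates are invented for illustration, and the helper name orf_strand is hypothetical.

```perl
use strict;
use warnings;

# Made-up records in the documented layout: [sequence, start, end].
my @orfs = (
    [ 'ATGAAATAG', 10, 18 ],
    [ 'ATGCCCTGA', 42, 34 ],
);

# Hypothetical helper: reverse-strand ORFs have start > end.
sub orf_strand {
    my ($orf) = @_;
    my (undef, $start, $end) = @{$orf};
    return $start > $end ? '-' : '+';
}

for my $orf (@orfs) {
    my ($seq, $start, $end) = @{$orf};
    printf "%s\t%d\t%d\t%s\n", $seq, $start, $end, orf_strand($orf);
}
```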

build_ORF_regex

    my $re = build_ORF_regex(
        3,
        300,
    );

Builds a regular expression for matching open reading frames in a nucleic acid sequence string. Takes two required arguments that are used for building the regular expression:

  • mode - an integer from 0-3 defining the type of open reading frame detected.

    • 0 - any set of codons not containing a stop codon

    • 1 - must end with stop codon

    • 2 - must begin with start codon

    • 3 - must begin with start codon and end with stop codon

  • min_len - an integer representing the minimum number of nucleotides an open reading frame must contain to be returned (not including the stop codon)

The return value is a compiled regular expression that can be used to search a sequence string. The pos() function should be used on the string to set the frame to be searched (0-2) prior to applying the regex.
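
The pos() technique can be illustrated with a hand-written pattern. The regex below is not the one build_ORF_regex() returns; it is an assumed, simplified pattern that resembles mode 3 (start codon through stop codon), handles uppercase ACGT only, and ignores the minimum length. The function name orfs_in_frame is likewise hypothetical.

```perl
use strict;
use warnings;

# Illustrative pattern only; NOT the regex produced by build_ORF_regex().
my $stop = qr/T(?:AA|AG|GA)/;
# \G anchors at pos(); whole codons are skipped lazily so every match
# stays in the frame that was set with pos().
my $re = qr/\G(?:[ACGT]{3})*?(ATG(?:(?!$stop)[ACGT]{3})*$stop)/;

# Returns [orf_sequence, zero_based_offset] pairs for one frame (0-2).
sub orfs_in_frame {
    my ($seq, $frame) = @_;
    my @hits;
    pos($seq) = $frame;               # choose the reading frame
    while ($seq =~ /$re/g) {
        push @hits, [ $1, $-[1] ];    # $-[1] is where capture group 1 starts
    }
    return @hits;
}

my $dna = 'CCATGAAATAGGGATGCCCTGA';
for my $frame (0 .. 2) {
    for my $hit (orfs_in_frame($dna, $frame)) {
        print "frame $frame: $hit->[0] at offset $hit->[1]\n";
    }
}
```

Because each match ends on a codon boundary, the next iteration resumes in the same frame, so one pos() assignment per frame is enough.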

CAVEATS AND BUGS

Please report bugs to the author.

AUTHOR

Jeremy Volkening <jeremy *at* base2bio.com>

COPYRIGHT AND LICENSE

Copyright 2014 Jeremy Volkening

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.