README - metacpan.org

NAME
    Bio::Regexp - Exhaustive DNA/RNA/protein regexp searches

SYNOPSIS
        my @matches = Bio::Regexp->new->dna
                                 ->add('A?GCYY[^G]{2,3}GCGC')
                                 ->add('GAATTC')
                                 ->circular
                                 ->match($input);

        ## Example match:
        {
          'match' => 'AGCTCAAAGCGC',
          'start' => '0',
          'end' => '12',
          'strand' => 1,
          'regexp' => 'A?GCYY[^G]{2,3}GCGC'
        }

DESCRIPTION
    This module is for searching inside DNA or RNA or protein sequences. The
    sequence to be found is specified by a restricted version of regular
    expressions. The restrictions allow us to manipulate the regexp in
    various ways described below. As well as regular expression character
    classes, bases can be expressed in IUPAC short form (which are kind of
    like character classes themselves).

    The goal of this module is to provide a complete search. Given the
    particulars of a sequence (DNA/RNA/protein, linear molecule/circular
    plasmid, single/double stranded) it attempts to figure out all of the
    possible matches without any false-positive or duplicated matches.

    It handles cases where matches overlap in the sequence and cases where
    the regular expression can match in multiple ways. For circular DNA
    (plasmids) it will find matches even if they span the arbitrary location
    in the circular sequence selected as the "start". For double-stranded
    DNA it will find matches on the reverse complement strand as well.

    The typical use case of this module is to search for multiple small
    patterns in large amounts of input data. Although it is optimised for
    that task it is also efficient at others. For efficiency, none of the
    input sequence data is copied at all except to extract matches (but this
    can be disabled with "no_substr") and to implement circular searches
    (though the amount copied is usually very small).

INPUT FORMAT
    The input string passed to "match" must be a nucleotide sequence for now
    (protein sequences will be supported soon). There must be no line breaks
    or other whitespace, or any other kind of FASTA-like header/data.

    If your data does not conform to the description above then the results
    are undefined and you should sanitise your data before using this
    module.

    If your data is anything other than DNA (the default) you must call one
    of the type functions like "rna" or "protein":

        my $re = Bio::Regexp->new->rna->add('GAUAUC')->compile;

    Normally however "T" and "U" are both compiled into "[TU]" so your
    patterns will work on DNA and RNA. If you wish to prevent this and throw
    an error while compiling your regexp, call "strict_thymine_uracil".

    Unless "strict_case" is specified, the case of your patterns and the
    case of your input doesn't matter. I suggest using uppercase everywhere.

EXHAUSTIVE SEARCH
    Most methods of searching nucleotide sequences will only find
    non-overlapping matches in the input. For example, when searching for
    the sequence "AA" in the input "AAAA", perl's "m/AA/g" searches will
    only return 2 matches:

        AAAA
        --
          --

    With this module you get all three matches:

        AAAA
        --
         --
          --

    For DNA data this can be useful for finding the comprehensive set of
    possible molecules that could exist after a restriction enzyme cleaving.

INTERBASE COORDINATES
    All offsets returned by this module are in "interbase coordinates".
    Rather than the first base in a sequence being described as "base 1" as
    most biologists might think of it, or even "base 0" as computer
    scientists might, with interbase coordinates the first base is described
    as the sequence spanning coordinates 0 through 1.

    One of the reasons this is useful is because it allows us to
    unambiguously specify 0-width sequences like for example endonuclease
    cut sites. If index-style coordinates are used it is ambiguous whether
    the cut is before or after.

    Unlike with string indices, the start coordinate can be greater than the
    end coordinate. This happens when "double_stranded" is set (the default
    for DNA) and the pattern is found on the reverse complement strand. Use
    "single_stranded" if you don't want reverse complement matches.

    For circular inputs, interbase coordinates can also be greater than the
    length of the input. This is interpreted as wrapping back around to the
    beginning in a modular arithmetic fashion. Similarly, negative
    coordinates wrap around to the end of the input. "Out-of-range"
    interbase coordinates are only defined for circular inputs and
    referencing them on linear inputs will throw errors.

IUPAC SHORT FORMS
    For DNA and RNA, IUPAC incompletely specified nucleotide sequences can
    be used. These are analogous to regular expression character classes.
    Just like perl's "\s" is short for "[ \r\n\t]", in IUPAC form "V" is
    short for "[ACG]", or "[^T]". Unless "strict_thymine_uracil" is in
    effect this will actually be like "[^TU]" for both DNA and RNA inputs.

    See wikipedia <http://en.wikipedia.org/wiki/Nucleic_acid_notation> for
    the list of IUPAC short forms.

ADDING MULTIPLE SEARCH PATTERNS
    An important feature of this module is that any number of regular
    expressions can be combined into one so that many patterns can be
    searched for simultaneously while doing a single pass over the data.

    Doing a single pass is generally more efficient because of memory
    locality and has other positive side-effects. For instance, we can also
    scan a strand's reverse complement during the pass and therefore avoid
    copying and reversing the input (which may be quite large).

    This module should be able to support quite a large number of
    simultaneous search patterns although I have some ideas for future
    optimisations if they prove necessary. Large numbers of patterns may
    come in handy when building a list of all restriction enzymes that don't
    cut a target sequence, or finding all PCR primer sites accounting for
    IUPAC expanded primers.

    Multiple patterns can be added at once simply by calling "add()"
    multiple times before attempting a "match" (or a "compile"):

        my $re = Bio::Regexp->new;

        $re->add($_) for ('GAATTC', 'CCWGG');

        my @matches = $re->match($input);

    Which pattern matched is returned as the "match" key in the returned
    match results. You should probably have a hash of all your patterns so
    that you can look them up while processing matches. The way this is
    implemented is similar to the very useful Regexp::Assemble except
    without the hacks needed for ancient perl versions.

    When matching, only a single pass will be made over the data so as to
    find all possible locations that either of the added sequences could
    have matched. Large numbers of patterns should be fairly efficient
    because the perl 5.10+ regular expression engine uses a trie data
    structure for such patterns (and 5.10 is the minimum required perl for
    other reasons).

CIRCULAR INPUTS
    If the "circular" method is called, the search sequence "GAATTC" will
    match the following input:

        ATTCGGGGGGGGGGGGGGGGGGA
        ----                 --

    The "start" and "end" coordinates for one of the matches will be 21 and
    27. Since the input's length is only 23, we know that it must have
    wrapped around. In this case there will be another match of coordinates
    at 27 and 21 because "GAATTC" is a palindromic sequence.

    In order to make this efficient even with really long input sequences,
    this module copies only the maximum length your search pattern could
    possibly be. Being able to figure out the minimum and maximum sequence
    lengths is one of the reasons why the types of regular expressions you
    can use with this module are limited.

SEE ALSO
    Bio-Regexp github repo <https://github.com/hoytech/Bio-Regexp>

    Presentation about Bio::Regexp and more: Getting the most out of regular
    expressions <http://hoytech.github.io/regexp-presentation/>

    Bio::Tools::SeqPattern from the BioPerl distribution also allows the
    manipulation of patterns but is less advanced than this module. Also,
    the way Bio::Tools::SeqPattern reverses a regular expression in order to
    match the reverse complement is... wow. Just wow. :)

    Bio::Grep is an interface to various programs that search biological
    sequences. Bio::Grep::Backend::RE is probably the most comparable to
    this module.

    Bio::DNA::Incomplete

AUTHOR
    Doug Hoyte, "<doug@hcsw.org>"

COPYRIGHT & LICENSE
    Copyright 2013 Doug Hoyte.

    This module is licensed under the same terms as perl itself.
	Global
`s`	Focus search bar
`?`	Bring up this help dialog
	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)
	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse
	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)