NAME

BioUtil::Seq - Utilities for sequence

Great modules such as BioPerl provide many robust solutions. However, they are not easy to install on some platforms, and for simple scripts a lightweight module may be a better choice. So I reinvented some wheels and added some useful utilities to this module, hoping it would be helpful.

VERSION

Version 2015.0309

EXPORT

    FastaReader
    read_sequence_from_fasta_file 
    write_sequence_to_fasta_file 
    format_seq

    validate_sequence 
    complement
    revcom 
    base_content 
    degenerate_seq_to_regexp
    match_regexp
    dna2peptide 
    codon2aa 
    generate_random_seqence

    shuffle_sequences 
    rename_fasta_header 
    clean_fasta_header

SYNOPSIS

  use BioUtil::Seq;

SUBROUTINES/METHODS

FastaReader

FastaReader is a fasta file parser built on a closure. It returns an anonymous subroutine; each call to that subroutine returns the next fasta record as a reference to an array containing the fasta header and the sequence.

FastaReader can also read from STDIN when the file name is "STDIN" or "stdin".

An optional boolean argument controls trimming. If it is true, whitespace in the sequence (blank, tab, "return" ("\r") and "new line" ("\n")) is not trimmed.

FastaReader gains speed by setting the special Perl variable $/ to "\n>", so that each read returns one whole record. This was done with the kind help of Mario Roy, author of MCE (https://code.google.com/p/many-core-engine-perl/), who also contributed many other optimizations.

Example:

   # do not trim the spaces and \n
   # $not_trim = 1;
   # my $next_seq = FastaReader("test.fa", $not_trim);
   
   # read from STDIN
   # my $next_seq = FastaReader('STDIN');
   
   # read from file
   my $next_seq = FastaReader("test.fa");

   while ( my $fa = $next_seq->() ) {
       my ( $header, $seq ) = @$fa;

       print ">$header\n$seq\n";
   }
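
The closure and "\n>" record separator described above can be sketched roughly as follows. This is only an illustration of the technique, not the module's actual code: fasta_reader_sketch is a hypothetical name, it takes an already opened filehandle rather than a file name, and it always trims whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: returns a closure that
# yields one [header, sequence] pair per call, or undef at end of input.
sub fasta_reader_sketch {
    my ($fh) = @_;
    my $first = 1;    # only the first record still carries its leading ">"

    return sub {
        local $/ = "\n>";    # each <$fh> now returns one whole record
        while ( defined( my $rec = <$fh> ) ) {
            chomp $rec;      # drops the trailing "\n>" separator, if present
            if ($first) { $rec =~ s/^>//; $first = 0; }
            next unless $rec =~ /\S/;    # skip empty chunks

            my ( $header, $seq ) = split /\n/, $rec, 2;
            $seq = '' unless defined $seq;
            $header =~ s/\r$//;
            $seq    =~ s/\s+//g;    # trim blanks, tabs, "\r" and "\n"
            return [ $header, $seq ];
        }
        return;    # end of input
    };
}

# Usage with an in-memory filehandle instead of a real file.
my $fasta = ">seq1 demo\nACGT\nAC GT\n>seq2\nTTTT\n";
open my $fh, '<', \$fasta or die "cannot open in-memory handle: $!";
my $next_seq = fasta_reader_sketch($fh);
while ( my $fa = $next_seq->() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
```

Setting $/ with "local" inside the closure keeps the change from leaking into other code that reads line by line.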

read_sequence_from_fasta_file

Read all sequences from a fasta file into a hash reference that maps each header to its sequence.

Example:

    my $seqs = read_sequence_from_fasta_file($file);
    for my $header (keys %$seqs) {
        my $seq = $$seqs{$header};
        print ">$header\n$seq\n";
    }

write_sequence_to_fasta_file

Write sequences, given as a hash reference that maps each header to its sequence, to a fasta file.

Example:

    my $seq = {"seq1" => "acgagaggag"};
    write_sequence_to_fasta_file($seq, "seq.fa");

format_seq

Format a sequence as readable text.

Example:

    printf ">%s\n%s", $head, format_seq($seq, 60);
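
As a rough idea of what such formatting involves, the sketch below breaks a sequence into fixed-width lines. The name wrap_seq_sketch, the default width of 70, and the trailing newline are assumptions made for illustration; they are not taken from format_seq itself.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: break a sequence into
# lines of at most $width characters, each terminated by "\n".
sub wrap_seq_sketch {
    my ( $seq, $width ) = @_;
    $width = 70 unless defined $width && $width > 0;    # assumed default

    my $out = '';
    for ( my $i = 0; $i < length($seq); $i += $width ) {
        $out .= substr( $seq, $i, $width ) . "\n";
    }
    return $out;
}

print wrap_seq_sketch( "ACGTACGTAC", 4 );
```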

validate_sequence

Validate a sequence.

Legal symbols:

    DNA: ACGTRYSWKMBDHV
    RNA: ACGURYSWKMBDHV
    Protein: ACDEFGHIKLMNPQRSTVWY
    gap and space: - *.

Example:

    if (validate_sequence($seq)) {
        # do something
    }
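
One way to express the legal symbols listed above is a single negated character class. The sketch below assumes that matching is case-insensitive, that any whitespace counts as "space", and that ".", "-" and "*" are all accepted; validate_sequence_sketch is a hypothetical name and the real subroutine may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq. The character class is
# the union of the DNA, RNA and protein alphabets listed above, plus
# "-", "*", "." and whitespace.
sub validate_sequence_sketch {
    my ($seq) = @_;
    return 0 unless defined $seq;
    return $seq =~ /[^ABCDEFGHIKLMNPQRSTUVWY\-*.\s]/i ? 0 : 1;
}

for my $seq ( "ACGTN-acgu", "MKV*", "ACGT123" ) {
    printf "%-12s %s\n", $seq,
        validate_sequence_sketch($seq) ? "valid" : "invalid";
}
```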

complement

Complement sequence

IUPAC nucleotide code: ACGTURYSWKMBDHVN

http://droog.gs.washington.edu/parc/images/iupac.html

    code    base           complement
    A       A              T
    C       C              G
    G       G              C
    T/U     T              A

    R       A/G            Y
    Y       C/T            R
    S       C/G            S
    W       A/T            W
    K       G/T            M
    M       A/C            K

    B       C/G/T          V
    D       A/G/T          H
    H       A/C/T          D
    V       A/C/G          B

    X/N     A/C/G/T        X
    .       not A/C/G/T
    or -    gap

Example:

    my $comp = complement($seq);

revcom

Reverse complement sequence

Example:

    my $revcom = revcom($seq);
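
The IUPAC complement table maps naturally onto Perl's tr/// operator. The sketch below is an illustration under assumptions: complement_sketch and revcom_sketch are hypothetical names, case is preserved, and symbols that are their own complement or have no listed complement (S, W, N, X, gaps) are left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of BioUtil::Seq.
sub complement_sketch {
    my ($seq) = @_;
    # A<->T, C<->G, U->A, R<->Y, K<->M, B<->V, D<->H; the rest is untouched.
    ( my $comp = $seq )
        =~ tr/ACGTURYKMBDHVacgturykmbdhv/TGCAAYRMKVHDBtgcaayrmkvhdb/;
    return $comp;
}

sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch("ACGTRYSWKMBDHVN"), "\n";
print revcom_sketch("AACG-n"), "\n";
```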

base_content

Example:

    my $gc_content = base_content('gc', $seq);
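
A base content calculation of this kind can be sketched as a case-insensitive count divided by the sequence length. The name base_content_sketch and the choice to return a plain fraction between 0 and 1 are assumptions; the return format of base_content may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: fraction of characters
# in $seq that belong to the set $bases, ignoring case.
sub base_content_sketch {
    my ( $bases, $seq ) = @_;
    return 0 if !defined $seq   || $seq eq '';
    return 0 if !defined $bases || $bases eq '';

    # Count matches by assigning the match list to an empty list.
    my $count = () = $seq =~ /[\Q$bases\E]/gi;
    return $count / length($seq);
}

printf "GC content: %.2f\n", base_content_sketch( 'gc', 'ACGTacgt' );
```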

degenerate_seq_to_regexp

Translate degenerate sequence to regular expression
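
The translation can be pictured as replacing each degenerate IUPAC code with a character class of the bases it stands for. The sketch below assumes upper-case output and passes unknown symbols through as literals; degenerate_to_regexp_sketch is a hypothetical name, and the output format of degenerate_seq_to_regexp may differ.

```perl
use strict;
use warnings;

# IUPAC degenerate codes and the bases they stand for.
my %IUPAC_CLASS = (
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper, not part of BioUtil::Seq.
sub degenerate_to_regexp_sketch {
    my ($seq) = @_;
    my $re = '';
    for my $base ( split //, uc $seq ) {
        $re .= exists $IUPAC_CLASS{$base}
            ? $IUPAC_CLASS{$base}
            : quotemeta($base);    # plain bases and other symbols, literally
    }
    return $re;
}

my $re = degenerate_to_regexp_sketch("GAYN");
print "$re\n";
print "match\n" if "TTGACGTT" =~ /$re/;
```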

match_regexp

Find all sites matching the regular expression.

See https://github.com/shenwei356/bio_scripts/blob/master/sequence/fasta_locate_motif.pl
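
Collecting every match site can be sketched with a global match loop and the @- and @+ offset arrays. The name match_regexp_sketch, the argument order, and the return shape (1-based start, end, matched text) are assumptions for illustration; overlapping matches are not reported in this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: returns a reference to
# an array of [start, end, matched_text], with 1-based inclusive positions.
sub match_regexp_sketch {
    my ( $re, $seq ) = @_;
    my @hits;
    while ( $seq =~ /$re/g ) {
        my ( $from, $to ) = ( $-[0], $+[0] );    # 0-based start, end + 1
        push @hits, [ $from + 1, $to, substr( $seq, $from, $to - $from ) ];
    }
    return \@hits;
}

my $hits = match_regexp_sketch( qr/GATC/i, "AAGATCGGATC" );
print join( "\t", @$_ ), "\n" for @$hits;
```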

dna2peptide

Translate DNA sequence into a peptide

codon2aa

Translate a DNA 3-character codon to an amino acid
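
Both translation subroutines rest on a codon table. The sketch below builds the standard genetic code from its usual compact TCAG-ordered string and translates in frame 1. The names codon2aa_sketch and dna2peptide_sketch are hypothetical, and returning "X" for unrecognized codons and ignoring trailing bases are assumptions rather than documented behavior of the module.

```perl
use strict;
use warnings;

# Standard genetic code, codons enumerated in T, C, A, G order.
my @BASES = qw(T C A G);
my $AMINO = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %CODON_TABLE;
my $idx = 0;
for my $b1 (@BASES) {
    for my $b2 (@BASES) {
        for my $b3 (@BASES) {
            $CODON_TABLE{"$b1$b2$b3"} = substr( $AMINO, $idx++, 1 );
        }
    }
}

# Hypothetical helpers, not part of BioUtil::Seq.
sub codon2aa_sketch {
    my ($codon) = @_;
    ( my $key = uc $codon ) =~ tr/U/T/;    # accept RNA codons as well
    return exists $CODON_TABLE{$key} ? $CODON_TABLE{$key} : 'X';
}

sub dna2peptide_sketch {
    my ($dna) = @_;
    my $peptide = '';
    for ( my $pos = 0; $pos + 3 <= length($dna); $pos += 3 ) {
        $peptide .= codon2aa_sketch( substr( $dna, $pos, 3 ) );
    }
    return $peptide;    # an incomplete trailing codon is ignored
}

print dna2peptide_sketch("ATGGCCTAA"), "\n";
```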

generate_random_seqence

Example:

    my @alphabet = qw/a c g t/;
    my $seq = generate_random_seqence( \@alphabet, 50 );
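
Drawing each position uniformly from the alphabet is one simple way to produce such a sequence. The sketch below uses Perl's built-in rand; random_seq_sketch is a hypothetical name, and the sampling scheme of generate_random_seqence may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: pick $length symbols
# uniformly at random from the array referenced by $alphabet.
sub random_seq_sketch {
    my ( $alphabet, $length ) = @_;
    my $seq = '';
    for ( 1 .. $length ) {
        $seq .= $alphabet->[ int rand scalar @$alphabet ];
    }
    return $seq;
}

my @alphabet = qw/a c g t/;
print random_seq_sketch( \@alphabet, 50 ), "\n";
```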

shuffle_sequences

Example:

    shuffle_sequences($file, "$file.shuf.fa");

rename_fasta_header

Rename fasta header with regexp.

Example:

    # delete some symbols
    my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
    print "$n records renamed\n";

clean_fasta_header

Replace given symbols with a replacement string, because some symbols in a fasta header can cause unexpected results.

Example:

    my $file = "test.fa";
    my $n = clean_fasta_header($file, "$file.rename.fa");
    # replace any symbol in (\/:*?"<>|) with '', i.e. deleting.
    # my $n = clean_fasta_header($file, "$file.rename.fa", '',  '\/:*?"<>|');
    print "$n records renamed\n";
