BioUtil::Seq - Utilities for sequence
Great modules such as BioPerl provide many robust solutions. However, BioPerl is not easy to install on some platforms, and for simple scripts a lightweight module may be a better choice. So I reinvented some wheels and added some useful utilities to this module, hoping it will be helpful.
Version 2015.0309
    FastaReader
    read_sequence_from_fasta_file
    write_sequence_to_fasta_file
    format_seq
    validate_sequence
    complement
    revcom
    base_content
    degenerate_seq_to_regexp
    match_regexp
    dna2peptide
    codon2aa
    generate_random_seqence
    shuffle_sequences
    rename_fasta_header
    clean_fasta_header
use BioUtil::Seq;
FastaReader is a fasta file parser implemented as a closure. FastaReader returns an anonymous subroutine; each call to it returns the next fasta record, which is a reference to an array containing the fasta header and the sequence.
FastaReader could also read from STDIN when the file name is "STDIN" or "stdin".
A boolean argument is optional. If it is set to a true value, whitespace in the sequence, including blanks, tabs, carriage returns ("\r") and newlines ("\n"), will not be trimmed.
FastaReader is fast because it sets the special Perl variable $/ to "\n>", so that each read returns one whole record. This was done with the kind help of Mario Roy, author of MCE (https://code.google.com/p/many-core-engine-perl/), who also contributed many other optimizations.
Example:
    # do not trim the spaces and \n
    # $not_trim = 1;
    # my $next_seq = FastaReader("test.fa", $not_trim);

    # read from STDIN
    # my $next_seq = FastaReader('STDIN');

    # read from file
    my $next_seq = FastaReader("test.fa");

    while ( my $fa = &$next_seq() ) {
        my ( $header, $seq ) = @$fa;

        print ">$header\n$seq\n";
    }
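The closure and $/ technique described above can be sketched as follows. This is a minimal illustration, not the module's actual implementation: `make_fasta_reader` is a hypothetical helper that takes an already opened filehandle, and it always trims whitespace from the sequence.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: returns a closure that
# yields one [header, sequence] array reference per call, undef at the end.
sub make_fasta_reader {
    my ($fh) = @_;
    return sub {
        local $/ = "\n>";    # read one whole fasta record at a time
        while ( defined( my $chunk = <$fh> ) ) {
            chomp $chunk;         # remove the trailing "\n>" separator
            $chunk =~ s/^>//;     # the first record still starts with ">"
            my ( $header, $seq ) = split /\n/, $chunk, 2;
            next unless defined $header && length $header;
            $seq = '' unless defined $seq;
            $seq =~ s/\s+//g;     # trim line breaks and other whitespace
            return [ $header, $seq ];
        }
        return;                   # no more records
    };
}

# Usage with an in-memory file, so the example needs no test.fa on disk.
my $data = ">seq1 desc\nACGT\nGGCC\n>seq2\nTTAA\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $next_seq = make_fasta_reader($fh);
while ( my $fa = $next_seq->() ) {
    my ( $header, $seq ) = @$fa;
    print ">$header\n$seq\n";
}
```

Because the reader keeps its filehandle inside the closure, several readers can be open at once without sharing state.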
Read all sequences from fasta file.
    my $seqs = read_sequence_from_fasta_file($file);
    for my $header (keys %$seqs) {
        my $seq = $$seqs{$header};
        print ">$header\n$seq\n";
    }
    my $seq = {"seq1" => "acgagaggag"};
    write_sequence_to_fasta_file($seq, "seq.fa");
Format sequence to readable text
printf ">%s\n%s", $head, format_seq($seq, 60);
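To illustrate what wrapping a sequence to a fixed width involves, here is a minimal sketch. `wrap_sequence` is a hypothetical stand-in, not format_seq itself; the default width of 70 and the trailing newline on the last line are assumptions of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: breaks a sequence into
# lines of at most $width characters, each terminated by a newline.
sub wrap_sequence {
    my ( $seq, $width ) = @_;
    $width = 70 unless defined $width && $width > 0;    # assumed default
    my $text = '';
    for ( my $i = 0 ; $i < length($seq) ; $i += $width ) {
        $text .= substr( $seq, $i, $width ) . "\n";
    }
    return $text;
}

printf ">%s\n%s", 'seq1', wrap_sequence( 'ACGTACGTAC', 4 );
```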
Validate a sequence.
Legal symbols:
    DNA: ACGTRYSWKMBDHV
    RNA: ACGURYSWKMBDHV
    Protein: ACDEFGHIKLMNPQRSTVWY
    gap and space: - *.
    if (validate_sequence($seq)) {
        # do something
    }
Complement sequence
IUPAC nucleotide code: ACGTURYSWKMBDHVN
http://droog.gs.washington.edu/parc/images/iupac.html
    code    base           complement
    A       A              T
    C       C              G
    G       G              C
    T/U     T              A

    R       A/G            Y
    Y       C/T            R
    S       C/G            S
    W       A/T            W
    K       G/T            M
    M       A/C            K

    B       C/G/T          V
    D       A/G/T          H
    H       A/C/T          D
    V       A/C/G          B

    X/N     A/C/G/T        X
    .       not A/C/G/T
    or -    gap
my $comp = complement($seq);
Reverse complement sequence
my $revcom = revcom($seq);
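The complement table above maps naturally onto Perl's tr/// operator. The following is a minimal sketch under that assumption, not the module's own code; `complement_sketch` and `revcom_sketch` are hypothetical names, and symbols outside the table (such as N, X, "." and "-") are left unchanged.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of BioUtil::Seq.
# Each IUPAC code is replaced by its complement; case is preserved.
sub complement_sketch {
    my ($seq) = @_;
    ( my $comp = $seq ) =~
      tr/ACGTURYSWKMBDHVacgturyswkmbdhv/TGCAAYRSWMKVHDBtgcaayrswmkvhdb/;
    return $comp;
}

# Reverse complement: complement first, then reverse the string.
sub revcom_sketch {
    my ($seq) = @_;
    return scalar reverse complement_sketch($seq);
}

print complement_sketch('ACGTRYKMN'), "\n";
print revcom_sketch('AACG'),          "\n";
```

Note that U is complemented to A, but A is always complemented to T, so an RNA sequence would need its T characters converted back to U afterwards.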
my $gc_content = base_content('gc', $seq);
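Computing base content amounts to counting the requested symbols and dividing by the sequence length. The sketch below shows one way to do that; `base_fraction` is a hypothetical helper, and returning a fraction between 0 and 1 (rather than a percentage) is an assumption of this sketch, not a statement about what base_content returns.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioUtil::Seq: fraction of positions in
# $seq that are one of the characters in $bases (case-insensitive).
sub base_fraction {
    my ( $bases, $seq ) = @_;
    return 0 unless length($seq);
    # Count matches by assigning the match list to an empty list.
    my $count = () = $seq =~ /[\Q$bases\E]/gi;
    return $count / length($seq);
}

printf "GC content: %.2f\n", base_fraction( 'gc', 'ACGTGGCC' );
```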
Translate degenerate sequence to regular expression
Find all sites matching the regular expression.
See https://github.com/shenwei356/bio_scripts/blob/master/sequence/fasta_locate_motif.pl
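As an illustration of both steps, the sketch below expands each IUPAC code into a character class and then reports every non-overlapping match with 1-based start and end positions. The names `%IUPAC` and `degenerate_to_regexp_sketch` are hypothetical, and the argument order and return values of the module's own degenerate_seq_to_regexp and match_regexp are not assumed here.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes expanded to regular expression fragments.
my %IUPAC = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T', U => 'U',
    R => '[AG]',  Y => '[CT]',  S => '[CG]',  W => '[AT]',
    K => '[GT]',  M => '[AC]',
    B => '[CGT]', D => '[AGT]', H => '[ACT]', V => '[ACG]',
    N => '[ACGT]',
);

# Hypothetical helper, not part of BioUtil::Seq.
sub degenerate_to_regexp_sketch {
    my ($seq) = @_;
    my $pattern = '';
    for my $code ( split //, uc $seq ) {
        die "unknown IUPAC code: $code\n" unless exists $IUPAC{$code};
        $pattern .= $IUPAC{$code};
    }
    return qr/$pattern/i;    # case-insensitive compiled pattern
}

# Locate every non-overlapping site of the motif in a sequence.
my $re  = degenerate_to_regexp_sketch('GAYTC');
my $dna = 'ttGACTCaaGATTCgg';
while ( $dna =~ /$re/g ) {
    my ( $start, $end ) = ( $-[0], $+[0] );    # 0-based start, exclusive end
    printf "%d-%d %s\n", $start + 1, $end,
      substr( $dna, $start, $end - $start );
}
```

A plain /g loop skips overlapping sites; wrapping the pattern in a lookahead is one way to report those as well.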
Translate DNA sequence into a peptide
Translate a DNA 3-character codon to an amino acid
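Translation comes down to a codon lookup table applied to consecutive triplets. The sketch below builds the standard genetic code from its usual TCAG-ordered amino acid string; `%codon2aa_sketch` and `translate_sketch` are hypothetical names, and the choices made here (standard code only, reading frame 1, "X" for unrecognized codons, an incomplete trailing codon ignored) are assumptions of the sketch rather than documented behavior of dna2peptide or codon2aa.

```perl
use strict;
use warnings;

# Standard genetic code, amino acids listed for codons in TCAG order.
my @bases = qw(T C A G);
my $aas =
  'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';

my %codon2aa_sketch;
my $i = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $codon2aa_sketch{"$b1$b2$b3"} = substr( $aas, $i++, 1 );
        }
    }
}

# Hypothetical helper, not part of BioUtil::Seq: translates frame 1.
sub translate_sketch {
    my ($dna) = @_;
    $dna = uc $dna;
    $dna =~ tr/U/T/;    # accept RNA input as well
    my $peptide = '';
    for ( my $pos = 0 ; $pos + 3 <= length($dna) ; $pos += 3 ) {
        my $codon = substr( $dna, $pos, 3 );
        $peptide .=
          exists $codon2aa_sketch{$codon} ? $codon2aa_sketch{$codon} : 'X';
    }
    return $peptide;
}

print translate_sketch('ATGGCCAAGTAA'), "\n";
```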
my @alphabet = qw/a c g t/; my $seq = generate_random_seqence( \@alphabet, 50 );
shuffle_sequences($file, "$file.shuf.fa");
Rename fasta header with regexp.
    # delete some symbols
    my $n = rename_fasta_header('[^a-z\d\s\-\_\(\)\[\]\|]', '', $file, "$file.rename.fa");
    print "$n records renamed\n";
Replace the given symbols with a replacement string, because some symbols in a fasta header can cause unexpected results.
    my $file = "test.fa";
    my $n = clean_fasta_header($file, "$file.rename.fa");
    # replace any symbol in (\/:*?"<>|) with '', i.e. deleting.
    # my $n = clean_fasta_header($file, "$file.rename.fa", '', '\/:*?"<>|');
    print "$n records renamed\n";
To install BioUtil, copy and paste the appropriate command into your terminal.
cpanm
cpanm BioUtil
CPAN shell
    perl -MCPAN -e shell
    install BioUtil
For more information on module installation, please visit the detailed CPAN module installation guide.