Bio::Cigar - Parse CIGAR strings and translate coordinates to/from reference/query
    use 5.014;
    use Bio::Cigar;

    my $cigar = Bio::Cigar->new("2M1D1M1I4M");
    say "Query length is ", $cigar->query_length;
    say "Reference length is ", $cigar->reference_length;

    my ($qpos, $op) = $cigar->rpos_to_qpos(3);
    say "Alignment operation at reference position 3 is $op";

    my $query = "GCAAATGC";
    my $ref   = "AAAAGCAAATGC";
    my $aln   = $cigar->align($query, $ref, 5);  # align query to pos 5 of ref
    say foreach @$aln;
Bio::Cigar is a small library to parse CIGAR strings ("Compact Idiosyncratic Gapped Alignment Report"), such as those used in the SAM file format. A CIGAR string is a run-length encoding that minimally describes the alignment of a query sequence to an (often longer) reference sequence.
Parsing follows the SAM v1 spec for the CIGAR column.
Parsed strings are represented by an object that provides a few utility methods.
All attributes are read-only.
The CIGAR string for this object.
The length of the reference sequence segment aligned with the query sequence described by the CIGAR string.
The length of the query sequence described by the CIGAR string.
An arrayref of [length, operation] tuples describing the CIGAR string. Lengths are integers, possible operations are below.
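The tuple representation and the two length attributes can be illustrated without the module. The following is a minimal sketch, assuming that query length is the sum of the M/I/S/=/X operations and reference length the sum of the M/D/N/=/X operations, as in the SAM spec; `parse_cigar` and `cigar_length` are hypothetical helper names, not part of Bio::Cigar.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bio::Cigar: split a CIGAR string into
# [length, operation] tuples and reject anything that does not parse fully.
sub parse_cigar {
    my ($cigar) = @_;
    my @ops;
    while ($cigar =~ /\G(\d+)([MIDNSHP=X])/gc) {
        push @ops, [ $1 + 0, $2 ];
    }
    die "Invalid CIGAR string: $cigar\n"
        unless @ops and pos($cigar) == length $cigar;
    return \@ops;
}

# Hypothetical helper: sum the lengths of the operations listed in
# $consumers, e.g. "MIS=X" for the query or "MDN=X" for the reference.
sub cigar_length {
    my ($ops, $consumers) = @_;
    my $total = 0;
    $total += $_->[0] for grep { index($consumers, $_->[1]) >= 0 } @$ops;
    return $total;
}

my $ops = parse_cigar("2M1D1M1I4M");
print join(" ", map { "$_->[0]$_->[1]" } @$ops), "\n";
print "query length: ",     cigar_length($ops, "MIS=X"), "\n";
print "reference length: ", cigar_length($ops, "MDN=X"), "\n";
```

For the string used in the synopsis, both lengths come out as 8, because the single deletion and the single insertion cancel each other out.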
The CIGAR operations are given in the following table, taken from the SAM v1 spec:
    Op  Description
    ‾‾‾‾‾‾‾‾‾‾‾‾‾‾‾
    M   alignment match (can be a sequence match or mismatch)
    I   insertion to the reference
    D   deletion from the reference
    N   skipped region from the reference
    S   soft clipping (clipped sequences present in SEQ)
    H   hard clipping (clipped sequences NOT present in SEQ)
    P   padding (silent deletion from padded reference)
    =   sequence match
    X   sequence mismatch

    • H can only be present as the first and/or last operation.
    • S may only have H operations between them and the ends of the string.
    • For mRNA-to-genome alignment, an N operation represents an intron. For other types of alignments, the interpretation of N is not defined.
    • Sum of the lengths of the M/I/S/=/X operations shall equal the length of SEQ.
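The two placement rules for H and S can be expressed as a single pattern. This is only a sketch: `clipping_ok` is a hypothetical name, the pattern additionally assumes at least one non-clipping operation, and whether Bio::Cigar enforces these rules at all is not stated here.

```perl
use strict;
use warnings;

# Hypothetical validator (an assumption for illustration; the module's own
# checks may differ): optional hard clip, optional soft clip, one or more
# other operations, then optional soft clip and optional hard clip.
my $CIGAR_RE = qr/\A(?:\d+H)?(?:\d+S)?(?:\d+[MIDNP=X])+(?:\d+S)?(?:\d+H)?\z/;

sub clipping_ok {
    my ($cigar) = @_;
    return $cigar =~ $CIGAR_RE ? 1 : 0;
}

for my $cigar (qw(2H3S10M1S 3S2H10M 5M2H3M)) {
    printf "%-10s %s\n", $cigar,
        clipping_ok($cigar) ? "ok" : "misplaced clipping";
}
```

Only the first of the three strings passes: the second has a hard clip inside a soft clip, and the third has a hard clip in the middle of the alignment.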
Takes a CIGAR string as the sole argument and returns a new Bio::Cigar object.
Takes a reference position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the query sequence. Indels affect how the numbering maps from reference to query.
In list context returns a tuple of [query position, operation at position]. Operation is a single-character string. See the table of CIGAR operations.
If the reference position does not map to the query sequence (as with a deletion, for example), returns undef in scalar context or [undef, operation] in list context.
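How indels shift the numbering is easiest to see by walking the operations while counting positions on both sequences. The sketch below is a hypothetical re-implementation for illustration (`ref_to_query` is not a Bio::Cigar method), and the module may treat edge cases such as clipping or padding differently.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only. Returns the list
# (query position, operation); the query position is undef when the
# reference position has no counterpart on the query.
sub ref_to_query {
    my ($cigar, $target) = @_;
    return if $target < 1;
    my ($rpos, $qpos) = (0, 0);    # last positions consumed so far (origin 1)
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        my ($len, $op) = ($1, $2);
        if ($op =~ /[M=X]/) {          # consumes reference and query
            return ($qpos + $target - $rpos, $op) if $target <= $rpos + $len;
            $rpos += $len;
            $qpos += $len;
        }
        elsif ($op =~ /[DN]/) {        # consumes reference only
            return (undef, $op) if $target <= $rpos + $len;
            $rpos += $len;
        }
        elsif ($op =~ /[IS]/) {        # consumes query only
            $qpos += $len;
        }
        # H and P consume neither sequence
    }
    return;    # position lies beyond the aligned reference segment
}

for my $r (3, 4, 5) {
    my ($q, $op) = ref_to_query("2M1D1M1I4M", $r);
    printf "ref %d -> query %s (%s)\n", $r, $q // "undef", $op;
}
```

With this string, reference position 3 falls in the deletion and has no query position, while the insertion that follows brings the two numberings back in step from position 5 onwards.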
Takes a query position (origin 1, base-numbered) and returns the corresponding position (origin 1, base-numbered) on the reference sequence. Indels affect how the numbering maps from query to reference.
In list context returns a tuple of [reference position, operation at position]. Operation is a single-character string. See the table of CIGAR operations.
If the query position does not map to the reference sequence (as with an insertion, for example), returns undef in scalar context or [undef, operation] in list context.
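The opposite direction can also be sketched as a lookup table built once per CIGAR string, which is convenient when many positions are translated. `query_map` is a hypothetical helper, not the module's implementation, and it assumes that soft-clipped bases count as query positions without a reference position.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: build a table indexed by query
# position whose entries are [reference position or undef, operation].
sub query_map {
    my ($cigar) = @_;
    my @map = (undef);    # index 0 unused so that positions are origin 1
    my $rpos = 0;
    while ($cigar =~ /(\d+)([MIDNSHP=X])/g) {
        my ($len, $op) = ($1, $2);
        if    ($op =~ /[M=X]/) { push @map, [ ++$rpos, $op ] for 1 .. $len }
        elsif ($op =~ /[IS]/)  { push @map, [ undef,   $op ] for 1 .. $len }
        elsif ($op =~ /[DN]/)  { $rpos += $len }
        # H and P consume neither sequence
    }
    return \@map;
}

my $map = query_map("2M1D1M1I4M");
for my $q (3, 4, 5) {
    my ($r, $op) = @{ $map->[$q] };
    printf "query %d -> ref %s (%s)\n", $q, $r // "undef", $op;
}
```

Here query position 4 is the inserted base, so it has no reference position, and query position 3 maps to reference position 4 because of the preceding deletion.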
Takes a reference position and returns the operation at that position. Simply a shortcut for calling "rpos_to_qpos" in list context and discarding the first return value.
Takes a query position and returns the operation at that position. Simply a shortcut for calling "qpos_to_rpos" in list context and discarding the first return value.
Returns a new Bio::Cigar object with a CIGAR string that's the reverse of this one, i.e. the last operation becomes the first, the second-to-last the second, etc. until the first operation becomes the last.
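Reversal works on whole operations, so each length stays attached to its operation letter. A minimal sketch on plain strings, with `reverse_cigar` as a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper, shown for illustration: reverse the order of the
# operations while keeping each length attached to its operation.
sub reverse_cigar {
    my ($cigar) = @_;
    return join "", reverse $cigar =~ /\d+[MIDNSHP=X]/g;
}

print reverse_cigar("2M1D1M1I4M"), "\n";    # prints 4M1I1M1D2M
```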
Takes a query sequence and a reference sequence and aligns them according to the CIGAR string, using gap characters (-) for indels and spaces for soft clipping. This is pure string manipulation and as such the match and mismatch operators (= and X) are assumed to be correct for the given input sequences and not verified. Returns an array ref of [query seq, ref seq].
Optionally, the leftmost reference position (origin 1) can be passed, i.e. the query is aligned starting at that position.
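The string manipulation involved can be sketched as follows. This is a simplified illustration, not the module's align method: `gapped_pair` is a hypothetical helper, it assumes the CIGAR string contains only M, =, X, I and D, and it returns only the aligned segment of the reference rather than the whole reference.

```perl
use strict;
use warnings;

# Hypothetical sketch of gap insertion. $start is the leftmost reference
# position (origin 1) of the alignment and defaults to 1.
sub gapped_pair {
    my ($cigar, $query, $ref, $start) = @_;
    my ($qi, $ri) = (0, ($start // 1) - 1);    # 0-based string offsets
    my ($qaln, $raln) = ("", "");
    while ($cigar =~ /(\d+)([M=XID])/g) {
        my ($len, $op) = ($1, $2);
        if ($op eq 'I') {          # bases present only in the query
            $qaln .= substr($query, $qi, $len);
            $raln .= "-" x $len;
            $qi += $len;
        }
        elsif ($op eq 'D') {       # bases present only in the reference
            $qaln .= "-" x $len;
            $raln .= substr($ref, $ri, $len);
            $ri += $len;
        }
        else {                     # M, = or X: one base from each sequence
            $qaln .= substr($query, $qi, $len);
            $raln .= substr($ref, $ri, $len);
            $qi += $len;
            $ri += $len;
        }
    }
    return [ $qaln, $raln ];
}

print "$_\n" for @{ gapped_pair("2M1D1M1I4M", "GCAAATGC", "AAAAGCAAATGC", 5) };
```

As in the method described above, nothing checks that bases under an = or X operation actually match or mismatch; the sequences are simply sliced according to the lengths.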
When $reversed is given a true value, the reverse complement of the passed query sequence is used to generate the alignment. Only the IUPAC nucleotide codes ATCGU are currently supported for reverse complementation.
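Reverse complementation restricted to these five codes can be sketched with a transliteration. `revcomp` is a hypothetical helper; how the module itself handles U or lower-case input is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration. U is complemented to A, and A is
# always complemented to T, so RNA input comes back as DNA. Any other
# character is left unchanged.
sub revcomp {
    my ($seq) = @_;
    my $rc = reverse $seq;               # scalar context reverses the string
    $rc =~ tr/ACGTUacgtu/TGCAAtgcaa/;    # complement each base in place
    return $rc;
}

print revcomp("GATTC"), "\n";    # prints GAATC
```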
Thomas Sibley <trsibley@uw.edu>
Felix Kühnl <felix@bioinf.uni-leipzig.de>
Copyright 2014- Mullins Lab, Department of Microbiology, University of Washington.
This library is free software; you can redistribute it and/or modify it under the GNU General Public License, version 2.
SAMv1 spec