Algorithm::NeedlemanWunsch - sequence alignment with configurable scoring
Version 0.02
use Algorithm::NeedlemanWunsch; sub score_sub { if (!@_) { return -2; # gap penalty } return ($_[0] eq $_[1]) ? 1 : -1; } my $matcher = Algorithm::NeedlemanWunsch->new(\&score_sub); my $score = $matcher->align( \@a, \@b, { align => \&on_align, shift_a => \&on_shift_a, shift_b => \&on_shift_b, select_align => \&on_select_align });
Sequence alignment is a way to find commonalities in two (or more) similar sequences or strings of some items or characters. Standard motivating example is the comparison of DNA sequences and their functional and evolutionary similarities and differences, but the problem has much wider applicability - for example finding the longest common subsequence (that is, diff
) is a special case of sequence alignment.
Conceptually, sequence alignment works by scoring all possible alignments and choosing the alignment with maximal score. For example, sequences a t c t
and t g a t
may be aligned
sequence A: a t c - t | | | sequence B: - t g a t
or
sequence A: - - a t c t | | sequence B: t g a t - -
(and exponentially many other ways, of course). Note that Needleman-Wunsch considers global alignments, over the entire length of both sequences; each item is either aligned with an item of the other sequence, or corresponds to a gap (which is always aligned with an item - aligning two gaps wouldn't help anything). This approach is especially suitable for comparing sequences of comparable length and somewhat similar along their whole lengths - that is, without long stretches that have nothing to do with each other. If your sequences don't satisfy these requirements, consider using local alignment, which, strictly speaking, isn't Needleman-Wunsch, but is similar enough to be implemented in this module as well - see below for details.
In the example above, the second alignment has more gaps than the first, but perhaps your a's are structurally important and you like them lined up so much that you'd still prefer the second alignment. Conversely, if c is "almost the same" as g, it might be the first alignment that matches better. Needleman-Wunsch formalizes such considerations into a similarity matrix, assigning payoffs to each (ordered, but the matrix is normally symmetrical so that the order doesn't matter) pair of possible sequence items, plus a gap penalty, quantifying the desirability of a gap in a sequence. A preference of pairings over gaps is expressed by a low (relative to the similarity matrix values, normally negative) gap penalty.
The alignment score is then defined as the sum, over the positions where at least one sequence has an item, of the similarity matrix values indexed by the first and second item (when both are defined) and gap penalties (for items aligned with a gap). For example, if S
is the similarity matrix and g
denotes the gap penalty, the alignment
sequence A: a a t t c c sequence B: a - - - t c
has score S[a, a] + 3 * g + S[c, t] + S[c, c]
.
When the gap penalty is 0 and the similarity an identity matrix, i.e. assigning 1 to every match and 0 to every mismatch, Needleman-Wunsch reduces to finding the longest common subsequence.
The algorithm for maximizing the score is a standard application of dynamic programming, computing the optimal alignment score of empty and 1-item sequences and building it up until the whole input sequences are taken into consideration. Once the optimal score is known, the algorithm traces back to find the gap positions. Note that while the maximal score is obviously unique, the alignment having it in general isn't; this module's interface allows the calling application to choose between different optimal alignments.
The constructor. Takes one mandatory argument, which is a coderef to a sub implementing the similarity matrix, plus an optional gap penalty argument. If the gap penalty isn't specified as a constructor argument, the Algorithm::NeedlemanWunsch
object gets it by calling the scoring sub without arguments; apart from that case, the sub is called with 2 arguments, which are items from the first and second sequence, respectively, passed to Algorithm::NeedlemanWunsch::align
. Note that the sub must be pure, i.e. always return the same value when called with the same arguments.
The core of the algorithm. Creates a bottom-up dynamic programming matrix, fills it with alignment scores and then traces back to find an optimal alignment, informing the application about its items by invoking the callbacks passed to the method.
The first 2 arguments of align
are array references to the aligned sequences, the third a hash reference with user-supplied callbacks. The callbacks are identified by the hash keys, which are as follows:
Aligns two sequence items. The callback is called with 2 arguments, which are the positions of the paired items in \@a
and \@b
, respectively.
Aligns an item of the first sequence with a gap in the second sequence. The callback is called with 1 argument, which is the position of the item in \@a
.
Aligns a gap in the first sequence with an item of the second sequence. The callback is called with 1 argument, which is the position of the item in \@b
.
Called when there's more than one way to construct the optimal alignment, with 1 argument which is a hashref enumerating the possibilities. The hash may contain the following keys:
If this key exists, the optimal alignment may align two sequence items. The key's value is an arrayref with the positions of the paired items in \@a
and \@b
, respectively.
If this key exists, the optimal alignment may align an item of the first sequence with a gap in the second sequence. The key's value is the position of the item in \@a
.
If this key exists, the optimal alignment may align a gap in the first sequence with an item of the second sequence. The key's value is the position of the item in \@b
.
All keys are optional, but the hash will always have at least one. The callback must select one of the possibilities by returning one of the keys.
All callbacks are optional. When there is just one way to make the optimal alignment, the Algorithm::NeedlemanWunsch
object prefers calling the specific callbacks, but will call select_align
if it's defined and the specific callback isn't.
Note that the passed positions move backwards, from the sequence ends to zero - if you're building the alignment in your callbacks, add items to the front.
In addition to the standard Needleman-Wunsch algorithm, this module also implements two popular extensions: local alignment and affine block gap penalties. Use of both extensions is controlled by setting the properties of Algorithm::NeedlemanWunsch
object described below.
When this flag is set before calling Algorithm::NeedlemanWunsch::align
, the alignment scoring doesn't charge the gap penalty for gaps at the beginning (i.e. before the first item) and end (after the last item) of the second sequence passed to align
, so that for example the optimal (with identity matrix as similarity matrix and a negative gap penalty) alignment of a b c d e f g h
and b c h
becomes
sequence A: a b c d e f g h | | sequence B: b c h
instead of the global
sequence A: a b c d e f g h | | | sequence B: - b c - - - - h
Note that local alignment is asymmetrical - when using it, the longer sequence should be the first passed to Algorithm::NeedlemanWunsch::align
.
Using the same gap penalty for every gap has the advantage of simplicity, but some applications may want a more complicated approach. Biologists, for example, looking for gaps longer than one DNA sequence base, typically want to distinguish a gap opening (costly) from more missing items following it (shouldn't cost so much). That requirement can be modelled by charging 2 gap penalties: gap_open_penalty
for the first gap, and then gap_extend_penalty
for each consecutive gap on the same sequence.
Note that you must set both these properties if you set any of them and that gap_open_penalty
must be less than gap_extend_penalty
(if you know of a use case where the gap opening penalty should be preferred to gap extension, let me know). With such penalties set before calling Algorithm::NeedlemanWunsch::align
, sequences A T G T A G T G T A T A G T A C A T G C A
and A T G T A G T A C A T G C A
are aligned
sequence A: A T G T A G T G T A T A G T A C A T G C A | | | | | | | | | | | | | | sequence B: A T G - - - - - - - T A G T A C A T G C A
i.e. with all gaps bunched together.
Algorithm::Diff, Text::WagnerFischer
Vaclav Barta, <vbar@comp.cz>
Please report any bugs or feature requests to bug-algorithm-needlemanwunsch at rt.cpan.org
, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Algorithm-NeedlemanWunsch. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
Copyright 2007 Vaclav Barta, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
The algorithm is defined by Saul Needleman and Christian Wunsch in "A general method applicable to the search for similarities in the amino acid sequence of two proteins", J Mol Biol. 48(3):443-53.
This implementation is based mostly on http://www.ludwig.edu.au/course/lectures2005/Likic.pdf, local alignment is from http://www.techfak.uni-bielefeld.de/bcd/Curric/PrwAli/node6.html.