Bryan Jurish > PDL-EditDistance-0.06 > PDL::EditDistance

PDL-EditDistance-0.06.tar.gz

Dependencies

Annotate this POD

# CPAN RT

 Open 1
View/Report Bugs
Module Version: 0.06

# NAME

PDL::EditDistance - Wagner-Fischer edit distance and alignment for PDLs.

# SYNOPSIS

``` use PDL;
use PDL::EditDistance;

##-- input PDLs
\$a = pdl([map { ord(\$_) } qw(G U M B O)]);
\$b = pdl([map { ord(\$_) } qw(G A M B O L)]);

\$a1 = pdl([0, map { ord(\$_) } qw(G U M B O)]);
\$b1 = pdl([0, map { ord(\$_) } qw(G A M B O L)]);

##-------------------------------------------------------------
## Levenshtein distance
\$dist          = edit_distance_static(\$a,\$b, 0,1,1,1);
(\$dist,\$align) = edit_align_static(\$a,\$b, 0,1,1,1);

##-------------------------------------------------------------
## Wagner-Fischer distance
@costs         = (\$costMatch=0,\$costInsert=1,\$costDelete=1,\$costSubstitute=2);
\$dist          = edit_distance_static(\$a,\$b, @costs);
(\$dist,\$align) = edit_align_static(\$a,\$b, @costs);

##-------------------------------------------------------------
## General edit distance
\$costsMatch = random(\$a->nelem+1, \$b->nelem+1);
\$costsIns   = random(\$a->nelem+1, \$b->nelem+1);
\$costsDel   = random(\$a->nelem+1, \$b->nelem+1);
\$costsSubst = random(\$a->nelem+1, \$b->nelem+1);
@costs         = (\$costsMatch,\$costsIns,\$costDel,\$costsSubst);
\$dist          = edit_distance_full(\$a,\$b,@costs);
(\$dist,\$align) = edit_align_full(\$a,\$b,@costs);

##-------------------------------------------------------------
## Alignment
\$op_match = align_op_match();      ##-- constant
\$op_del   = align_op_insert1();    ##-- constant
\$op_ins   = align_op_insert2();    ##-- constant
\$op_subst = align_op_substitute(); ##-- constant

(\$apath,\$bpath,\$pathlen) = edit_bestpath(\$align);
(\$ai,\$bi,\$ops,\$pathlen)  = edit_pathtrace(\$align);

##-------------------------------------------------------------
## Longest Common Subsequence
\$lcs = edit_lcs(\$a,\$b);
(\$ai,\$bi,\$lcslen) = lcs_backtrace(\$a,\$b,\$lcs);```

# FUNCTIONS

## _edit_pdl

`  Signature: (a(N); [o]apdl(N+1))`

Convenience method. Returns a pdl \$apdl() suitable for representing \$a(), which can be specified as a string, arrays of numbers, or as a PDL. \$apdl(0) is always set to zero.

## edit_costs

```  Signature: (PDL::Type type; int N; int M;
[o]costsMatch(N+1,M+1); [o]costsIns(N+1,M+1); [o]costsDel(N+1,M+1); [o]costsSubst(N+1,M+1))```

Convenience method. Ensures existence and proper dimensionality of cost matrices for inputs of length N and M.

## _edit_costs

```  Signature: (PDL::Type type; int N1; int M1;
[o]costsMatch(N1,M1); [o]costsIns(N1,M1); [o]costsDel(N1,M1); [o]costsSubst(N1,M1))```

Low-level method. Ensures existence and proper dimensionality of cost matrices for inputs of length N1-1 and M1-1.

## edit_costs_static

```  Signature: (PDL::Type type; int N; int M;
staticCostMatch(); staticCostIns(); staticCostSubst();
[o]costsMatch(N+1,M+1); [o]costsIns(N+1,M+1); [o]costsDel(N+1,M+1); [o]costsSubst(N+1,M+1))```

Convenience method.

## edit_distance_full

```  Signature: (a(N); b(M);
costsMatch(N+1,M+1); costsIns(N+1,M+1); costsDel(N+1,M+1); costsSubst(N+1,M+1);
[o]dist(N+1,M+1); [o]align(N+1,M+1))```

Convenience method. Compute the edit distance matrix for inputs \$a() and \$b(), and cost matrices \$costsMatch(), \$costsIns(), \$costsDel(), and \$costsSubst(). \$a() and \$b() may be specified as PDLs, arrays of numbers, or as strings.

## _edit_distance_full

`  Signature: (a1(N1); b1(M1); costsMatch(N1,M1); costsIns(N1,M1); costsDel(N1,M1); costsSubst(N1,M1); [o]dist(N1,M1))`

Low-level method. Compute the edit distance matrix for input PDLs \$a1() and \$b1() and cost matrices \$costsMatch(), \$costsIns(), \$costsDel(), and \$costsSubst().

The first elements of \$a1() and \$b1() are ignored.

_edit_distance_full does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## edit_align_full

```  Signature: (a(N); b(M);
costsMatch(N+1,M+1); costsIns(N+1,M+1); costsDel(N+1,N+1); costsSubst(N+1,M+1);
[o]dist(N+1,M+1); [o]align(N+1,M+1))```

Convenience method. Compute the edit distance and alignment matrices for inputs \$a() and \$b(), and cost matrices \$costsMatch(), \$costsIns(), \$costsDel(), and \$costsSubst(). \$a() and \$b() may be specified as PDLs, arrays of numbers, or as strings.

## _edit_align_full

`  Signature: (a1(N1); b1(M1); costsMatch(N1,M1); costsIns(N1,M1); costsDel(N1,M1); costsSubst(N1,M1); [o]dist(N1,M1); byte [o]align(N1,M1))`

Low-level method. Compute the edit distance and alignment matrix for input PDLs \$a1() and \$b1() and cost matrices \$costsMatch(), \$costsIns(), \$costsDel(), and \$costsSubst().

The first elements of \$a1() and \$b1() are ignored.

_edit_align_full does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## edit_distance_static

```  Signature: (a(N); b(M);
staticCostMatch(); staticCostIns(); staticCostDel(); staticCostSubst();
[o]dist(N+1,M+1))```

Convenience method. Compute the edit distance matrix for inputs \$a() and \$b() given a static cost schema @costs = (\$staticCostMatch(), \$staticCostIns(), \$staticCostDel(), and \$staticCostSubst()). \$a() and \$b() may be specified as PDLs, arrays of numbers, or as strings. Functionally equivalent to edit_distance_full(\$matches,@costs,\$dist), but slightly faster.

## _edit_distance_static

`  Signature: (a1(N1); b1(M1); costMatch(); costIns(); costDel(); costSubst(); [o]dist(N1,M1))`

Low-level method. Compute the edit distance matrix for input PDLs \$a1() and \$b1() given a static cost schema @costs = (\$costMatch(), \$costIns(), \$costDel(), \$costSubst()). Functionally identitical to _edit_distance_matrix_full(\$matches,@costs,\$dist), but slightly faster.

The first elements of \$a1() and \$b1() are ignored.

_edit_distance_static does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## edit_align_static

```  Signature: (a(N); b(M);
staticCostMatch(); staticCostIns(); staticCostDel(); staticCostSubst();
[o]dist(N+1,M+1); [o]align(N+1,M+1))```

Convenience method. Compute the edit distance and alignment matrices for inputs \$a() and \$b() given a static cost schema @costs = (\$staticCostMatch(), \$staticCostIns(), \$staticCostDel(), and \$staticCostSubst()). \$a() and \$b() may be specified as PDLs, arrays of numbers, or as strings. Functionally equivalent to edit_align_full(\$matches,@costs,\$dist), but slightly faster.

## _edit_align_static

`  Signature: (a1(N1); b1(M1); costMatch(); costIns(); costDel(); costSubst(); [o]dist(N1,M1); byte [o]align(N1,M1))`

Low-level method. Compute the edit distance and alignment matrices for input PDLs \$a1() and \$b1() given a static cost schema @costs = (\$costMatch(), \$costIns(), \$costDel(), \$costSubst()). Functionally identitical to _edit_distance_matrix_full(\$matches,@costs,\$dist), but slightly faster.

The first elements of \$a1() and \$b1() are ignored.

_edit_align_static does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## align_op_insert1

`  Signature: ([o]a())`

Alignment matrix value constant for insertion operations on \$a() string.

align_op_insert1 does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## align_op_insert2

`  Signature: ([o]a())`

Alignment matrix value constant for insertion operations on \$a() string.

align_op_insert2 does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## align_op_match

`  Signature: ([o]a())`

Alignment matrix value constant for matches.

align_op_match does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## align_op_substitute

`  Signature: ([o]a())`

Alignment matrix value constant for substitution operations.

align_op_substitute does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## align_op_delete

Alias for align_op_insert1()

## align_op_insert

Alias for align_op_insert2()

## align_ops

`  Signature: ([o]ops(4))`

Alignment matrix value constants 4-element pdl (match,insert1,insert2,substitute).a

## edit_bestpath

`  Signature: (align(N+1,M+1); [o]apath(N+M+2); [o]bpath(N+M+2); [o]pathlen())`

Convenience method. Compute best path through alignment matrix \$align(). Stores paths for original input strings \$a() and \$b() in \$apath() and \$bpath() respectively. Negative values in \$apath() and \$bpath() indicate insertion/deletion operations. On completion, \$pathlen() holds the actual length of the paths.

## _edit_bestpath

`  Signature: (align(N1,M1); int [o]apath(L); int [o]bpath(L); int [o]len(); int ifinal; int jfinal)`

Low-level method. Compute best path through alignment matrix \$align() from final index (\$ifinal,\$jfinal). Stores paths for (original) input strings \$a() and \$b() in \$apath() and \$bpath() respectively. Negative values in \$apath() and \$bpath() indicate insertion/deletion operations. On completion, \$pathlen() holds the actual length of the paths.

_edit_bestpath does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## edit_pathtrace

`  Signature: ( align(N+1,M+1); [o]ai(L); [o]bi(L); [o]ops(L); [o]\$pathlen() )`

Convenience method. Compute alignment path backtrace through alignment matrix \$align() from final index (\$ifinal,\$jfinal). Stores raw paths for (original) input strings \$a() and \$b() in \$ai() and \$bi() respectively. Unlike edit_bestpath(), null-moves for \$ai() and \$bi() are not stored here as negative values. Returned pdls (\$ai,\$bi,\$ops) are trimmed to the appropriate path length.

## _edit_pathtrace

`  Signature: (align(N1,M1); int [o]ai(L); int [o]bi(L); int [o]ops(L); int [o]len(); int ifinal; int jfinal)`

Low-level method. Compute alignment path backtrace through alignment matrix \$align() from final index (\$ifinal,\$jfinal). Stores raw paths for (original) input strings \$a() and \$b() in \$ai() and \$bi() respectively. Unlike edit_bestpath(), null-moves for \$ai() and \$bi() are not stored here as negative values. Returned pdls (\$ai,\$bi,\$ops) are trimmed to the appropriate path length.

_edit_pathtrace does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## edit_lcs

`  Signature: (a(N); b(M); int [o]lcs(N+1,M+1);)`

Convenience method. Compute the longest common subsequence (LCS) matrix for input PDLs \$a1() and \$b1(). The output matrix \$lcs() contains at cell (\$i+1,\$j+1) the length of the LCS between \$a1(0..\$i) and \$b1(0..\$j); thus \$lcs(\$N,\$M) contains the length of the LCS between \$a() and \$b().

## _edit_lcs

`  Signature: (a1(N1); b1(M1); int [o]lcs(N1,M1))`

Low-level method. Compute the longest common subsequence (LCS) matrix for input PDLs \$a1() and \$b1(). The initial (zeroth) elements of \$a1() and \$b1() are ignored. The output matrix \$lcs() contains at cell (\$i,\$j) the length of the LCS between \$a1(1..\$i) and \$b1(1..\$j); thus \$lcs(\$N1-1,\$M1-1) contains the length of the LCS between \$a1() and \$b1().

_edit_lcs does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

## lcs_backtrace

`  Signature: (a(N); b(M); int lcs(N+1,M+1); int ifinal(); int jfinal(); int [o]ai(L); int [o]bi(L); int [o]len())`

Convenience method. Compute longest-common-subsequence backtrace through LCS matrix \$lcs() for original input strings (\$a(),\$b()) from final index (\$ifinal,\$jfinal). Stores raw paths for (original) input strings \$a() and \$b() in \$ai() and \$bi() respectively.

## _lcs_backtrace

`  Signature: (a1(N1); b1(M1); int lcs(N1,M1); int ifinal(); int jfinal(); [o]ai(L); [o]bi(L); int [o]len())`

Low-level method. Compute longest-common-subsequence backtrace through LCS matrix \$lcs() for initial-padded strings (\$a1(),\$b1()) from final index (\$ifinal,\$jfinal). Stores raw paths for (original) input strings \$a() and \$b() in \$ai() and \$bi() respectively.

_lcs_backtrace does not process bad values. It will set the bad-value flag of all output piddles if the flag is set for any of the input piddles.

# ACKNOWLEDGEMENTS

Perl by Larry Wall.

PDL by Karl Glazebrook, Tuomas J. Lukka, Christian Soeller, and others.

Probably many.

# AUTHOR

Bryan Jurish <moocow@cpan.org>