The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
$Config{perladmin} = "Andrew.Torda@anu.edu.au";
=pod

=head1 NAME

wurst - Wurst is a Useful, Reliable Structural Tool

=head1 SYNOPSIS

B<perl> I<script_file>

=head1 DESCRIPTION

    use Wurst;

or, depending on the directory you are running from,

    use FindBin;
    use lib "$FindBin::Bin/../src/Wurst/blib/arch";
    use lib "$FindBin::Bin/../src/Wurst/blib/lib";
    use Wurst;

=head1 INTRODUCTION

There is no monolithic program, wurst. Instead, this is a
framework which works mostly with proteins and lets one do things
like sequence alignments, sequence to structure alignments,
and pure structural alignments.

The philosophy is that there is a set of low level routines
coded in C. These are called from perl. To do anything useful,
you write a little perl script which calls the functions you want.

Scripts have been written to do things such as parameter
optimization using a simplex or protein structure-based phylogeny using
a guide tree-based method.

=head1 DATA TYPES

We define certain kinds of pieces of data. In general, the
interpreter calls a function which returns a pointer to some
piece of data with a label. This may be passed into other
function calls. When the last reference to the piece of data
disappears, the object will be cleared up.

For example, C<seq_read()> returns something of type
SeqPtr. This can then be passed to other functions like
C<make_model()>.

The scheme is such that we have some kind of type
checking.
For example, make_model has an interface like

 make_model ( p, seq, coord)

and the interpreter will only make the call if the first
argument is a Pair_set, the second a sequence and the third
some coordinates.

The set of data types includes:

=over

=item Aa_clssfcn

=item Clssfcn

=item Coord

=item FXParam

=item Pair_set

=item Param

=item Prob_vec

=item Sec_s_data

=item Seq

=item Seq_array

=item Sub_mat

=item Score_mat

=item Scor_set

=item Seqprof

=back

=head1 PROBABILITY VECTOR NOTES

This is a special section to point one in the right direction to
read about the functions which create probability vectors. These
are described in the full function list, but they can
be hard to find, so they are gathered together here.

=over

=item seq_2_prob_vec

Create a probability vector from a sequence structure, using a
classification that is purely sequence based.

=item struct_2_prob_vec

Create a probability vector from a structure, using a
classification that is purely sequence based.

=item aa_strct_2_prob_vec

Create a probability vector from a structure, but using a
classification based on sequence and structure information.
Use both the sequence and structure data.

=item aa_2_prob_vec SEQ CLSSFCN

=item aa_2_prob_vec SEQ CLSSFCN NORM

Create a probability vector from a sequence SEQ, but using a
classification CLSSFCN based on sequence and structure information. This
is the same classification as above in
C<aa_strct_2_prob_vec>. The third argument can be used to
turn off normalising of vectors by passing a value of S<0
(zero)>. By default, S<NORM = 1>.

=back

=head1 FUNCTIONS

=over

=item ac_nclass AA_CLSSFCN

B<AA_CLSSFCN> is an amino acid fragment classification. Return
the number of classes in the classification.

=item ac_dump AA_CLSSFCN

B<AA_CLSSFCN> is an amino acid fragment classfication.
Print some information to stdout about the classication.
This is useful when checking if a classification has been read
properly. 

=item ac_nclass AA_CLSSFCN

B<AA_CLSSFCN> is an amino acid fragment classfication.
Return the number of classes.

=item ac_size AA_CLSSFCN

B<AA_CLSSFCN> is an amino acid fragment classfication.
Return the fragment length. The intention is only simple for the
case of sequence classes. For structural classifications, there
may be more than I<N> descriptors for I<N> residues.

=item ac_read FNAME

Read an amino acid fragment classification.
Return a B<Aa_clssfcn> object if successful, undef otherwise.

=item ac_size AA_CLSSFCN
B<AA_CLSSFCN> is an amino acid fragment classification.
Return the fragment size used in the classification.


=item blst_chk_read FNAME

Read a BLAST checkpoint file from B<FNAME>. Return a B<Seqprof>
object if successful, undef otherwise.

=item blst_chk_write FNAME Seqprof

Writes a BLAST checkpoint file to B<FNAME>. Returns 1 if successful,
and 0 otherwise. This profile is functionally equivalent to profiles
written by psi-blast and used by wurst, but not byte-equivalent (due
to floating point noise).


=item get_clssfcn FNAME abs_error

Reads an classification generated by AutoClass-C.
Do not forget to give an absolut error value from your original data
(it will be used for the integration in computeMembership() ).
Returns a B<Clssfcn> object if successful, undef otherwise.


=item coord_2_bin COORD FNAME

Take a B<Coord> structure and write it, in our proprietary .bin
format, to FNAME.  Returns 1 on success and undef on failure.

This function may be used when writing the results of a
calculation, but more likely, it is used when converting PDB
files to our C<.bin> format. Look at pdb_read.

=item coord_c_n_dist COORD i, j, sqrt_flag

Given a B<Coord> structure, COORD and the index of two residues,
B<i> and B<j> return the distance, in Angstroms between the
carbonyl oxygen of B<i> and the backbone nitrogen atom of
B<j> if B<sqrt_flag> is set to non-zero. If the flag is left as
zero or B<undef>, then the value returned will be the distance
squared.

In a protein, the bondlength is about 1.32 Angstrom. Many scripts
just want to check for continuous chains, so they do not need the
real distance and it is faster to avoid the square root. For
example, to check if the third and fourth residues are bonded,
you might say

    my $dist2 = 1.5 * 1.5;
    if (coord_c_n_dist ($coord, 2, 3, 0) > $dist2) {
        print "not connected\n";
    } else {
        print "connected\n";
    }

=item coord_deletion COORD Start Coord_Length Seq_Length

Returns a pointer to a new B<Coord> (structure) where an
affine gap has been introduced, arising from the deletion
of some sequence or coordinate data from an existing structure.
The gap is specified by its B<Start> [0,size), and number
of residues excised from the structure (B<Coord_Length>)
and sequence (B<Seq_Length>).

Minimal checking ensures gap specifications are sensible.
This function is provided for the generation of synthetic
threading results and other forms of improper structure
sets.

=item coord_geo_gap COORD SCALE MAX

Calculate a penalty based on the geometric damage in a
structure. This is based on the carbonyl carbon to amide nitrogen
distance between each pair of residues which should be
connected. The ideal C..N distance is 1.32 Angstrom. No penalty
is enforced if the distance is less than 2.0 Angstrom.
COORD is a set of coordinates.  SCALE is the number
with which penalties will be multiplied.  MAX is a limit on the
distance to be considered. For each gap, if it is bigger than
MAX, it is set to MAX.

This function returns a list of values like:

 ($quadratic, $linear, $logistic, $num_gap)
              = coord_geo_gap ($coord, $scale, $max);

Where

=over

=item $quadratic

Sum over (distance - ideal)^2 for each gap, where ideal is hard
coded to about 1.32 Angstrom.

=item $linear

This is the sum over (distance - ideal) for each gap.

=item $logistic

This uses a logistic activation function. The penalty is the sum
of a logistic activation function applied to each gap.

=item $num_gap

This is the number of gaps, regardless of length.

=back

=item coord_get_seq COORD

Returns a pointer to a B<Seq> (sequence) given a pointer to a
B<Coord>.

=item coord_has_sec_s COORD

Returns non-zero if B<COORD> has defined secondary structure.

=item coord_get_sec_s COORD

Returns a string containing the secondary structure assignment
for the coordinates, if they exist. Side-effect: a non-serious
warning message is generated if COORD has no assigned secondary
structure.

=item coord_name COORD

Print the PDB acquisition code and chain identifier from the
coordinates, COORD (an object of C<Coord> type).

=item coord_read FNAME

Read coordinates from file FNAME. Returns a B<Coord> pointer.

=item coord_rmsd PAIR_SET, COORD1, COORD2, SUBSET_FLAG

=item coord_rmsd PAIR_SET, COORD1, COORD2

Superimpose COORD1 onto COORD2 and calculate the root mean square
difference of coordinates in Angstroms, based on an alignment. Return
a list with shifted coordinates and the rmsd value like this

  ($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2)

or

  ($rmsd, $new_c1, $new_c2) = rmsd ($pair_set, $c1, $c2, SUBSET_FLAG)

$rmsd is the root mean square difference. $new_c1 is a coordinate
object which has been moved. It may be smaller than $c1 (see note
below about SUBSET_FLAG). $new_c2 may be identical to $c2.

$pair_set has come from some kind of alignment such as sequence to
sequence or structure to structure. C<$c1> are the coordinates to be
moved. C<$c2> are the reference coordinates. If defined, the optional
argument, SUBSET_FLAG, means that one walks down the list coordinates
of C<$c1> and C<$c2> and copies into C<$new_c1> and C<$new_c2> those
sites which were aligned.

If the alignment of two proteins includes most of the original
coordinates, then one will not usually want to define SUBSET_FLAG. If
the two proteins are very different sizes, it makes sense to define
this flag (and only see the similar regions).
If SUBSET_FLAG is not set, $new_c2 is an exact copy of $c2.

=item coord_size COORD

Return an integer with the number of residues in COORD, a
C<Coord> object.

=item coord_2_pdb FNAME COORD SEQ

=item coord_2_pdb FNAME COORD

Write the coordinates from COORD (a B<Coord> pointer) to the
filename given by FNAME. Returns nonzero on success and the
undefined value otherwise.

If the optional last argument, SEQ, is present, it should be a
sequence object and will be used to print the original
(hopefully) complete sequence in the pdb file in standard, PDB,
SEQRES format records.

=item coord_2_spdb FNAME COORD SCOR_SET

=item coord_2_spdb FNAME COORD SCOR_SET SEQ

Identical to B<coord_2_pdb> (above), this function additionally
writes a temperature value in the generated PDB file, corresponding
to the score given in SCOR_SET (a B<Scor_set> pointer) for each
residue in COORD.

See the entry for the function B<scor_set_simpl> (below) for
an example.


=item coord_2_pnlty COORD VALUE

COORD is a C<Coord> type object. VALUE is a floating point
number.
Returns a FloatPtr (floating point array) which will be used
for extra gap penalty weights during alignments.
At each site in COORD, see if it is in secondary structure. If
so, it gets an extra weight, given by VALUE. The exact method
is more complicated.

If we have a structure which looks like

    1 2 3 4 5 6 7
    - E E E H - -

The we say that "E" and "H" refer to extended (beta strand)
and helix, so residues S<2 - 5> look as if they are in
interesting secondary structure. In fact, the situation is
more interesting. Residue 1 is adjoining a piece of secondary
structure, and residue 2 is on the edge of one. Only residues
S<3 - 4> are really definitely inside secondary structure and
residue 7 is definitely outside one.
If VALUE is set to 10, then the coefficients will be

 residue  coefficient
 1        3.25
 2        7.75
 3       10
 4       10
 5        7.75
 6        3.25
 7        1

=item dmat_b_cliques MODEL REFERENCE BOUND SIZE

Returns a list of intervals on MODEL's coordinates
corresponding to 'flat' areas of the difference of distance
matrices between MODEL and REFERENCE (which could be the native
fold of the sequence modelled by MODEL).

The list should be interpreted as ranges in the residues
of MODEL :

   start1, end1, start2, end2, ...

These define contiguous stretches of MODEL's structure whose
distance contacts differ by less than BOUND angstroms from
those in REFERENCE. Each interval is at least of size SIZE.

There are some default parameters :

      my @good_regions_of_model =
          dmat_b_cliques $model, $native;

Returns the flat regions longer than 9 residues where
DME is less than 1 angstrom.

=item dme_thresh FRACTION COORD1 COORD2 THRESHOLD

Compare the distance matrices from COORD1 and COORD2. Calculate
the fraction of the distance matrix left after setting a
threshold of THRESHOLD. Put this fraction into FRACTION.

Returns the useful answer in FRACTION, but returns undef on
error.

DME stands for "distance matrix error", a name coined in S<Havel, TF>,
Biopolymers, 29, 1565-1585 (1990), but it is really the root mean
square difference of distance matrices. To compare structures we
calculate the DME based on alpha carbon coordinates. We then go
back to the distance original distance matrices and find the most
different matrix elements and enter a loop:

  while (DME > threshold)
      remove most different element from calculation
  return the fraction of the distance matrix remaining

If the two sets of coordinates were very similar, then we did not
have to remove any matrix elements, so we return a value near
1.0. If the coordinates were very different, we had to throw away
most matrix elements, so we return a number closer to 0.0.

A good value for THRESHOLD is about 3.5 or 4.0 Angstrom. This
measure of similarity is bounded by 0 and 1.0 and is not too
sensitive to protein size.

=item dme_nice MODEL REF BOUND SIZE CORE CORE_BORDER

Returns a list of intervals defining bounded regions of
the difference of distance matrices between MODEL and REF.
Whereas in dme_b_cliques, the intervals define bounded regions,
the 'nice' regions defined by this function allow for some
deviations at the boundary of the flat regions of a DME plot.
SIZE refers to the structural fragment size, and CORE refers
to the minimum size of a bounded region expressed as the number
of overlapping fragments.
CORE_BORDER is the size of the boundary region where deviation
is allowed. Essentially this means that pairs of intervals that
would be returned by dmat_b_cliques will be merged if they are
separated by fewer than CORE_BORDER residues.

=item dssp COORD

Run the DSSP program on the coordinates in COORD.  Store the
answer in COORD.

=item make_model PAIR_SET SEQ COORD

Given an alignment in PAIR_SET (a B<Pair_set> pointer) which
comes from sequence SEQ aligned to coordinates COORD, return a
B<Coord> pointer with a model of the sequence on the coordinates.

=item model_pdb_num COORD MODELNUM

Given a B<COORD OBJECT> and a residue position, returns the
original PDB number stored for that residue.

=item model_res_num COORD RESIDUENUM

Given a B<COORD OBJECT> and a residue number (according to PDB
sequence numbering), returns the index for that residue in the
coord object, or undefined if there is no coordinates for that
sequence number.

=item num_seq SEQ_ARRAY

Returns the number of sequences in a sequence array.

=item pair_set_coverage PAIR_SET SIZE1 SIZE2

Given an alignment, we often want to know what part of the
sequence or structure is accounted for. For example

 A B C - - D E - F G H
     Q R S T U V W - Y Z

In the first string, you could imagine the residues we
know about are like

 0 0 1 1 1 1 0 1

while in the second string, the pattern is like

 1 0 0 1 1 0 1 1

Call this function with a pair set and the size of the first and
second objects (sequences or sequence/structure pair).
pair_set_coverage returns two lists. One for each member of the
alignment respectively.

This returns a pair of strings. In each string, a '0' char means
the site is not covered by the alignment, but a '1' means it
is.  On failure, returns undef. A typical use would be

 my ($s1, $s2) =
     pair_set_coverage ($pair_set, seq_size ($seq), coord_size ($coord);
 print "Pattern of sequence coverage looks like\n$s1\n";
 print "Pattern of struct coverage looks like \n$s2\n";

and the output strings might look like

  00100111

Which would mean that the 3rd, 6th and 7th positions were
properly aligned.  Note also that this leaves room for some
tricks.
Strings can be binary &'d to look for overlap and '1's can be
quickly counted using perl matching operators.

=item pair_set_extend PAIR_SET $EXT_LONG

=item pair_set_extend PAIR_SET $EXT_SHORT

If you have done a Smith and Waterman style alignment, you
may not have all residues aligned. Consider

   A  B  C  D  Q  R  S  E
         X  Y  Q  R  S  X  Y

The alignment from a highest scoring segment is

               Q  R  S
               Q  R  S

and this is what we will likely get if we don't ask
otherwise.
but we could reasonably do a short extension to include
residues which go out to the ends of the shorter sequence.

         C  D  Q  R  S  E
         X  Y  Q  R  S  X

One may also want a longer extension, which would look like

   A  B  C  D  Q  R  S  E  -
   -  -  X  Y  Q  R  S  X  Y

Given these definitions of extension, (long and short), you
may want either result.

The symbols, C<$EXT_LONG> and C<$EXT_SHORT> are constants. You
should choose one of them.

=item pair_set_gap   PAIR_SET SCALE_OPEN SCALE_WIDEN

Like C<coord_geo_gap>, this is for calculating extra gap
penalties after doing an alignment. This returns a gap penalty
for the sequence, not the structure.  It returns two values like

  my ($open_penalty, $widen_penalty);
  ($open_penalty, $widen_penalty)
     = pair_set_gap($pair_set, $o_scl, $w_scl);

For each gap, the penalty is calculated as

  penalty = open_scale + (n - 1) * widen_scale

where C<n> is the length of the gap.  Because these things are
linear, you can get the number of gaps in a whole alignment from
S<C<n_gap = $open_penalty / $open_scale>>.

=item pair_set_score PAIR_SET

This returns two floating point numbers. The first is the score
from an alignment, including gaps. The second is the score, not
including gaps. The intentions are to

=over

=item *

Let one look at the pure sequence-structure score and

=item *

Let one use a more sophisticated gap scoring scheme than the one
used in the alignment calculation.

=back

For example, we have have something like

 my $pair_set = score_mat_sum_sec (.....);
 my ($score, $simple_score) = pair_set_score ($pair_set);
 print "Score with gaps is ", $score,
       "without gaps is ", $simple_score, "\n";

For compatibility with old scripts, it is still OK to say

 my $score = pair_set_score ($pair_set)

=item pair_set_string PAIR_SET SEQ1 SEQ2

Return a string for printing. PAIR_SET is an alignment
corresponding to SEQ1 and SEQ2. If this came from a sequence
to structure alignment, we still want the sequence
corresponding to SEQ2. This is easy to get from

   coord_get_seq (C)

but this routine should be renamed C<seq_from_coord()>.

=item pair_set_pretty_string PAIR_SET SEQ1 SEQ2 SEC_S_DATA COORD2

=item pair_set_pretty_string PAIR_SET SEQ1 SEQ2


Return a string with a pretty alignment.

If you have secondary structure data (SEC_S) which has probably
come from a predictor and if you provide the coordinates
(COORD2), then the output will include this secondary structure
information.

=item param_fx_read FNAME

Read parameters for Thomas' Bayesian scoring / alignment
functions.
Returns an FXParam object on success, or undefined on failure.

=item param_rs_read FNAME

Read parameters for the tanh() based score function with all
atoms. This is for rescoring, not alignment.
Returns an RSParam object on success, or undefined on failure.


=item pdb_read FNAME ACQ_CODE CHAIN

Read the pdb file, FNAME, and return a Coord object. Note that
this will not include secondary structure information.  ACQ_CODE
is the acquisition code. It can be blank, in which case, the
program will try to guess the code from the filename. CHAIN can
also be blank, or a single letter, chain specifier.

This function is for rewriting lots of files, for example, when
we rebuild a library. This means it tolerates errors rather than
stopping. Furthermore, it contains a list of residue name
substitutions. For example, it will change

 original new_name   comment
 ---------------------------
 UNK      ALA        Unknown residues become alanine
 MSE      MET        Selenomethionine becomes methionine
 CSE      CYS        Selenocysteine becomes cystein

There are currently more than 20 of these conversions. They are a
one way affair. After reading the coordinates, the original name
is thrown away.


=item prob_vec_info PROB_VEC

B<PROB_VEC> is a matrix of class probabilities. This returns a
string containing some minimal information. Functions like
L<"seq_2_prob_vec"> return a probability vector.

=item scor_set_simpl PAIR_SET SCORE_MAT

Returns a B<Scor_set> object, corresponding to the local
score for each aligned pair of residues in PAIR_SET, based on
the scores in SCORE_MAT. These local scores, when summed, yield
the 'simple score' returned by B<pair_set_score>.

The obvious use of this function is to generate an annotated
PDB file :

  coord_2_spdb ("1mypC.pdb", make_model($pair_set, $seq, $template)
                           , scor_set_simpl($pair_set,
                                            $tot_score_mat));


Where the temperature entry of the each residue in the model will
be set to the local score given in the score matrix.


=item scor_set_fromvec @SCORES

Returns a B<Scor_set> object containing the vector of scores given
in @SCORES.


=item scor_set_to_vec $scor_set

Returns a vector of doubles corresponding to the scores contained
in the B<Scor_set> object.  This is useful when you want to actually
print out the scores for an alignment :

    # float_to_character maps a signed float to a
    # scale of alphanumerics

    sub d2c( $ ) { return float_to_character ( $_ ) };

    my $i=0;
    my $cs = pair_set_coverage( $pairset, $seq_len, $coord_len );
    my $scor = scor_set_to_vec( $scor_set); # from $pair_set
    $cs=~ s/0/./g;
    $cs=~ s/1/d2c($$scor[$i++]);/ge;

    # $cs is now a coverage string where every aligned position is
    # scored according to float_to_character.


=item scor_set_scale SCOR_SET SCALE

Returns true if it managed to divide each entry in the B<Scor_set> by
the scalar in SCALE.

=item score_fx MATRIX SEQ COORD PARAM

Fill out a score matrix for a sequence structure pair, based on
Herr Doktor Huber's Bayesian-based score function. This score
function works for sequence to structure alignments.

MATRIX is a new created score matrix, probably returned from
B<score_mat_new>. SEQ is a sequence object (SeqPtr) COORD is a
coordinate object (CoordPtr). PARAM are parameters from
C<param_fx_read>. They have type FXParam and this type checking
is enforced by the interpreter.

=item score_fx_prof MATRIX SP COORD PARAM

This is the same as score_fx, but the second argument, C<SP>, is
a sequence profile, rather than a sequence.

=item score_mat_add (MAT1, MAT2, SCALE, SHIFT)

Return a new score matrix where each element is given by

 NEW_MAT = MAT1 + ( SCALE * MAT2 + SHIFT)

If only three arguments are given, SHIFT is set to zero.

=item score_mat_info MAT

Given a Score_mat object in MAT, return a list of the form
  (MIN, MAX, AV, STD_DEV)
Ignore the first and last row and column, since they are used for
treatment of end gaps.

Note, the calling procedure for this function changed in S<May
2006>, so old scripts may break;

=item score_mat_new N_ROWS N_COLS

Create a new score matrix and return a Score_mat object.
In a sequence to structure alignment, N_ROWS is the size of
the sequence, N_COLS is the size of the structure. So, if
C<$seq> is a sequence and C<$coord> is a structure, we might
have

  $scr_mat = score_mat_new (seq_size ($seq), coord_size ($coord));

You B<are> entitled to assume that the new matrix is properly
zeroed. Code should not bother to zero it again.

=item score_mat_scale MAT, SCALE

Multiply all elements in MAT by SCALE. This returns a new matrix.

=item score_mat_shift (MAT1, SHIFT)

Return a new score matrix where each element is given by

 NEW_MAT = MAT1 + SHIFT

The first and last row and column are not shifted. This is one
way to implement end gaps.  This function returns a new matrix.


=item score_pvec MATRIX, PVEC1, PVEC2

Fill out the score matrix, B<MATRIX> based on comparing the
class membership probability vectors B<PVEC1> and B<PVEC2>.
These probability vectors check the length of the original
fragments and will return B<undef> if they do not match.

=item score_mat_read FNAME

Go to F<FNAME> and read the score matrix that was probably
written by L<score_mat_write>. Return a new score matrix or undef
on failure like

 my $scr_mat = score_mat_read ($fname);
 if (! $scr_mat) {
     print STDERR "Reading score matrix from $fname failed.\n"; }

=item score_mat_string SCORE_MAT SEQ1 SEQ2

This is mainly for debugging. After calling a score function,
this will return a string containing the score matrix with the
sequence at the top and left hand side. If you did a sequence
to structure alignment, you should still call it with the
sequences.

=item score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE

=item score_mat_sum_full RMAT, SCORE_MAT, PGAP_OPEN, PGAP_WIDEN, QGAP_OPEN, QGAP_WIDEN,P_MULT, Q_MULT, ALIGN_TYPE, BIAS_SET

This does the summation of a score matrix. Returns non-zero on
success and zero on failure.
Note, there is an optional last argument.
The parameters are

=over 8

=item RMAT

This is used for returning the summed matrix. The function
does not overwrite SCORE_MAT. Instead it returns a new
matrix. This can generally be ignored.

=item SCORE_MAT

This is the score matrix which came from a call to something
like B<score_smat>.

=item PGAP_OPEN

Penalty for opening gaps in the first of the pair of sequences
or structures or whatever.

=item PGAP_WIDEN

Penalty for extending the PGAPs.

=item QGAP_OPEN

Penalty for opening gaps in the second of the pair.

=item QGAP_WIDEN

Penalty for extending the QGAPs

=item P_MULT

This is an object of C<FloatPtr> type with site specific
coefficients for gap penalties and will be applied to the
P_GAP. Typically, this will be for extra penalties for
secondary structure gaps.

=item Q_MULT

As for P_MULT, but for the q_gaps.

=item ALIGN_TYPE

Either "$N_AND_W" or "$S_AND_W" for Needleman and Wunsch or
Smith and Waterman respectively.

=item BIAS_SET

This is a pair_set which may have come from a previous
alignment. It really is an object of Pair_setPtr type.  It is
optional. If present, the alignment will be coerced to follow
this alignment.  The use is that one can do an Smith and Waterman
alignment and get the best scoring segment. One can then set the
ALIGN_TYPE switch to C<$N_AND_W> and do a globally optimal
alignment, but passing through the optimal segment.

=back

=item score_mat_write SCORE_MAT FNAME

Write the score matrix SCORE_MAT to the file name given by
FNAME. Return undef on failure.


=item score_rs COORD PARAMS

This will apply the tanh() based score function which is
primarily for rescoring. It is B<NOT> neighbour-non-specific
or unusual in any other way.

COORD is a Coord object.
PARAMS is a set of parameters, formally of type RSParams and
probably resulting from a call to param_rs_read.

Returns a floating point number.

=item score_smat SCORE_MAT, SEQ1, SEQ2, SUB_MAT

Fill out a score matrix based on sequence comparison of two
sequences.
SCORE_MAT is a score matrix which may have come from
new_matrix.
SEQ1 and SEQ2 are the sequences.
SUB_MAT is a substitution matrix.

=item score_sprof SCORE_MAT, PROF, SEQ, SUB_MAT

This is very similar to B<score_smat>.
SCORE_MAT is a score matrix, probably from B<new_matrix>.
PROF is a sequence profile, maybe from B<blst_chk_read>.
SEQ is a sequence and SUB_MAT is a substition matrix.
This fills out a score matrix using the substition matrix it has
been given and fractional sequences from the sequence profile.

=item score_sec_s SCORE_MAT, SEC_S_DATA, COORD

Given a SCORE_MAT which probably came from C<score_mat_new>, and
some secondary structure data in SEC_S_DATA, probably from
C<sec_s_data_read> and a set of coordinates in COORD, do a
scoring.

We compare each site against the predicted secondary
structure. If the psi angle difference is more than 90 degrees,
we set it to 90. We then return cos (difference). Note, this is
different behaviour to earlier versions. If an angle is very
different (diff > 90), it it not penalised. If the angle is less
than 90, a value will be returned between 0 and 1.0

=item sec_s_data_read FNAME

Read secondary structure data from FNAME.
This may be in PHD/Rost format or our
L<"manual input format"/item_manual_secondary_structure_file>
format defined below.
Return a pointer to an object of Sec_s_data type.

=item sec_s_data_string SEC_S

SEC_S is a pointer to a Sec_s_data object.
Return a string with the bare secondary structure
information. For example:

    $x = sec_s_data_read ("sec_data_file")
           || die "Fail on $tmp_data: $!";
    print sec_s_data_string($x);

=item seq_2_prob_vec SEQ, AA_CLSSFCN
Given a sequence, B<SEQ> and an amino acid classification,
B<AA_CLSSFCN> return an object of B<PROB_VEC> type. This is most
interesting for feeding to a score function like B<score_pvec>.

=item seq_read FNAME

Read a sequence from file FNAME.
Returns a SeqPtr object.

=item seq_read_many FNAME

Reads a number of sequences from file FNAME.
Returns a SeqArrayPtr object (array of sequences).


=item seq_from_string STRING

Returns a sequence object C<SeqPtr> from a string. This lets
one build a sequence in the interpreter like

 $seq = seq_from_string ('avlc');

Would store a sequence (Ala, Val, Leu, Cys) in the $seq
variable.  The string can have a FASTA style comment at the
start. The string can span multiple lines.

=item seq_get_1 SEQ_ARRAY N

Return the N'th string from a sequence array, SEQ_ARRAY.
Returns a SeqPtr object.

=item seq_num SEQ_ARRAY

Returns the number of sequences stored in the sequence array,
SEQ_ARRAY.

=item seq_print SEQ

Return a sequence to the interpreter as a string. It is not
pretty and will be fixed.

=over 8

=item *

The sequence will be beautified so it comes out in blocks of
ten residues, probably with numbering.

=back

=item seq_print_many SEQ_ARRAY

Returns a string containing all the sequences in SEQ_ARRAY.

=item seq_size SEQ

Return the number of residues in a sequence, SEQ.

This is the number of residues in a sequence. There are no
complications or frills like null terminators.

=item seqprof_get_seq SEQPROF

This is analogous to coord_get_seq. It takes a sequence profile
and returns a B<Seq> sequence.

=item seqprof_str SEQPROF

Take the sequence profile in the SEQPROF object and return a
printable string. For example

    if ( ! ($profile = blst_chk_read ( $fname))) {
        return undef; }
    print seqprof_str ($profile);


=item struct_2_prob_vec COORD CLSSFCN
Computes the memberships matrix (probability vector) for a
structure COORD given a CLSSFCN


=item sub_mat_get_by_i SUB_MAT, I<M>, I<N>

Return the substition matrix element indexed by B<M> and B<N>

=item sub_mat_get_by_c SUB_MAT, I<A>, I<B>

Return the substitution matrix element for the amino acids I<A>
and I<B>. So S<C<$a = sub_mat_get_by_c ($sub_mat, 'c', 'd');>>
would return the current value for cys/asp. The amino acid names
are single letter codes and may be upper or lower case.

=item sub_mat_set_by_c SUB_MAT, I<A>, I<B>, I<val>

Set the substitution matrix element for I<A> and I<B> to
I<val>.

=item sub_mat_set_by_i SUB_MAT, I<M>, I<N>, I<val>

Set the value indexed by I<M>,I<N> in the substition matrix to I<val>.

=item sub_mat_string SUB_MAT

Return a string containing the S<substitution / score> matrix
held in SUB_MAT.

=item sub_mat_read (FILENAME)

Go to FILENAME. Read up the substitution / score matrix and
return it.

=item sub_mat_shift SUBST_MATRIX, BOTTOM

Given a substitution matrix (an object of type Sub_mat), shift
the whole matrix so the smallest (most negative value) is of
size BOTTOM.  This does not return anything. It acts on the
SUBST_MATRIX argument directly.

=item sub_mat_scale SUBST_MATRIX, BOTTOM, TOP

Given a substition matrix, SUBST_MATRIX, scale and shift it so
the minimum and maximum values run from BOTTOM to TOP.

=item score_mat_sum_smpl NEW_MAT SCORE_MAT PGAP_OPEN PGAP_WIDEN QGAP_OPEN QGAP_WIDEN ALIGNMENT_TYPE

We have a score matrix which could be from sequence/sequence,
sequence/structure or whatever. Now, do the dynamic
programming work. Sum the score matrix and return a set of
pairs.

The parameters are

=over

=item NEW_MAT

This is a fresh matrix with the traced back scores in it.

=item SCORE_MAT

This is the score matrix.

=item GAP_OPEN

=item GAP_WIDEN

=item ALIGNMENT_TYPE

There are only two values allowed, either

  $N_AND_W

or

  $S_AND_W

These stand for "Needleman and Wunsch" and "Smith and
Waterman" respectively.  Any other value will cause an error.

=back

=item svm_rs_cdata MODEL NATIVE SCOR_SET RS_PARAM CVTYPE

*EXPERIMENTAL!*

The function returns an array of training vectors suitable
for use in training a support vector machine (libSVM.pm) or
some other machine learning procedure. Its form is :

  [ [label_class, [(feature vector)], .. ]

The scheme for calculating the training vectory is given
in CVTYPE, and the data is formed from the local sequence
to structure scores as given by SCOR_SET, the TANH forcefield
based pairwise interaction terms (calculated via RS_PARAM),
and the local model consistency (based on the difference of
distance matrices computed between MODEL and NATIVE).

Scheme 0 works as follows :
(see scoranlys.c:get_svmdata for details at the moment).

=item svm_rsfeat MODEL SCOR_SET RS_PARAMS CVTYPE

This returns a set of feature vectors for each position in
MODEL, calculated from local sequence-structure fitness and
residue-specific interaction terms according to the CVTYPE
scheme (see svm_rs_cdata or scoranlys.c for details).  The
form is as follows :

  my @m_fvset = svm_rsfeat MODEL, SCOR_SET, PARAMS, 0
  @m_fvset is of form
    [ (undef), [feature vector], .., .., (undef)]
  and (scalar @m_fvset) == coord_size(MODEL)

undefs are given for positions in the model where a full
feature vector cannot be computed (at the ends, for instance).


=back

=head1 BUILD AND INSTALL

Wurst is mostly migrated to the standard perl build system. This
means you should go the top level directory and try

   perl Makefile.PL
   make
   make install

This would try to install into the system directories, so a
better tactic might be

   perl Makefile PREFIX=~/myperl LIB=~/myperl/lib
   make

This usually builds without problem. Check with scripts from the
F<t> directory. Then install into your own account with

  make install

Your scripts would then have to start with lines like

   use lib "$ENV{HOME}/myperl/lib

If this looks OK, you might

  cd scripts
  perl hello.pl

This will check if Wurst pieces appear to be in place. If that
looks OK, then edit F<wurst/src/Wurst/Makefile.PL> to set the
installation destination.  Then

  make install

from the top level directory. Then go back to the scripts
directory and try a different file like

  cd scripts
  perl hello2.pl

=head1 FILE FORMATS

=over

=item PHD secondary structure files

Wurst can read secondary structures from the PHD prediction
server. The format is empirically defined.  That means we try
to read anything that comes from the server.

=item manual secondary structure file

One may type in secondary structure predictions or assignments.
The format is like:

  # speculative secondary structure assignments
  secondary structure
  5 - 30 h
  37 e
  40-45 e 5  # This is a very unreliable guess

=over

=item *

This means that residues 5 to 30 are helix.
Residue 37 is sheet (extended).
Residues 40 to 45 are sheet (extended) and they have a confidence
level of 5.

=item *

The first non blank line must say "secondary structure". This is
not optional. It is used to recognise the file format.

=item *

Confidence levels can be given from 0 to 9. Zero means there is
no confidence. 9 means you are very confident. This number must
be an integer. There is no way to give a more detailed number.

=item *

Confidence levels are optional. The default is confidence=9.

=item *

Anything after a hash C<#> is a comment.

=item *

Blank lines are ignored.

=back

=back

=head1 AUTHORS

Alphabetically...
Steve Hoffmann,
Thomas Huber,
Tina Lai,
Nasir Mahmood,
Thomas Margraf, Martin Mosisch, James B. Procter,
Gundolf Schenk,
Andrew E. Torda.

=head1 SEE ALSO

coding.pod contains rules for adding to wurst.

=cut