The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

MassSpec::CUtilities - Perl extension containing C utilities for use in mass spectrometry

SYNOPSIS

  # MassSpec::CUtilities is an XS module so there's a chance that you or
  # your target user might not get it to install cleanly on the target system;
  # therefore it's recommended to make its presence optional and to offer
  # slower-performing Perl equivalents where practical.
  # 
  # Also note that this module uses a 19-letter amino alphabet rather than
  # the traditional 20-letter alphabet, since the isobars Leucine(L) and
  # Isoleucine(I) are represented instead by "X."  Furthermore some portions
  # of this module assume that their input peptides are internally in
  # alphabetical order.
  my $haveCUtilities;
  if (eval 'require MassSpec::CUtilities') {
          import MassSpec::CUtilities;
          $haveCUtilities = 1;
  } else {
          $haveCUtilities = 0;
  }
  if ($haveCUtilities) {
          my $candidate = MassSpec::CUtilities::encodeAsBitString("ACCGT");
          my @shortPeptides = ("ACT","CCGM","ACCGTY","CCT");
          my (@list,@answer);
          foreach $_ (@shortPeptides) {
                  push @list, MassSpec::CUtilities::encodeAsBitString($_);
          }
          if (MassSpec::CUtilities::testManyBitStrings($candidate,\@shortPeptides,\@list,\@answer)) {
                  # should print "ACT" and "CCT" only
                  print "Matched: " . join(',',@answer) . "\n";
          }
   }
   # see API documentation for other available subroutines

ABSTRACT

  An eclectic mix of C utilities originally used in a mass spectrometry
  denovo sequencing project at NIH.  It includes a fast Huffman decoder
  suitable (with minor modifications) for use with the CPAN module
  Algorithm::Huffman, as well as a fast peptide mass calculator and methods for
  encoding peptides as products of prime numbers or as bitmaps.

DESCRIPTION

An eclectic mix of C utilities originally used in a mass spectrometry denovo sequencing project at NIH. It includes a fast Huffman decoder suitable (with minor modifications) for use with the CPAN module Algorithm::Huffman, as well as a fast peptide mass calculator and methods for encoding peptides as products of prime numbers or as bitmaps.

PROGRAMMER'S INTERFACE (API)

computePeptideMass(peptide)

Compute the mass of a peptide in daltons.

initDecoderTree(Huffman_code,values)

Initialize a Huffman decoder tree given a Huffman code and the delta-values associated with the corresponding code. This could be easily generalized for a more common use of Huffman trees, mapping Huffman codes to the actual decoded values. However, since AACs are always guaranteed to be sorted, we're encoding the deltas (differences) between consecutive characters rather than the character itself. See the code for more details.

fast_decode(encoded,len)

Perform Huffman decoding, assuming that the decoder tree has already been initialized by a call to initDecoderTree().

bitsAvailableForPrimeProducts()

Report the number of bits available for computing prime products based upon the local hardware architecture and C compiler support. In general, less than 64 bits isn't very useful, and anything greater than 64 bits would be fantastic.

computePrimeProduct(peptide)

Compute an encoded value for a peptide, using a code which maps each amino acid to one of the lowest 19 prime numbers and then computes their (multiplicative) product. Returns undef if there aren't enough available bits to encode this prime.

(N.B.: one might do better using 71 instead of 2 and thereby be able to perform arithmetic modulo 2**N, where N might typically be 16 or 32)

testManyPrimeProducts(product,peptides,multiples,answer)

Perform bulk testing of a single encoded peptide-product versus a collection of similarly encoded products, to see which of the set <@multiples> are exact divisors of $product, which corresponds to a string being an extension of one of the strings in @peptides. Any successful hits are logged in the @answer, and the hit count is returned as the subroutine's value.

computeManyPrimeProducts(peptides,multiples)

Like computePrimeProduct, but attempts to compute these products for multiple peptides and terminates with a non-positive value if any of these computations fail due to a lack of bits in one of the prime products.

encodeAsBitString(peptide)

Yet another approach to comparing whether one peptide is an extension of one or more of a set of other peptides. Here we determine the maximum number of each amino acid which can "fit" in a peptide of a given maximum mass (currently hard-coded as 2000 daltons). Then we encode one bit for each of the possible occurrences of each amino acid (usually wasting a lot of bits).

testManyBitStrings(encodedCandidate,peptides,bitstrings,answer)

Test the bitstrings encoded in encodeAsBitString() against one another in a manner analogous to testManyPrimeProducts().

quickAACLookup(minmass,maxmass,answer)

For a small in-memory database (AACs up to 4 residues long), find all the AACs associated with a mass range.

binarySearchSpectrum(mass,extended_spectrum)

Perform a binary search of extended_spectrum, and return the index of the entry just greater than mass.

THANKS TO

Assaf Rahav (rahav_assaf@yahoo.com) who suggested the prime product idea
Jack Chen (xchen@helix.nih.gov) who wrote the C code for portions of the Huffman decoder

SEE ALSO

Spengler, B. De novo sequencing, peptide composition analysis, and composition-based sequencing: a new strategy employing accurate mass determination by fourier transform ion cyclotron resonance mass spectrometry. J Am Soc Mass Spectrom. 2004 May;15(5):703-14.

Any good computer science textbook about data structures

TODO: add journal citation(s) here

AUTHOR

Jonathan Epstein, <Jonathan_Epstein@nih.gov>

COPYRIGHT AND LICENSE

                          PUBLIC DOMAIN NOTICE

        National Institute of Child Health and Human Development

 This software/database is a "United States Government Work" under the
 terms of the United States Copyright Act.  It was written as part of
 the author's official duties as a United States Government employee and
 thus cannot be copyrighted.  This software/database is freely available
 to the public for use. The National Institutes of Health and the U.S.
 Government have not placed any restriction on its use or reproduction.

 Although all reasonable efforts have been taken to ensure the accuracy
 and reliability of the software and data, the NIH and the U.S.
 Government do not and cannot warrant the performance or results that
 may be obtained by using this software or data. The NIH and the U.S.
 Government disclaim all warranties, express or implied, including
 warranties of performance, merchantability or fitness for any particular
 purpose.
 

Please cite the author in any work or product based on this material.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.