


NAME - decoy input databanks using several methods


Reads an input FASTA file and produces a decoyed databank using one of several methods:

reverse: simply reverse each sequence
shuffle: shuffle the amino acids in each sequence
shuffle & avoid known cleaved peptides: shuffle each sequence but avoid producing known tryptic peptides
Markov model: learn the Markov chain distribution of a given level, then produce entries that follow this distribution
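The two simplest methods can be illustrated with a minimal Python sketch. This is not the script's own code: the function names reverse_sequence and shuffle_sequence and the example sequence are hypothetical, chosen only for illustration.

```python
import random

def reverse_sequence(seq: str) -> str:
    """Return the sequence read backwards."""
    return seq[::-1]

def shuffle_sequence(seq: str, rng: random.Random) -> str:
    """Return a random permutation of the residues in seq."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

rng = random.Random(0)  # fixed seed, only to make this sketch repeatable
target = "MKWVTFISLLK"  # hypothetical example sequence
print(reverse_sequence(target))       # KLLSIFTVWKM
print(shuffle_sequence(target, rng))  # same residues, random order
```

Both decoys keep the length and amino acid composition of the original entry; only the order of the residues changes.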


  #reverse sequences for a local (optionally compressed) file
  --in=/tmp/uniprot_sprot.fasta.gz --method=reverse

  #download a databank from the web, uncompress it and shuffle the sequences
  wget -silent -O - | zcat | --method=shuffle

  #use a .dat file (with splice forms) as input
  --in=uniprot_sprot_human.dat | --method=markovmodel

  #reverse each sequence
  --ac-prefix=DECOY_ --in=mitoch.fasta --method=reverse --out=mitoch-reverse.fasta

  #draw amino acids following the distribution in the original fasta (the end of a sequence is treated as a learned random event)
  --ac-prefix=DECOY_ --in=mitoch.fasta --method=markovmodel --markovmodel-level=0 --out=mitoch-markovmodel_0.fasta

  #draw amino acids with a Markov model (here of length 3)
  --ac-prefix=DECOY_ --in=mitoch.fasta --method=markovmodel --markovmodel-level=3 --out=mitoch-markovmodel_3.fasta

  #shuffle each sequence randomly
  --ac-prefix=DECOY_ --in=mitoch.fasta --method=shuffle --out=mitoch-shuffle.fasta

  #same as above, but no tryptic peptide (of length >= 6) from the original bank may appear in the random one
  => see script
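The last example, shuffling while avoiding known tryptic peptides, might be sketched as follows in Python. This is an illustration under stated assumptions, not the script's actual algorithm: it assumes the common trypsin rule (cleave after K or R unless the next residue is P), and the retry strategy, the function names and the max_tries parameter are all hypothetical.

```python
import random
import re

def tryptic_peptides(seq, min_len=6):
    """Split after K or R (unless followed by P); keep peptides >= min_len."""
    pieces = re.split(r"(?<=[KR])(?!P)", seq)
    return {p for p in pieces if len(p) >= min_len}

def shuffle_avoiding(seq, known, rng, max_tries=100, min_len=6):
    """Reshuffle seq until none of its tryptic peptides occurs in `known`."""
    for _ in range(max_tries):
        residues = list(seq)
        rng.shuffle(residues)
        candidate = "".join(residues)
        if not (tryptic_peptides(candidate, min_len) & known):
            return candidate
    raise RuntimeError("no acceptable shuffle found")

# Hypothetical two-entry bank, used only to exercise the functions.
bank = ["MKWVTFISLLLLFSSAYSR", "GVFRRDTHKSEIAHR"]
known = set().union(*(tryptic_peptides(s) for s in bank))
decoy = shuffle_avoiding(bank[0], known, random.Random(0))
print(decoy)
```

Collecting the known peptides of the whole bank first means each decoy is checked against every original entry, not only the one it was derived from.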



An input FASTA file, given with --in (it is uncompressed if its name ends with gz)
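Reading an optionally compressed FASTA file can be sketched in Python as below. The helper names open_maybe_gzip and read_fasta are hypothetical, and the sketch makes no claim about how the script itself parses its input.

```python
import gzip

def open_maybe_gzip(path):
    """Open path as text, decompressing on the fly if the name ends in 'gz'."""
    if path.endswith("gz"):
        return gzip.open(path, "rt")
    return open(path, "r")

def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

A caller would typically write `with open_maybe_gzip(path) as fh:` and iterate over `read_fasta(fh)`.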


The output .fasta file, given with --out [default is stdout]


Set the decoying method with --method (the examples above use reverse, shuffle and markovmodel)



Set a key, given with --ac-prefix, to be prepended to the AC in the randomized bank. By default, the key depends on the chosen method.
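The effect of such a key can be pictured with a small Python sketch. It assumes, purely for illustration, that the AC is the first whitespace-delimited token of the header line; the real header layout and the function name prefix_accession are not taken from the script.

```python
def prefix_accession(header, key="DECOY_"):
    """Prepend key to the first whitespace-delimited token of a header."""
    ac, sep, rest = header.partition(" ")
    return f"{key}{ac}{sep}{rest}"

print(prefix_accession("P12345 hypothetical protein"))
# DECOY_P12345 hypothetical protein
```

A distinct prefix lets downstream tools tell decoy hits from hits in the original bank.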

--method=shuffle options

--method=markovmodel options

--markovmodel-level=int [default 3]

Set the length of the model (0 means only the amino acid distribution is respected, 3 means the distribution of chains of length 3 is respected, etc.). Setting a length >3 can lead to memory exhaustion.
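To make the notion of level concrete, here is a minimal Python sketch of a Markov chain decoy generator. It is an assumption-laden illustration, not the script's implementation: the END marker, the function names and the max_len guard are hypothetical. The end of a sequence is learned as an ordinary symbol, as described for level 0.

```python
import random
from collections import Counter, defaultdict

END = "$"  # hypothetical end-of-sequence marker

def learn_model(sequences, level):
    """Count, for each context of up to `level` residues, what follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        padded = seq + END
        for i, symbol in enumerate(padded):
            context = padded[max(0, i - level):i]
            counts[context][symbol] += 1
    return counts

def generate(counts, level, rng, max_len=10000):
    """Draw one sequence, stopping when the end marker is drawn."""
    out = []
    while len(out) < max_len:
        context = "".join(out[-level:]) if level else ""
        symbols, weights = zip(*counts[context].items())
        symbol = rng.choices(symbols, weights=weights)[0]
        if symbol == END:
            break
        out.append(symbol)
    return "".join(out)

model = learn_model(["MKWVTFISLLK", "MKKLLSA"], level=2)
print(generate(model, 2, random.Random(0)))
```

With 20 standard amino acids, the table can hold up to 20^level contexts, which suggests why high levels become costly in memory.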



The random generator seed is set to 0, so two runs on the same data will produce the same result
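The reproducibility this gives can be shown with a short Python sketch. The function shuffled is hypothetical and uses Python's own generator, so its output will not match the script's; only the principle carries over.

```python
import random

def shuffled(seq, seed):
    """Shuffle seq with a generator freshly seeded with `seed`."""
    rng = random.Random(seed)
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)

first = shuffled("MKWVTFISLLK", 0)
second = shuffled("MKWVTFISLLK", 0)
print(first == second)  # True: same seed, same data, same result
```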


Do not display the terminal progress bar (if possible)




Setting the environment variable DO_NOT_DELETE_TEMP=1 keeps the temporary files after the script exits



Copyright (C) 2004-2006 Geneva Bioinformatics

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA


Alexandre Masselot,
