     TAGS USED IN BOULDER REPRESENTATION OF GENBANK NUCLEOTIDE RECORDS
			    August 3, 1998
			   Lincoln D. Stein

Last modified: November 12, 1998

INTRODUCTION
------------

The boulder format is used by the Boulder::Genbank module, and by the
gb_search and gb_fetch programs, to retrieve and parse Genbank entries
from NCBI or from local files.  This document describes the tags that
are returned in the boulder stream from Boulder::Genbank.

DEFINED TAGS
------------

    The tags returned by the parsing operation are taken from the NCBI ASN.1
    schema. For consistency, they are normalized so that the initial letter
    is capitalized, and all subsequent letters are lowercase. This section
    contains an abbreviated list of the most useful/common tags. See "The
    NCBI Data Model", by James Ostell and Jonathan Kans in "Bioinformatics:
    A Practical Guide to the Analysis of Genes and Proteins" (Eds. A.
    Baxevanis and F. Ouellette), pp 121-144 for the full listing.
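
    The capitalization rule above can be sketched in plain Perl. The
    helper name normalize_tag is hypothetical and is not part of
    Boulder::Genbank; it only illustrates how raw names such as
    "polyA_site" or "mRNA" end up as the tags listed in this document.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Boulder::Genbank): upper-case the
# first letter of a tag name and lower-case everything after it.
sub normalize_tag {
    my ($raw) = @_;
    return ucfirst(lc($raw));
}

# A few raw names and the tags they would be normalized to.
print normalize_tag($_), "\n" for qw(ACCESSION db_xref polyA_site mRNA);
```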

  Top-Level Tags
  --------------

    These are tags that appear at the top level of the parsed Genbank entry.

    Accession
        The accession number of this entry. Because of the vagaries of the
        Genbank data model, an entry may have multiple accession numbers
        (e.g. after a merging operation). Accession may therefore be a
        multi-valued tag.

        Example: my $accessionNo = $s->Accession;

    Authors
        The list of authors, as they appear on the AUTHORS line of the
        Genbank record. No attempt is made to parse them into individual
        authors.

    Basecount
        The nucleotide basecount for the entry. It is presented as a Boulder
        Stone with keys "a", "c", "t" and "g". Example:

             my $A = $s->Basecount->a;
             my $C = $s->Basecount->c;
             my $G = $s->Basecount->g;
             my $T = $s->Basecount->t;
             print "GC content is ",($G+$C)/($A+$C+$G+$T),"\n";

    Comment
        The COMMENT line from the Genbank record.

    Definition
        The DEFINITION line from the Genbank record, unmodified.

    Features
        The FEATURES table. This is a complex stone object with multiple
        subtags. See the section on "The Features Tag" for details.

    Journal
        The JOURNAL line from the Genbank record, unmodified.

    Keywords
        The KEYWORDS line from the Genbank record, unmodified. No attempt is
        made to parse the keywords into separate values.

        Example:

            my $keywords = $s->Keywords;

    Locus
        The LOCUS line from the Genbank record. It is not further parsed.

    Medline, Nid
        References to other database accession numbers.

    Organism
        The taxonomic name of the organism from which this entry was
        derived. This line is taken from the Genbank entry unmodified. See
        the NCBI data model documentation for an explanation of their
        taxonomic syntax.

    Reference
        The REFERENCE line from the Genbank entry. There are often multiple
        Reference lines. Example:

          my @references = $s->Reference;

    Sequence
        The DNA or RNA sequence of the entry. This is presented as a single
        lower-case string, with all base numbers and formatting characters
        removed.

    Source
        The entry's SOURCE field, which often gives clues on how the
        sequencing was performed.

    Title
        The TITLE field from the paper describing this entry, if any.
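
    The cleanup described for the Sequence tag can be pictured with a
    small stand-alone sketch. The helper flatten_sequence is
    hypothetical, and the input is the start of the sequence from the
    example record in this document, laid out by hand in flat-file style
    for illustration; the module's own parsing code may work
    differently.

```perl
use strict;
use warnings;

# Hypothetical helper: collapse numbered, space-separated sequence
# lines into one lower-case string.
sub flatten_sequence {
    my ($block) = @_;
    my $seq = lc $block;
    $seq =~ s/[^a-z]//g;    # drop base numbers, spaces and newlines
    return $seq;
}

my $block = <<'END';
        1 aagcttttcc aggcagtgcg agatagagga gcgcttgaga aggcaggttt tgcagcagac
       61 ggcagtgaca gcccag
END

my $sequence = flatten_sequence($block);
print length($sequence), " bases: $sequence\n";
```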

  The Features Tag
  ----------------

    The Features tag points to a Stone record that contains multiple
    subtags. Each subtag is the name of a feature which points, in turn, to
    a Stone that describes the feature's location and other attributes. The
    full list of features is beyond the scope of this document, but the
    following are the features most often seen:

            Cds             a CDS
            Intron          an intron
            Exon            an exon
            Gene            a gene
            Mrna            an mRNA
            Polya_site      a polyadenylation site
            Repeat_unit     a repetitive region
            Source          More information about the organism and cell
                            type the sequence was derived from
            Satellite       a microsatellite (dinucleotide repeat)

    Each feature will contain one or more of the following subtags:

    Db_xref
        A cross-reference to another database in the form
        DB_NAME:accession_number. See the NCBI Web site for a description of
        these cross references.

    Evidence
        The evidence for this feature, either "experimental" or "predicted".

    Gene
        If the feature involves a gene, this will be the gene's name (or one
        of its names). This subtag is often seen in Gene and Cds features.

        Example:

                foreach ($s->Features->Cds) {
                   my $gene = $_->Gene;
                   my $position = $_->Position;
                   print "Gene $gene ($position)\n";
                }

    Map
        If the feature is mapped, this provides a map position, usually as a
        cytogenetic band.

    Note
        A grab-bag for various text notes.

    Number
        When multiple features of this type occur, this field is used to
        number them. Ordinarily this field is not needed because
        Boulder::Genbank preserves the order of features.

    Organism
        If the feature is Source, this provides the source organism.

    Position
        The position of this feature, usually expressed as a range
        (1970..1975).

    Product
        The protein product of the feature, if applicable, as a text string.

    Translation
        The protein translation of the feature, if applicable.
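
    Position values are plain strings, so any arithmetic on them is left
    to the caller. The sketch below handles only the two simplest forms,
    a single base ("1989") and a range ("1970..1975"). The helper
    simple_range is hypothetical; compound locations such as join(...)
    or order(...), and fuzzy ends such as (1989.1998), are deliberately
    reported as unhandled rather than guessed at.

```perl
use strict;
use warnings;

# Hypothetical helper: return (start, end) for "N" or "N..M", and an
# empty list for any more complex location string.
sub simple_range {
    my ($position) = @_;
    return unless defined $position;
    $position =~ s/^\s+|\s+$//g;    # values may carry trailing blanks
    return ($1, $1) if $position =~ /^(\d+)$/;
    return ($1, $2) if $position =~ /^(\d+)\.\.(\d+)$/;
    return;
}

for my $position ('1970..1975', '1989 ', '1182..(1989.1998)') {
    if (my ($start, $end) = simple_range($position)) {
        printf "%-18s length %d\n", $position, $end - $start + 1;
    }
    else {
        printf "%-18s needs a fuller location parser\n", $position;
    }
}
```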

EXAMPLE GENBANK OBJECT
----------------------

The following is an excerpt from a moderately complex Genbank Stone. The
Sequence line and several other long lines have been truncated for
readability.

     Authors=Spritz,R.A., Strunk,K., Surowy,C.S.O., Hoch,S., Barton,D.E. and Francke,U.
     Authors=Spritz,R.A., Strunk,K., Surowy,C.S. and Mohrenweiser,H.W.
     Locus=HUMRNP7011   2155 bp    DNA             PRI       03-JUL-1991
     Accession=M57939
     Accession=J04772
     Accession=M57733
     Keywords=ribonucleoprotein antigen.
     Sequence=aagcttttccaggcagtgcgagatagaggagcgcttgagaaggcaggttttgcagcagacggcagtgacagcccag...
     Definition=Human small nuclear ribonucleoprotein (U1-70K) gene, exon 10 and 11.
     Journal=Nucleic Acids Res. 15, 10373-10391 (1987)
     Journal=Genomics 8, 371-379 (1990)
     Nid=g337441
     Medline=88096573
     Medline=91065657
     Features={
       Polya_site={
         Evidence=experimental
         Position=1989 
         Gene=U1-70K
       }
       Polya_site={
         Position=1990 
         Gene=U1-70K
       }
       Polya_site={
         Evidence=experimental
         Position=1992 
         Gene=U1-70K
       }
       Polya_site={
         Evidence=experimental
         Position=1998 
         Gene=U1-70K
       }
       Source={
         Organism=Homo sapiens
         Db_xref=taxon:9606
         Position=1..2155 
         Map=19q13.3
       }
       Cds={
         Codon_start=1 
         Product=ribonucleoprotein antigen
         Db_xref=PID:g337445
         Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
         Gene=U1-70K
         Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPRDAPPPTR...
       }
       Cds={
         Codon_start=1 
         Product=ribonucleoprotein antigen
         Db_xref=PID:g337444
         Evidence=experimental 
         Position=join(M57929:329..475,M57930:183..245,M57930:358..412, ...
         Gene=U1-70K
         Translation=MTQFLPPNLLALFAPRDPIPYLPPLEKLPHEKHHNQPYCGIAPYIREFEDPR...
       }
       Polya_signal={
         Position=1970..1975 
         Note=putative
         Gene=U1-70K
       }
       Intron={
         Evidence=experimental
         Position=1100..1208 
         Gene=U1-70K
       }
       Intron={
         Number=10 
         Evidence=experimental
         Position=1100..1181 
         Gene=U1-70K
       }
       Intron={
         Number=9 
         Evidence=experimental
         Position=order(M57937:702..921,1..1011) 
         Note=2.1 kb gap
         Gene=U1-70K
       }
       Intron={
         Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1208) 
         Gene=U1-70K
       }
       Intron={
         Evidence=experimental
         Position=order(M57935:284..406,M57936:1..284,M57937:1..599, <1..>1208) 
         Note=first gap-0.14 kb, second gap-0.62 kb
         Gene=U1-70K
       }
       Intron={
         Number=8 
         Evidence=experimental
         Position=order(M57935:272..406,M57936:1..284,M57937:1..599, <1..>1181) 
         Note=first gap-0.14 kb, second gap-0.62 kb
         Gene=U1-70K
       }
       Exon={
         Number=10 
         Evidence=experimental
         Position=1012..1099 
         Gene=U1-70K
       }
       Exon={
         Number=11 
         Evidence=experimental
         Position=1182..(1989.1998) 
         Gene=U1-70K
       }
       Exon={
         Evidence=experimental
         Position=1209..(1989.1998) 
         Gene=U1-70K
       }
       Mrna={
         Product=ribonucleoprotein antigen
         Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
         Gene=U1-70K
       }
       Mrna={
         Product=ribonucleoprotein antigen
         Citation=[2] 
         Evidence=experimental 
         Position=join(M57928:358..668,M57929:319..475,M57930:183..245, ...
         Gene=U1-70K
       }
       Gene={
         Position=join(M57928:207..719,M57929:1..562,M57930:1..577, ...
         Gene=U1-70K
       }
     }
     Reference=1  (sites)
     Reference=2  (bases 1 to 2155)
     =
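
The nesting in the excerpt above can be made concrete with a small
reader written in plain Perl. This is only a sketch of the layout as it
appears in this document, not the parser used by Boulder::Genbank: the
function parse_record is hypothetical, it treats the lone "=" line as
the end of the record, it stores every tag as a list because tags may
repeat, and it makes no attempt to handle escaped characters.

```perl
use strict;
use warnings;

# Hypothetical reader for the text layout shown above. Every tag maps
# to an array ref of values; a nested "Tag={ ... }" block becomes a
# hash ref of the same shape.
sub parse_record {
    my ($lines) = @_;               # array ref, consumed as we go
    my %record;
    while (defined(my $line = shift @$lines)) {
        $line =~ s/^\s+|\s+$//g;    # drop indentation and trailing blanks
        next if $line eq '';
        last if $line eq '}' || $line eq '=';   # end of block or record
        my ($tag, $value) = split /=/, $line, 2;
        next unless defined $value;
        $value = parse_record($lines) if $value eq '{';
        push @{ $record{$tag} }, $value;
    }
    return \%record;
}

# A shortened copy of the excerpt above.
my @lines = split /\n/, <<'END';
Accession=M57939
Accession=J04772
Features={
  Polya_site={
    Evidence=experimental
    Position=1989
    Gene=U1-70K
  }
  Polya_site={
    Position=1990
    Gene=U1-70K
  }
}
Reference=1  (sites)
=
END

my $record = parse_record(\@lines);
print "Accessions: @{ $record->{Accession} }\n";
for my $site (@{ $record->{Features}[0]{Polya_site} }) {
    my $evidence = $site->{Evidence} ? $site->{Evidence}[0] : 'none given';
    print "Polya_site at $site->{Position}[0] (evidence: $evidence)\n";
}
```

Storing each tag as a list keeps repeated tags such as Accession and
Polya_site in their original order, which mirrors the multi-valued
behaviour described for the real Stone objects.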