dbfetch - generic CGI program to retrieve biological database entries in various formats and styles (using SRS)
  # URL examples:

  # prints the interactive page with the HTML form
  http://www.ebi.ac.uk/cgi-bin/dbfetch

  # for backward compatibility, implements <ISINDEX>
  # single entry queries defaulting to EMBL sequence database
  http://www.ebi.ac.uk/cgi-bin/dbfetch?J00231

  # retrieves one or more entries in default format
  # and default style (html)
  # returns nothing for IDs which are not valid
  http://www.ebi.ac.uk/cgi-bin/dbfetch?id=J00231.1,hsfos,bum

  # retrieve entries in fasta format without html tags
  http://www.ebi.ac.uk/cgi-bin/dbfetch?format=fasta&style=raw&id=J00231,hsfos,bum

  # retrieve a raw Ensembl entry
  http://www.ebi.ac.uk/cgi-bin/dbfetch?db=ensembl&style=raw&id=AL122059
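The example URLs above can also be assembled programmatically. The following is a minimal client-side sketch, not part of dbfetch itself: the helper names dbfetch_url and escape are hypothetical, and only the parameters that appear in the examples (db, format, style, id) are assumed to exist.

```perl
use strict;
use warnings;

# Base URL taken from the examples above.
my $BASE = 'http://www.ebi.ac.uk/cgi-bin/dbfetch';

# Hypothetical helper: percent-encode everything except
# unreserved URI characters (letters, digits, '.', '_', '~', '-').
sub escape {
    my ($s) = @_;
    $s =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord $1)/ge;
    return $s;
}

# Hypothetical helper: build a dbfetch URL from named parameters.
# 'id' may be a single string or an array reference of identifiers.
sub dbfetch_url {
    my (%arg) = @_;
    my @ids = ref($arg{id}) eq 'ARRAY' ? @{ $arg{id} } : ($arg{id});
    my @pairs;
    for my $key (qw(db format style)) {
        push @pairs, "$key=" . escape($arg{$key}) if defined $arg{$key};
    }
    # Identifiers are joined with commas, as in the examples.
    push @pairs, 'id=' . join(',', map { escape($_) } @ids);
    return $BASE . '?' . join('&', @pairs);
}

print dbfetch_url(format => 'fasta', style => 'raw',
                  id => [qw(J00231 hsfos bum)]), "\n";
```

Omitting db, format or style leaves the corresponding server-side default in effect.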
This program generates a page allowing a web user to retrieve database entries from a local SRS in two styles: html and raw. Other database engines can be used to implement the same interface.
At this stage, only unique identifier queries are supported. Free text searches returning more than one entry per query term are not in these specs.
In its default setup, type one or more EMBL accession numbers (e.g. J00231), entry names (e.g. BUM) or sequence versions into the search dialog to retrieve hypertext-linked entries.
Note that for practical reasons only the first 50 identifiers submitted are processed.
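The 50-identifier limit can be pictured as a simple truncation of the submitted list. This is only a sketch: the helper name first_ids and the choice of separators (whitespace and commas) are assumptions, not taken from the dbfetch source.

```perl
use strict;
use warnings;

my $MAX_IDS = 50;    # limit stated in the text above

# Hypothetical helper: split the input on whitespace or commas
# and keep only the first $MAX_IDS identifiers.
sub first_ids {
    my ($input) = @_;
    my @ids = grep { length } split /[\s,]+/, $input;
    splice @ids, $MAX_IDS if @ids > $MAX_IDS;   # drop everything past the limit
    return @ids;
}

my @kept = first_ids(join ',', map { "ID$_" } 1 .. 60);
print scalar(@kept), " identifiers kept\n";
```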
Additional input is needed to change the sequence format or suppress the HTML tags. The styles are html and raw. In future there might be additional styles (e.g. xml). Currently XML is a 'raw' format used by Medline. Each style is implemented as a separate subroutine.
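Because each style is a separate subroutine, one natural way to select it is a dispatch table. The sketch below illustrates that idea only; the handler bodies are placeholders, and the names raw_style, html_style, %STYLE and render are hypothetical rather than the ones used in dbfetch.

```perl
use strict;
use warnings;

# Placeholder handlers; the real subroutines would query SRS.
sub raw_style  { my ($id, $format) = @_; return "raw:$format:$id" }
sub html_style { my ($id, $format) = @_; return "html:$format:$id" }

# Style name => subroutine reference. Adding a style (e.g. xml)
# would mean adding one subroutine and one entry here.
my %STYLE = (
    raw  => \&raw_style,
    html => \&html_style,
);

# Look up the handler for the requested style and call it.
sub render {
    my ($style, $id, $format) = @_;
    my $handler = $STYLE{$style} or die "Unknown style '$style'\n";
    return $handler->($id, $format);
}

print render('raw', 'J00231', 'fasta'), "\n";
```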
A new database can be added simply by adding a new entry to the global hash %IDS. Additionally, if the database defines new formats, add an entry for each of them to the hash %IDMATCH. After modifying the hashes, run this script from the command line with the parameter debug set to true (e.g. dbfetch debug=1) for some sanity checks.
Finally, the user interface needs to be updated in the print_prompt subroutine.
Version 3 uses EBI SRS server 6.1.3. That server is able to merge release and update libraries automatically which makes this script simpler. The other significant change is the way sequence versions are indexed. They used to be indexed together with the string accession (e.g. 'J00231.1'). Now they are indexed as integers (e.g. '1').
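Since versions are now indexed as integers, an identifier such as 'J00231.1' has to be separated into its accession and its numeric version before querying. A minimal sketch of that split follows; the helper name split_version is hypothetical and the actual SRS query is not shown.

```perl
use strict;
use warnings;

# Hypothetical helper: split 'ACCESSION.VERSION' into its two parts.
# Returns (accession, integer version), or (id, undef) when the
# identifier carries no version suffix.
sub split_version {
    my ($id) = @_;
    if ($id =~ /^([^.]+)\.(\d+)$/) {
        return ($1, $2 + 0);    # "+ 0" forces a numeric value
    }
    return ($id, undef);
}

my ($acc, $ver) = split_version('J00231.1');
print "accession=$acc version=$ver\n";
```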
Version 3.1 changes the command line interface. To get the debug information, set the attribute 'debug' to true. It also uses the File::Temp module to create temporary files securely.
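For readers unfamiliar with File::Temp, the sketch below shows the basic pattern of creating a temporary file safely. What dbfetch actually writes to its temporary files is not described here, so the helper write_temp and its content are purely illustrative.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write a string to a fresh temporary file.
# tempfile() opens a uniquely named file and returns the handle and
# the file name; UNLINK => 1 removes the file when the program exits.
sub write_temp {
    my ($content) = @_;
    my ($fh, $filename) = tempfile(UNLINK => 1);
    print {$fh} $content;
    close $fh or die "Cannot close $filename: $!";
    return $filename;
}

my $file = write_temp("J00231\n");
print "wrote $file\n";
```

Letting the module choose the file name avoids the race conditions of hand-built names such as "/tmp/dbfetch.$$".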
Version 3.2 fixes fasta format parsing to get the entry id.
Version 3.3 adds RefSeq to the database list.
Version 3.4 makes the script compliant with the BioFetch specs.
 Title   : print_prompt
 Usage   :
 Function: Prints the default page with the query form
           to STDOUT (Web page)
 Args    :
 Returns :
 Title   : protect
 Usage   : $value = protect($q->param('id'));
 Function: Removes potentially dangerous characters from the
           input string. At the same time, converts word
           separators into a single space character.
 Args    : scalar, string with one or more IDs or accession numbers
 Returns : scalar
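The behaviour described for protect can be approximated as below. This is a sketch under stated assumptions, not the dbfetch implementation: the name protect_sketch, the set of separators (whitespace, commas, semicolons) and the whitelist of allowed characters are all guesses made for illustration.

```perl
use strict;
use warnings;

# Hypothetical approximation of protect(): the character sets
# below are assumptions, not copied from the dbfetch source.
sub protect_sketch {
    my ($s) = @_;
    return '' unless defined $s;
    $s =~ s/[\s,;]+/ /g;           # any run of separators -> one space
    $s =~ s/[^A-Za-z0-9_. -]//g;   # drop everything outside the whitelist
    $s =~ s/ {2,}/ /g;             # collapse spaces left by removals
    $s =~ s/^ | $//g;              # trim leading and trailing space
    return $s;
}

print protect_sketch("J00231,\thsfos;  bum|`"), "\n";
```

A whitelist is used rather than a blacklist so that unexpected characters are rejected by default.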
 Title   : input_error
 Usage   : input_error($q, 'html', "Error message");
 Function: Standard error message behaviour
 Args    : reference to the CGI object
           scalar, string to display on input error.
 Returns : scalar
 Title   : no_entries
 Usage   : no_entries($q, "Message");
 Function: Standard behaviour when no entries found
 Args    : reference to the CGI object
           scalar, string to display on input error.
 Returns : scalar
 Title   : raw
 Usage   :
 Function: Retrieves a single database entry in plain text
 Args    : scalar, an ID
           scalar, format
 Returns : scalar
 Title   : html
 Usage   :
 Function: Retrieves a single database entry with HTML
           hypertext links in place. Limits retrieved entries to
           ones with the correct version if the string has '.' in it.
 Args    : scalar, a UID
           scalar, format
 Returns : scalar
 Title   : xml
 Usage   :
 Function: Retrieves an entry formatted as XML
 Args    : array, UID
           scalar, format
 Returns : scalar
 Title   : debugging
 Usage   : 'perl dbfetch'
 Function: Performs sanity checks on global hash %IDS when this
           script is run from command line. %IDS holds the
           description of formats and other crucial info for each
           database accessible through the program.
           Note that hash key 'version' is not tested as it should
           only be in sequence databases.
 Args    : none
 Returns : error messages to STDOUT
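The kind of check the debugging routine performs might look like the sketch below. The layout of %IDS is not shown in this document, so the keys default_format and formats, the sample entries and the helper name check_ids are all hypothetical and serve only to illustrate a per-database sanity check.

```perl
use strict;
use warnings;

# Hypothetical layout; the real keys of %IDS are not documented here.
my %IDS = (
    embl   => { default_format => 'embl', formats => [qw(embl fasta)] },
    broken => { formats => [] },    # deliberately incomplete entry
);

# Hypothetical check: every database needs a non-empty list of
# formats and a default format. Returns a list of error messages.
sub check_ids {
    my ($ids) = @_;
    my @errors;
    for my $db (sort keys %$ids) {
        my $entry = $ids->{$db};
        push @errors, "$db: no formats listed"
            unless ref($entry->{formats}) eq 'ARRAY' && @{ $entry->{formats} };
        push @errors, "$db: missing default_format"
            unless defined $entry->{default_format};
    }
    return @errors;
}

# Error messages go to STDOUT, as described above.
print "$_\n" for check_ids(\%IDS);
```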