ncbi_2_gff.pl - Massage NCBI chromosome annotation into GFF-format suitable for Bio::DB::GFF
$RCSfile: process_ncbi_human.pl,v $ $Revision: 1.1 $ $Author: lstein $ $Date: 2008-10-16 17:01:27 $
perl process_ncbi_human.pl [options] /path/to/gzipped/datafile(s)
This script massages the chromosome annotation files located at
into the GFF-format recognized by Bio::DB::GFF. If the resulting GFF-files are loaded into a Bio::DB::GFF database using the utilities described below, the annotation can be viewed in the Generic Genome Browser (http://www.gmod.org/ggb/) and queried through the Bio::DB::GFF libraries. (NB: according to their READMEs, these NCBI datafiles are dumps from NCBI's own mapviewer database backend.)
To produce the GFF-files, download all the chr*sequence.gz files from the FTP-directory above. While in that same directory, run the following example command (run the script with no arguments to see its help text):
process_ncbi_human.pl --locuslink [path to LL.out_hs.gz] chr*sequence.gz
This will unzip all the files on the fly and open an output file named chrom[$chrom]_ncbiannotation.gff for each. The script reads the LocusLink records into an in-memory hash, then reads through the NCBI feature lines, looks up the details of each 'locus' feature in the LocusLink hash, and prints the results to the proper GFF files. LL.out_hs.gz is accessible here at the time of writing:
Note that several of the NCBI features are skipped during reformatting, either because their nature is not fully known at this time (TAG, GS_TRAN) or because their sheer volume stands in the way of them being accessible in Bio::DB::GFF at this time (EST similarities). You can easily change this by modifying the $SKIP variable to add or remove features, but if you enable additional features you will also have to add handling for them.
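The flow described above (load LocusLink into a hash, stream the gzipped feature lines, drop the skipped types, enrich 'locus' features, and sort output by chromosome) can be illustrated with a small Python sketch. This is not the script's own code, and several details are assumptions made only for illustration: the tab-separated column layouts of both inputs, the "EST" label for EST similarities, the "NCBI" source column, the "Chr" sequence-name prefix, and the reading of chrom[$chrom]_ncbiannotation.gff as chrom plus the chromosome name. The real NCBI and LocusLink files have their own layouts, which the script itself handles.

```python
import gzip
import io
from collections import defaultdict

# Feature types to leave out, playing the role of the script's $SKIP variable.
# TAG and GS_TRAN are named in the text; "EST" is an assumed label.
SKIP = {"TAG", "GS_TRAN", "EST"}


def load_locuslink(handle):
    """Build an in-memory lookup table from LocusLink-style lines.

    Assumed (hypothetical) layout: locus_id <TAB> symbol <TAB> description.
    """
    table = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3:
            table[fields[0]] = (fields[1], fields[2])
    return table


def to_gff(line, locuslink):
    """Turn one feature line into (chromosome, GFF line), or None if skipped.

    Assumed (hypothetical) input columns: chrom, start, stop, strand, type, name.
    """
    chrom, start, stop, strand, ftype, name = line.rstrip("\n").split("\t")
    if ftype in SKIP:
        return None
    group = f"{ftype} {name}"
    if ftype == "locus" and name in locuslink:
        # Enrich 'locus' features with details from the LocusLink table.
        symbol, description = locuslink[name]
        group = f'Locus {symbol} ; Note "{description}"'
    # Nine tab-separated GFF columns:
    # seqname, source, feature, start, end, score, strand, frame, group.
    columns = [f"Chr{chrom}", "NCBI", ftype, start, stop, ".", strand, ".", group]
    return chrom, "\t".join(columns)


def convert(handle, locuslink):
    """Group converted lines by the per-chromosome output file name."""
    out = defaultdict(list)
    for line in handle:
        result = to_gff(line, locuslink)
        if result is not None:
            chrom, gff_line = result
            out[f"chrom{chrom}_ncbiannotation.gff"].append(gff_line)
    return dict(out)


def open_gzipped_text(raw_bytes):
    """Decompress gzipped bytes on the fly and return a text-mode handle."""
    return gzip.open(io.BytesIO(raw_bytes), "rt")


if __name__ == "__main__":
    locuslink = load_locuslink(io.StringIO("672\tBRCA1\tbreast cancer 1\n"))
    features = gzip.compress(
        b"17\t100\t200\t+\tlocus\t672\n"
        b"17\t300\t400\t-\tTAG\tx1\n"
    )
    for filename, lines in convert(open_gzipped_text(features), locuslink).items():
        print(filename)
        for gff_line in lines:
            print("  " + gff_line)
```

In this sketch, removing a type from SKIP lets its lines through with a generic group column. That mirrors the caveat above: newly enabled feature types still need handling of their own before the output is useful.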
To bulk-import the GFF-files into a Bio::DB::GFF database, use the bulk_load_gff.pl utility provided with Bio::DB::GFF.
Gudmundur Arni Thorisson <firstname.lastname@example.org>
Copyright (c) 2002 Cold Spring Harbor Laboratory
This code is free software; you can redistribute it and/or modify it under the same terms as Perl itself.