Christopher Fields > BioPerl >



NAME - extract genomic sequences from NCBI files using BioPerl


This script is a simple solution to the problem of extracting genomic regions corresponding to genes. There are other solutions; this particular approach uses genomic sequence files from NCBI and gene coordinates from Entrez Gene.
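The core idea, cutting a gene's region out of a chromosome sequence given its coordinates and strand, can be sketched in plain Perl. This is only an illustration and not the script's own code: the script itself relies on BioPerl, the helper name extract_region is hypothetical, and the coordinates are assumed to be 1-based and inclusive (check which convention your coordinate source uses).

```perl
use strict;
use warnings;

# Hypothetical helper: return the bases from $start to $end of $seq
# (assumed 1-based, inclusive). For strand '-', return the reverse
# complement of that region instead.
sub extract_region {
    my ($seq, $start, $end, $strand) = @_;
    die "bad coordinates\n"
        if $start < 1 || $end > length($seq) || $start > $end;

    my $region = substr($seq, $start - 1, $end - $start + 1);
    if (defined $strand && $strand eq '-') {
        $region = reverse $region;           # scalar context: reverse the string
        $region =~ tr/ACGTacgt/TGCAtgca/;    # complement each base
    }
    return $region;
}

my $chromosome = 'AAACCCGGGTTT';    # toy sequence, not real data
print extract_region($chromosome, 4, 9, '+'), "\n";    # CCCGGG
print extract_region($chromosome, 1, 6, '-'), "\n";    # GGGTTT
```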

The first time this script is run it will be slow, because it extracts species-specific data from the gene2accession file and creates a storable hash (retrieving the positional data from this hash is significantly faster than reading gene2accession each time the script runs). Subsequent runs should be fast.
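The build-once, reuse-afterwards pattern described above might look roughly like the following sketch, which uses the core Storable module. The helper name load_positions, the cache file argument, and the column positions in %COL are assumptions made for illustration; compare the column positions with the header line of your copy of gene2accession before relying on them.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Assumed 0-based column positions in gene2accession; verify them
# against the header line of your copy of the file.
my %COL = (taxon => 0, gene => 1, acc => 7, start => 9, end => 10, strand => 11);

# Hypothetical helper: return a hash ref of
#   gene id => [ accession, start, end, strand ]
# for one taxon, reading $table only if $cache does not exist yet.
sub load_positions {
    my ($table, $cache, $taxon) = @_;

    # Fast path: a cache from an earlier run already exists.
    return retrieve($cache) if -e $cache;

    # Slow path: scan the whole table once, keeping rows for this taxon.
    my %pos;
    open my $fh, '<', $table or die "Cannot open $table: $!\n";
    while (my $line = <$fh>) {
        next if $line =~ /^#/;               # skip comment/header lines
        chomp $line;
        my @f = split /\t/, $line;
        next unless @f > $COL{strand};       # skip short rows
        next unless $f[ $COL{taxon} ] eq $taxon;
        next if $f[ $COL{start} ] eq '-';    # assumed marker for "no position"
        # If a gene has several rows, the last one wins in this sketch.
        $pos{ $f[ $COL{gene} ] } = [ @f[ @COL{qw(acc start end strand)} ] ];
    }
    close $fh;

    store(\%pos, $cache);    # save for the next run
    return \%pos;
}
```

Because the cache holds data for a single taxon, a call such as load_positions('gene2accession', 'Sc.storable', $taxon_id) should use a separate cache file name for each species directory.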


Install BioPerl, full instructions at

Download gene2accession.gz

Download this file from into your working directory and gunzip it.

Download sequence files

Create one or more species directories in the working directory. The directory names do not have to match those at NCBI (e.g. "Sc", "Hs").

Download the nucleotide FASTA files for a given species from its CHR* directories at and put these files into a species directory. The sequence files will have the suffix ".fna" or ".fa.gz"; gunzip them if necessary.

Determine Taxon id

Determine the taxon id for the given species. This id is the first column in the gene2accession file. Modify the %species hash in this script so that the name of your species directory is a key and the taxon id is the value.
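As a rough sketch, the %species hash could look like the one below. The directory names and ids are illustrative only; confirm each id against the first column of your gene2accession file, since NCBI may record genes under a strain-level id rather than the species-level one. The helper taxon_for_gene is hypothetical and not part of the script; it assumes the file is tab-delimited with the gene id in the second column.

```perl
use strict;
use warnings;

# Species directory name => taxon id (illustrative values).
my %species = (
    Hs => 9606,    # Homo sapiens
    Sc => 4932,    # Saccharomyces cerevisiae, species-level id; verify for your data
);

# Hypothetical helper: report the taxon id (column 1) recorded for a
# gene id (assumed to be column 2) in a gene2accession-style table.
sub taxon_for_gene {
    my ($table, $gene_id) = @_;
    open my $fh, '<', $table or die "Cannot open $table: $!\n";
    while (my $line = <$fh>) {
        next if $line =~ /^#/;    # skip comment/header lines
        chomp $line;
        my ($taxon, $gene) = split /\t/, $line;
        if (defined $gene && $gene eq $gene_id) {
            close $fh;
            return $taxon;
        }
    }
    close $fh;
    return;    # gene id not found
}
```

Calling taxon_for_gene('gene2accession', 850302) on the unzipped file would then return the id to store under the matching directory name.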

Command-line options

  -i   Gene id
  -s   Name of species directory
  -h   Help

Example: -i 850302 -s Sc
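For illustration, options like these can be parsed with the core Getopt::Std module. This is a minimal sketch and not necessarily how the script does it; parse_options is a hypothetical helper name.

```perl
use strict;
use warnings;
use Getopt::Std;

# Hypothetical helper: parse -i <gene id>, -s <species dir> and -h from
# a list of arguments. Returns a hash ref, or nothing if an unknown
# option was given.
sub parse_options {
    my @args = @_;
    local @ARGV = @args;    # getopts reads from @ARGV
    my %opt;
    getopts('i:s:h', \%opt) or return;
    return \%opt;
}

# Demonstration with the argument values from the example above.
my $opt = parse_options(qw(-i 850302 -s Sc));
if ($opt->{h} || !defined $opt->{i} || !defined $opt->{s}) {
    print "Options: -i <gene id> -s <species directory> [-h]\n";
}
else {
    print "gene $opt->{i}, species directory $opt->{s}\n";
}
```

In a real script the call would be parse_options(@ARGV), and the -s value would be checked against the keys of the %species hash before any files are opened.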