Roger A Hall > NCBIx-BigFetch-0.56 > NCBIx::BigFetch



Module Version: 0.5.6


NCBIx::BigFetch - Robustly retrieve very large NCBI sequence result sets based on keyword searches using NCBI eUtils.


  use NCBIx::BigFetch;
  # Parameters
  my $params = { project_id => "1", 
                 base_dir   => "/home/user/data", 
                 db         => "protein",
                 query      => "apoptosis",
                 return_max => "500" };
  # Start project
  my $project = NCBIx::BigFetch->new( $params );
  # Love the one you're with
  print " AUTHORS: " . $project->authors() . "\n";
  # Attempt all batches of sequences
  while ( $project->results_waiting() ) { $project->get_next_batch(); }
  # Get missing batches 
  while ( $project->missing_batches() ) { $project->get_missing_batch(); }
  # Find unavailable ids
  my $ids = $project->unavailable_ids();
  # Retrieve unavailable ids
  foreach my $id ( @$ids ) { $project->get_sequence( $id ); }


NCBIx::BigFetch is useful for downloading very large result sets of sequences from NCBI given a text query. Its first use had over 11,000,000 sequences as the result of a single keyword search. It uses YAML to create a configuration file that maintains project state, so that if network or server issues interrupt execution, the download may be easily restarted after the last batch.

Downloaded data is organized by "project id" and "base directory" and saved in text files. Each file includes the project id in its name. The project_id and base_dir keys are the only required keys, although you will get the same search for "apoptosis" every time unless you also set the "query" key. In any case, once a project is started, it only needs those two parameters to be reloaded.

Besides the data files, two other files are saved: 1) the initial search result, which includes the WebEnv key, and 2) a configuration file, which saves the parsed data and is used to pick up the download and recover missing batches or sequences.

Results are retrieved in batches whose size is set by the "return_max" key. By default, the "index" starts at 1 and downloads continue until the index exceeds "count".
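As a rough illustration of the batching described above, the following sketch lists the starting index of each batch. It assumes the index advances by exactly "return_max" per batch, which this document implies but does not state; batch_starts() is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: return the starting index of every batch, assuming
# the index begins at 1 and advances by return_max until it exceeds count.
sub batch_starts {
    my ( $count, $return_max ) = @_;
    die "return_max must be positive\n" if $return_max < 1;
    my @starts;
    for ( my $index = 1; $index <= $count; $index += $return_max ) {
        push @starts, $index;
    }
    return @starts;
}

# A result set of 1,200 sequences fetched 500 at a time needs three batches.
my @starts = batch_starts( 1_200, 500 );
print join( ", ", @starts ), "\n";    # prints "1, 501, 1001"
```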

Occasionally errors happen and entire batches are not downloaded. In this case, the "index" of the failed batch is added to the "missing" list. This list is saved in the configuration file. The missing batches should be downloaded every day rather than left until the end of the complete run.
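The bookkeeping behind the "missing" list can be sketched as follows. This is a simplified model, not the module's implementation: fetch_batch() is a hypothetical stand-in for a network download that fails once for batch 501, and the list is held in memory rather than written to the YAML configuration file.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a download: returns true on success.
# Batch 501 fails on its first attempt only, to simulate a transient error.
my %attempts;
sub fetch_batch {
    my ($index) = @_;
    $attempts{$index}++;
    return !( $index == 501 && $attempts{$index} == 1 );
}

# First pass: try every batch once and collect the indices that failed.
sub first_pass {
    my ( $fetch, @starts ) = @_;
    my @missing = grep { !$fetch->($_) } @starts;
    return @missing;
}

# Retry pass: try each missing batch again, keeping only those still failing.
sub retry_missing {
    my ( $fetch, @missing ) = @_;
    my @still_missing = grep { !$fetch->($_) } @missing;
    return @still_missing;
}

my @missing = first_pass( \&fetch_batch, 1, 501, 1001 );
print "missing after first pass: @missing\n";
@missing = retry_missing( \&fetch_batch, @missing );
print "missing after retry: ", scalar(@missing), "\n";
```

Retrying the list separately from the main loop mirrors the module's split between get_next_batch() and get_missing_batch() shown in the synopsis.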

Working scripts are included in the script directory: fetch-all.pp, fetch-missing.pp, and fetch-unavailable.pp.


The recommended workflow is:

        1. Copy the scripts and edit them for a specific project. Use 
           a new number as the project ID. 

        2. Begin downloading by running fetch-all.pp, which will first 
           submit a query and save the resulting WebEnv key in a project 
           specific configuration file (using YAML).

        3. The next morning, kill the fetch-all.pp process and run 
           fetch-missing.pp until it completes.  

        4. Restart fetch-all.pp.  

If you wish to re-download "not available" sequences, you may run fetch-unavailable.pp. However, they will be downloaded at the end of fetch-all.pp if it completes normally.

If your query result set is so large that your WebEnv times out, simply start a new project with the last index of the previous project, and it will pick up the result set from there (with a new WebEnv). (A planned upgrade will automagically start another search.)

Warning: You may lose a (very) few sequences if your download extends across multiple projects. However, our testing shows that the batches generated with the same query within a few days of each other are largely identical.


These are the primary methods that implement the highest-level abilities of the module. They are the ones used in the included scripts.


These methods are not meant to be used in a standalone fashion, but if they were, it would look like this.


All of the properties have get_/set_ methods courtesy of Class::Std and the :ATTR feature.

These properties have defaults, but each may be overridden by passing it as a key in a hashref to new(). (See the variable $params in the SYNOPSIS above.)
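The general pattern of overriding defaults with a hashref can be sketched in plain Perl. This is an illustration only and does not reproduce how the module or Class::Std handles attributes. The "apoptosis" query default is taken from this document; the other default values and the merge_params() helper are assumptions made for the example.

```perl
use strict;
use warnings;

# Illustrative defaults. Only the "apoptosis" query default is stated in this
# document; the db and return_max values below are assumptions.
my %defaults = (
    db         => "protein",
    query      => "apoptosis",
    return_max => "500",
);

# Hypothetical helper: merge caller-supplied keys over the defaults.
# In a list of key/value pairs, later pairs win, so params take precedence.
sub merge_params {
    my ($params) = @_;
    my %merged = ( %defaults, %{ $params || {} } );
    return \%merged;
}

my $merged = merge_params(
    { project_id => "2", base_dir => "/home/user/data", query => "kinase" } );
print "$_ => $merged->{$_}\n" for sort keys %$merged;
```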


These properties are set by the code.





Feel free to email the authors with questions or concerns. Please be patient for a reply.


Copyleft (C) 2009 by the Authors

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.5 or, at your option, any later version of Perl 5 you may have available.
