The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Bio::Homology::InterologWalk - Retrieve, score and visualize putative Protein-Protein Interactions through the orthology-walk method

VERSION

This document describes version 0.02 of Bio::Homology::InterologWalk released August 31st, 2010

SYNOPSIS

  use Bio::Homology::InterologWalk;

First, obtain Intact Interactions for the dataset (see example in getDirectInteractions.pl):

  #get a registry from Ensembl
  my $registry = Bio::Homology::InterologWalk::setup_ensembl_adaptor(
                                                     connect_to_db  => $ensembl_db,
                                                     source_org     => $sourceorg,
                                                     verbose        => 1
                                                     );
  
  
  #query direct interactions
  $RC = Bio::Homology::InterologWalk::Direct::get_direct_interactions(
                                                      registry         => $registry,
                                                      source_org       => $sourceorg,
                                                      input_path       => $in_path,
                                                      output_path      => $out_path,
                                                      url              => $url,
                                                      );

do some postprocessing (see "do_counts" and "extract_unseen_ids" ) and then run the actual interolog walk on the dataset with the following sequence of three methods.

get orthologues of starting set:

  $RC = Bio::Homology::InterologWalk::get_forward_orthologies(
                                              registry        => $registry,
                                              ensembl_db      => $ensembl_db,
                                              input_path      => $in_path,
                                              output_path     => $out_path,
                                              source_org      => $sourceorg,
                                              dest_org        => $destorg,
                                              );

add interactors of orthologues found by get_forward_orthologies():

  $RC = Bio::Homology::InterologWalk::get_interactions(
                                       input_path    => $in_path,
                                       output_path   => $out_path,
                                       url           => $url,
                                       );

add orthologues of interactors found by get_interactions():

  $RC = Bio::Homology::InterologWalk::get_backward_orthologies(
                                               registry    => $registry,
                                               ensembl_db  => $ensembl_db,
                                               input_path  => $in_path,
                                               output_path => $out_path,
                                               error_path  => $err_path,
                                               source_org  => $sourceorg,  
                                               );

do some postprocessing (see "remove_duplicate_rows", "do_counts", "extract_unseen_ids") and then optionally compute a composite score for the putative interactions obtained:

  $RC = Bio::Homology::InterologWalk::Scores::compute_scores(
                                             input_path        => $in_path,
                                             score_path        => $score_path,
                                             output_path       => $out_path,
                                             term_graph        => $onto_graph,
                                             meanscore_it      => $m_it,
                                             meanscore_dm      => $m_dm,
                                             meanscore_me_dm   => $m_mdm,
                                             meanscore_me_taxa => $m_mtaxa
                                             );

get some networks and network attributes which you can then visualise with cytoscape

   $RC = Bio::Homology::InterologWalk::Networks::do_network(
                                            registry        => $registry,
                                            input_path      => $in_path,
                                            output_path     => $out_path,
                                            source_org      => $sourceorg
                                            );
                                               
   $RC = Bio::Homology::InterologWalk::Networks::do_attributes(
                                               registry      => $registry,
                                               input_path    => $in_path,
                                               output_path   => $out_path,
                                               source_org    => $sourceorg,
                                               label_type    => 'external name'
                                               );

The synopsis above only lists the major methods and parameters.

DESCRIPTION

A common activity in computational biology is to mine protein-protein interactions from publicly available databases to build Protein-Protein Interaction (PPI) datasets. In many instances, however, the number of experimentally obtained annotated PPIs is very scarce and it would be helpful to enrich the experimental dataset with high-quality, computationally-inferred PPIs. Such computationally-obtained dataset can extend, support or enrich experimental PPI datasets, and are of crucial importance in high-throughput gene prioritization studies, i.e. to drive hypotheses and restrict the dimensionality of functional discovery problems. This Perl Module, Bio::Homology::InterologWalk, is aimed at building putative PPI datasets on the basis of a number of comparative biology paradigms: the module implements a collection of computational biology algorithms based on the concept of "orthology projection". If interacting proteins A and B in organism X have orthologues A' and B' in organism Y, under certain conditions one can assume that the interaction will be conserved in organism Y, i.e. the A-B interaction can be "projected through the orthologies" to obtain a putative A'-B' interaction. The pair of interactions (A-B) and (A'-B') are named "Interologs".

Bio::Homology::InterologWalk collects, analyses and collates gene orthology data provided by the Ensembl Consortium as well as PPI data provided by EBI Intact. It provides the user with the possibility of rating the quality and reliability of the putative interactions collected by means of confidence scores, and optionally outputs network representations of the datasets, compatible with the biological network representation standard, Cytoscape.

USAGE

Rationale

                              \EBI Intact API/
         .--------------.            |             .-------------.
     (2) | A(e.g. mouse)|<------------------------>|   B(mouse)  |  (3)
         `--------------'          <PPI>           `-------------'
                ^                                         |
   /Ensembl\    | <Orthology>                 <Orthology> | \ Ensembl /
  / Compara \   |                                         |  \Compara/
 /    Api    \  |                                         |   \ Api /
                |                                         | 
         .--------------.                           .-------------.
     (1) | A'(e.g. fly) |. . . . . . . . . . . . .  |   B'(fly)   | (4)
         `--------------'     [SCORED]PUTATIVE PPI  `-------------'
                    (Output of Bio::Homology::InterologWalk)

In order to carry out an interolog walk we start with a set of gene identifiers in one organism of interest (1). We query those ids against a number of comparative biology databases to retrieve a list of orthologues for the gene ids of interest, in one or more species (2). In the next step we rely instead on PPI databases to retrieve the list of available interactors for the protein ids obtained in (2). The output at this stage consists of a list of interactors of the orthologues of the initial gene set, plus several fields of ancillary data (whose importance will be explained later) (3). In the last step of the process we will need to project the interactions in (3) - again using orthology data - back to the original species of interest. The final output is a list of putative interactors for the initial gene set, plus several fields of supporting data.

Bio::Homology::InterologWalk provides three main functions to carry out the basic walk, get_forward_orthologies() , get_interactions() and get_backward_orthologies(). These functions must be called strictly in sequential order in the user's script, as they process, analyse and attach data to the output in a pipeline-like fashion, i.e. working on the output of the preceding function.

get_forward_orthologies

This methods queries the initial gene list against one or more Ensembl DBs (using the Ensembl Perl API) and retrieves their orthologues, plus a number of ancillary data fields ( conservation data, distance from ancestor, orthology type, etc)

get_interactions

This queries the orthology list built in the previous stage against PSICQUIC-enabled PPI DBs using Rest. This step will enrich the dataset built through get_forward_orthologies with the interactors of those orthologues, if any, plus ancillary data (including several parameters describing the quality, nature and origin of the annotated interaction).

get_backward_orthologies

This queries the interactor list built in the previous stage against one or more Ensembl DBs (again using the Ensembl Perl API) to find orthologues back in the original species of interest. It will also adds a number of supplementary information fields, specularly to what done in get_forward_orthologies.

The output of this sequence of subroutines will be a TSV file containing zero or more entries, closely resembling the MITAB tab delimited data exchange format from the HUPO PSI (Proteomics Standards Initiative). Each row in the data file represents a binary putative interaction, plus currently 37 supplementary data fields.

This basic output can then be further processed with the help of other methods in the module: one can scan the results to compute counts, to check for duplicates, to verify the presence of new gene ids that were not present in the original dataset and save them in another datafile, and so on.

Most importantly, the user could need to further process the putative PPIs dataset to do one or more of the following:

  1. Compute a global confidence score to obtain a metric for the reliability of the each binary putative interaction

  2. Extract the binary putative PPIs from the dataset and save them in a format compatible with Cytoscape. This helps providing a visual quality to the result as one could then apply network analysis tools to discover motifs, clusters, well-connected subnetworks, look for GO functional enrichment, and more. The format chosen for the network representation of the dataset is currently .sif. (see http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual#Supported_Network_File_Formats) The generation of node attributes is also possible, to allow for visualisation of node tags in terms of simpler human readable labels instead of database IDs.

  3. Obtain a dataset of experimental/direct PPIs (i.e. just plain interactors, no orthology mapping across other taxa involved) from the gene list used as the input to the orthology walk. The reasons why this might be useful are several. The user might want to compare this dataset with the putative PPI dataset also generated by the module to see if/where the two overlap, what is the intersection/difference set, and more. See "get_direct_interactions" for documentation relative to this function. Please also notice a dataset of direct interactions will also be pre-requisite if the user intends to compute confidence values for the putative PPI dataset: the direct PPI dataset is required to compute score normalisation means.

EXAMPLES

In order to demonstrate one way of using the module, four example perl scripts are provided in the scripts/Code directory. Each sample script utilises the module and uses/reuses subroutines in a pipeline fashion. The workflow suggested with the scripts is as follows:

User Input: a textfile containing one gene ID per row. All gene IDs must belong to the same species. All gene IDs must be current Ensembl gene IDs.

1. Mine Direct Interactions.

Generate a dataset of direct PPIs based on the input ID list. See example in getDirectInteractions.pl

2. Run the basic Interolog-Walk Pipeline.

Generate dataset of projected putative PPIs following the paradigm explained earlier. Do some postprocessing on the dataset. See example in doInterologWalk.pl

3. Compute confidence scores for putative PPIs.

Score the dataset obtained in (2.) using the dataset obtained in (1.) to normalise the score components values. See example in doScores.pl

4. Extract network and attributes for the two PPI datasets.

For each of the two datasets obtained from (1) and (2) (putative PPIs) or from (1) and (3) (scored putative PPIs) extract a text file containing a network representation and a text file of node attributes.

See example in doNets.pl

DEPENDENCIES

Bio::Homology::InterologWalk relies on the following prerequisite packages.

Ensembl API

The Ensembl project is currently branched in two sub-projects:

The Ensembl Vertebrates project

This is of interest to you if you work with vertebrate genomes (although it also includes data from a few non-vertebrate common model organisms). See http://www.ensembl.org/index.html for further details.

The Ensembl Genomes project

This utilises the Ensembl software infrastructure (originally developed in the Ensembl Core project) to provide access to genome-scale data from non-vertebrate species. This is of interest to you if your species is a non-vertebrate, or if your species is a vertebrate but you also want to obtain results mapped from non-vertebrates. Bio::Homology::InterologWalk currently only supports the metazoa sub-site from the Ensembl Genomes Project. See http://metazoa.ensembl.org/index.html for further details.

IMPORTANT You will need to decide which Ensembl-DB set you will need prior to installing Bio::Homology::InterologWalk. The module requests that

Ensembl API Version == Ensembl-DB set version.

This means that if you install e.g. API V.58, you will only be able to get data from Ensembl Vertebrates / Metazoa databases V. 58. As the EnsemblGenomes DB releases are one version behind the Ensembl Vertebrate DB release, if you install the bleeding-edge Ensembl Vertebrate API, a matching EnsemblGenomes DB release might not be available yet: you will still be able to use Bio::Homology::InterologWalk to run an orthology walk using exclusively Ensembl Vertebrate DBs, but you will get an error if you try to choose metazoan databases. See "setup_ensembl_adaptor" for further information.

Therefore, before installing Bio::Homology::InterologWalk, you are faced with the following choice:

a)

If you are exclusively interested in vertebrates (plus the few non-vertebrate model organisms still present in Ensembl Vertebrates) then obtain the APIs and set up the environment by following the steps described on the Ensembl Vertebrates API installation pages:

http://www.ensembl.org/info/docs/api/api_installation.html

or alternatively

http://www.ensembl.org/info/docs/api/api_cvs.html

This option allows you to get the most recent datasets provided by Ensembl Core. However, you might not be able to query EnsemblCompara data.

b)

If you are interested in querying/getting back data from vertebrate + metazoan genomes, then obtain the APIs and set up the environment by following the steps described on the Ensembl Metazoa API installation pages: (this allows you to query across a wider selection of taxa)

http://metazoa.ensembl.org/info/docs/api/api_installation.html

or alternatively

http://metazoa.ensembl.org/info/docs/api/api_cvs.html

This option will probably not use the most recent API+DBs, but will guarantee functionality across both Vertebrate and Metazoan genomes.

Option (b) is the recommended one.

NOTE 1: All the API components (ensembl, ensembl-compara, ensembl-variation, ensembl-functgenomics) are required.

NOTE 2: The module has been tested on Ensembl Vertebrates API & DB v. 58 and v. 59 and EnsemblGenomes API & DB v. 5 (58).

Bioperl

Ensembl provides a customised Bioperl installation tailored to its API, v. 1.2.3. Should version 1.2.3 be no more available through Ensembl, please obtain release 1.6.x from CPAN. (while not officially supported by the Ensembl Project it will work fine when using the API within the scope of the present module)

Additional Perl Modules

The following modules (including all dependencies) from CPAN are also required:

1. REST::Client
2. GO::Parser
3. DBD::CSV (requires Perl DBI)
4. String::Approx

See the README file for further information.

INTERFACE

setup_ensembl_adaptor

 Usage     : $registry = Bio::Homology::InterologWalk::setup_ensembl_adaptor(
                                                             connect_to_db   => $ensembl_db,
                                                             source_org      => $sourceorg,
                                                             dest_org        => $destorg,
                                                             verbose         => 1
                                                             );
 Purpose   : This subroutine sets up the registry for connection to the Ensembl API and also gets 
             a species-dependent adaptor out of it
 Returns   : An Ensembl Registry object if successful, undefined in all other cases
 Argument  : -connect_to_db: ensembl db to connect to. Choices currently are:
                 a. 'multi' :  vertebrate compara (see http://www.ensembl.org/)
                 b. 'pan_homology' : pan taxonomic compara db, a selection of species from both 
                    Ensembl Compara and EnsemblGenomes Compara
                    (see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
                 c. 'metazoa' : EnsemblGenomes compara, metazoa db 
                    (see http://metazoa.ensembl.org/index.html). 
                 d. 'all'  : multi + metazoa.
                 Default is 'multi'.
             -source_org: the initial species for the interolog walk. This MUST match with your 
              choice of db. Exception is raised if not
             -(OPTIONAL) dest_org: the destination species to use for the interolog walk. This 
              MUST exist in your choice of db. "all" 
               chooses all the taxa offered by Ensemlb in that DB. Default is 'all'
             -(OPTIONAL) verbose: boolean, shows/hides connection info provided by Ensembl. 
              Default is '0' 
 Throws    : -
 Comment   : Currently the FULL SCIENTIFIC NAME of both the source organism and the destination 
             organism, as specified in Ensembl, is required.
             E.g.: 'Homo sapiens', 'Mus musculus', 'Drosophila melanogaster', etc. 
             Soon to be expanded to support short mnemonic names 
             (e.g.: 'Mmus' instead of 'Mus musculus')

See Also :

remove_duplicate_rows

 Usage     : $RC = Bio::Homology::InterologWalk::remove_duplicate_rows(
                                                       input_path   => $in_path,
                                                       output_path  => $out_path,
                                                       header       => 'standard',
                                                       );
 Purpose   : This is used to clean up a TSV data file of duplicate entries. 
             This routine will make sure no such duplicates are kept. A new datafile 
             is built. The number of unique data rows is updated. 
 Returns   : success/error
 Argument  : -input_path : path to input file. Input file for this subroutine is 
              a TSV file of PPIs. It can be one of the following two:
                 1. the output of get_backward_orthologies(). In this case please 
                    specify 'standard' header below.
                 2. the output of get_direct_interactions(). In this case please 
                    specify 'direct' header below.             
             -output_path : where you want the routine to write the data. Data is in 
              TSV format. 
             -(OPTIONAL)header : Header type is one of the following:
                 1. 'standard': when the routine is used to clean up an interolog-walk 
                    file (the header will be longer)
                 2. 'direct':   when the routine is used to clean up a file of real db 
                    interactions (the header is shorter)
               No field provided: default is 'standard'
 Throws    : -
 Comment   : -

See Also : "get_backward_orthologies", "get_direct_interactions"

get_forward_orthologies

 Usage     : $RC = Bio::Homology::InterologWalk::get_forward_orthologies(
                                                         registry      => $registry,
                                                         ensembl_db    => $ensembl_db,
                                                         input_path    => $in_path,
                                                         output_path   => $out_path,
                                                         source_org    => $sourceorg,
                                                         dest_org      => $destorg,
                                                         hq_only       => 1
                                                         no_output     => 0
                                                         );
 Purpose   : This is the core function to perform the orthology retrieval step of the 
             Interolog mapping algorithm. It will set up some important Ensembl components 
             and then proceed with the composition/computation of the values
 Returns   : success/error code
 Argument  : -registry object to connect to ensembl
             -ensembl db to connect to. Choices currently are:
                 a. 'multi' :  vertebrate compara (see http://www.ensembl.org/)
                 b. 'pan_homology' : pan taxonomic compara db, a selection of species 
                    from both Ensembl Compara and Ensembl Genomes 
                    (see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
                 c. 'metazoa' : ensembl compara genomes, metazoa db 
                    (see http://metazoa.ensembl.org/index.html). 
                 d. 'all'  : multi + metazoa.
                 Default is 'multi'.
             -input_path : path to input file. Input file MUST be a text file with one entry
              per row, each entry containing an up-to-date gene ID recognised by the Ensembl 
              consortium (http://www.ensembl.org/) followed by a new line char.
             -output_path : where you want the routine to write the data. Data is in TSV 
              format.
             -source organism name (eg: 'Mus musculus')
             -(OPTIONAL)destination organism name (eg 'Drosophila melanogaster'). Set this is 
              if you want to carry out the mapping through one 
              specific species, rather than all those available in Ensembl. Default : 'all'
             -(OPTIONAL)hq_only: discards one-to-many, many-to-one, many-to-many orthologues. 
              Only keeps one-to-one orthologues, i.e. where no duplication event has happened 
              after the speciation. One-to-one orthologues are ideally associated with higher 
              functional conservation (while paralogues often cause neo/sub-functionalisation). 
              For further information see
              http://www.ensembl.org/info/docs/compara/homology_method.html 
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output during 
              test. Default is 0.
 Throws    : -
 Comment   : 1)Currently the FULL SCIENTIFIC NAME of both the source species and the destination 
               species, as specified in Ensembl, is required.
               E.g.: 'Homo sapiens', 'Mus musculus', 'Drosophila melanogaster', etc. 
               Soon to be expanded to support short mnemonic names (e.g.: 'Mmus' instead of 
               'Mus musculus')
             2)EXPERIMENTAL: early support for human readable gene names in the input file has been 
               added. Such gene names will be checked against Ensembl so they must be recognisable 
               by it.

See Also :

get_interactions

 Usage     : $RC = Bio::Homology::InterologWalk::get_interactions(
                                                  input_path     => $in_path,
                                                  output_path    => $out_path,
                                                  url            => $url,
                                                  no_spoke       => 1, 
                                                  exp_only       => 1, 
                                                  physical_only  => 1,
                                                  no_output      => 0 
                                                  );
 Purpose   : this methods allows  to query the Intact database using the REST interface. 
             IntAct is the Molecular Interaction database at the European Bioinformatics 
             Institute (UK). The Intact project offers programmatic access to their data 
             through the PSICQUIC specification 
             (see http://code.google.com/p/psicquic/wiki/PsicquicSpecification).
             This subroutine interrogates via Rest the Intact PPI db with a list of ensembl
             gene ids (obtained usually from get_forward_orthologies()), obtains data in 
             the PSI-MI TAB format (see http://code.google.com/p/psimi/wiki/PsimiTabFormat), 
             processes it and appends it to the input data. 
 Returns   : success/failure code
 Argument  : -input_path : path to input file. Input file for this subroutine is the 
              output of get_forward_orthologies()
             -output_path : where you want the routine to write the data. Data is in TSV 
              format.
             -url : url for the REST service to query (currently only EBI Intact PSICQUIC 
              Rest)
             -(OPTIONAL) no_spoke: if set, interactions obtained from the expansion of 
              complexes through the SPOKE method 
              (see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D525)
              will be ignored
             -(OPTIONAL) exp_only: if set, only interactions whose MITAB25 field "Interaction 
              Detection Method" (MI:0001 in the PSI-MI controlled vocabulary) is at 
              least "experimental interaction detection" 
              (MI:0045 in the PSI-MI controlled vocabulary) will be retained. I.e. if set, 
              this flag only allows experimentally detected interactions to be retained and 
              stored in the data file
             -(OPTIONAL) physical_only: if set, only interactions whose MITAB25 field 
              "Interaction Type" (MI:0190 in the PSI-MI controlled vocabulary) is at least 
              "physical association" 
              (MI:0915 in the PSI-MI controlled vocabulary) will be retained. I.e. if set, 
              this flag only allows physically associated PPIs to be retained and stored 
              in the data file: colocalizations and genetic interactions will be discarded
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output 
              during test. Default is 0.
 Throws    : -
 Comment   : -will soon be extended to work with other PSICQUIC-enabled protein interaction 
              dbs (for a list, see 
              http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS)
             -need to merge with get_direct_interactions. Maybe create core sub, then share.

See Also : "get_forward_orthologies"

get_backward_orthologies

 Usage     : $RC = Bio::Homology::InterologWalk::get_backward_orthologies(
                                                          registry      => $registry,
                                                          ensembl_db    => $ensembl_db,
                                                          input_path    => $in_path,
                                                          output_path   => $out_path,
                                                          error_path    => $err_path,
                                                          source_org    => $sourceorg,    
                                                          hq_only       => $onetoone,
                                                          no_output     => 0
                                                          );
 Purpose   : this routine mines orthologues back into the organism of interest. It accepts 
             as an input a data file containing interactions in the destination organism(s) 
             and maps those back to the source organism through orthology. 
             Such orthologues represent the putative interactors of the original genes as 
             requested.
 Returns   : success/error
 Argument  : -registry: registry object for ensembl connection
             -ensembl db to connect to. Choices currently are:
                 a. 'multi' :  vertebrate compara (see http://www.ensembl.org/)
                 b. 'pan_homology' : pan taxonomic compara db, a selection of species from 
                    both Ensembl Compara and Ensembl Genomes 
                    (see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D563 )
                 c. 'metazoa' : ensembl compara genomes, metazoa db 
                    (see http://metazoa.ensembl.org/index.html). 
                 d. 'all'  : multi + metazoa.
                 Default is 'multi'.
             -input_path : path to input file. Input file for this subroutine is the output 
              of get_interactions().
             -output_path : where you want the routine to write the data. Data is in TSV 
              format.
             -(OPTIONAL)error_path: each query to intact through psicquic returns a data 
               entry including a binary protein interaction. The two ids returned are, most 
               of the times, uniprotkb ids. Sometimes, however, Intact annotates its binary 
               interactions using an internal, proprietary ID (e.g.: EBI-1080281 ). While the 
               Ensembl API recognises UniprotKB IDs,it won't recognise these Intact IDs. Entries 
               annotated in such a way cannot therefore be completed. If error_path is present, 
               it indicates a file where the routine will dump all such failed entries for later 
               manual inspection.
             -source organism name (eg: "Mus musculus")
             -(OPTIONAL)hq_only: discards one-to-many, many-to-one, many-to-many orthologues. 
              Only keeps one-to-one orthologues, i.e. where no duplication event has happened 
              after the speciation. One-to-one orthologues are ideally associated with higher 
              functional conservation (while paralogues often cause neo/sub-functionalisation). 
              For further information see
              http://www.ensembl.org/info/docs/compara/homology_method.html 
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output during 
              test. Default is 0.
 Throws    : -
 Comment   : Destination species is automatically dealt with on a case-to-case basis.
           : 'ensembl_db' must be the same  for all the other subroutines in the pipeline

See Also : "get_interactions"

do_counts

 Usage     : $RC = Bio::Homology::InterologWalk::do_counts(
                                           input_path  => $in_path,
                                           output_path => $out_path,
                                           header      => 'standard',
                                           no_output   => 0
                                           );
 Purpose   : The purpose of this routine is to scan the data produced by get_backward_orthologies() 
             or get_direct_interactions() (optionally cleaned up of duplicates by 
             remove_duplicate_rows() ) and compute counts/statistics useful for scoring purposes.
             In short, the subroutine:
             1)evaluates if an interaction has been obtained through more than one detection method
             2)evaluates if an interaction has been obtained through more than one taxon
             3)COUNTS the number of *unique* putative interactions found: remember that the same 
               interaction can be retrieved through several different interacting destination-species 
               orthologues. This script also adds the retrieved "number seen" number and appends it 
               to the TSV file. 
             4)flags the entry with Y if the putative interaction is an autointeraction
             5)flags the entry if the real interaction in the destination species (the one we are 
               mapping from) is an autointeraction
             The routine rewrites the input file in a new file, adding 1 or more data fields 
             (depending on the 'header' argument) containing the results of the count.
 Returns    : success/fail
 Argument   : -input_path : path to input file. Input file for this subroutine is a TSV file of PPIs. 
              It can be one of the following two:
                 1. the output of get_backward_orthologies(). In this case please specify 'standard' 
                    header below.
                 2. the output of get_direct_interactions(). In this case please specify 'direct' 
                    header below.   
              It is advisable to pre-process the input by using remove_duplicate_rows() prior to 
              this routine.          
             -output_path : where you want the routine to write the data. Data is in TSV format. 
             -(OPTIONAL) header : Header type is one of the following:
                 1. 'standard': when the routine is used to compute counts on an interolog-walk file 
                    (the header will be longer)
                 2. 'direct':   when the routine is used to compute counts on a real db interactions 
                    file (the header is shorter)
               No field provided: default is 'standard'
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output during test. 
              Default is 0.
 Throws    : -
 Comment   : -

See Also : "get_backward_orthologies", "get_direct_interactions", "remove_duplicate_rows"

extract_unseen_ids

 Usage     : $RC = Bio::Homology::InterologWalk::extract_unseen_ids(
                                                    start_path    => $start_data_path,
                                                    input_path    => $in_path,
                                                    output_path   => $out_path,
                                                    hq_only       => $onetoone,
                                                    );
 Purpose   : it is often desirable to know if the interolog procedure found new ids at all 
             (i.e. not present in the starting dataset). Such new ids can then be analysed 
             further, ie. sent through GO term enrichment analysis, etc, to provide some 
             validation, see if they have been know before to belong to some specific process, 
             check if no function is associated to them at all.
             This script will create a simple textfile containing all the new ids discovered.
             This script is meant to be employed as a last step in the pipeline. It also 
             computes some simple statistics as follows:
             1. The list of NEW ids, ie those not present in the initial data file
             2. The frequencies, new vs total, old vs total
             3. the frequencies of new when the Expansion Method is not spoke and when orthology 
             is one_to_one (i.e.: new ids with high reliability)
 Returns   : Success/Fail
 Argument  : -start_path: path to the original text file with the ids of interest (the same file 
              given to get_forward_orthologies() as input)
             -input_path : path to input file. Input file for this subroutine is the output of 
              do_counts().
             -output_path : where you want the routine to write the data. Data is in TSV format.
             -(OPTIONAL)hq_only : if this is set, only entries mapped exclusively through 
              one-to-one orthologies will be taken into account.
 Throws    : -
 Comment   : -

See Also : "get_forward_orthologies", "do_counts"

parse_ontology

 Usage     : $onto_graph = 
                 Bio::Homology::InterologWalk::Scores::parse_ontology(
                                                                    $ont_path
                                                                    );
 Purpose   : This subroutine accepts one input, a path to a PSI-MI ontology file. 
             It uses GO::Parser to parse the file and returns a graph object of 
             the ontology: a structured graph-representation of it, that we can 
             walk and explore. This is useful when we need to look at the detection 
             method and at the interaction type for each entry. E.g. we might be in-
             terested in all interactions tagged generically with "experimental 
             detection method" but also in all the interactions tagged with 
             a *specific* detection method (ie a specialised subclass of the concept 
             "experimental detection method").
             Analysing the structure of the ontology through the graph returned by 
             this method helps in doing that.
 Returns   : A graph object containing the structured ontology
 Argument  : The path to the psi-mi ontology file
 Throws    : -
 Comment   : -

See Also : "get_mean_scores"

get_mean_scores

 Usage     : ($m_em, $m_it, $m_dm, $m_mdm) = 
             Bio::Homology::InterologWalk::Scores::get_mean_scores(
                                                             $intact_path,
                                                             $onto_graph
                                                             );
 Purpose   : This is used to compute suitable mean values to normalise the components of 
             the score for
             - interaction type, 
             - interaction detection method
             - experimental method
             - multiple detection method.
             Each value is the MEAN value for the corresponding score computed on the set of 
             direct experimental interactions for the initial dataset. These are used to 
             normalise the scores obtained for the corresponding putative interactions.
 Returns   : a list of four numbers: 
             1. mean experimental method score
             2. mean interaction type score, 
             3. mean detection method score
             4. mean multiple dm score
 Argument  : 1) path to a tsv file of REAL intact interactions 
                (generated by get_direct_interactions())
             2) a graph representation of the obo PSI MI ontology 
                (generated by parse_ontology() )
 Throws    : -
 Comment   : -

See Also : "parse_ontology", "get_direct_interactions"

compute_scores

 Usage     : $RC = Bio::Homology::InterologWalk::Scores::compute_scores(
                                                        input_path        => $in_path,
                                                        score_path        => $score_path,
                                                        output_path       => $out_path,
                                                        term_graph        => $onto_graph,
                                                        meanscore_em      => $m_em,
                                                        meanscore_it      => $m_it,
                                                        meanscore_dm      => $m_dm,
                                                        meanscore_me_dm   => $m_mdm,
                                                        meanscore_me_taxa => $m_mtaxa,
                                                        no_output         => 0
                                                        );
 Purpose   : This is used to analyse several ancillary data fields obtained alongside the actual 
             putative PPI IDs and collate them into a global confidence score, which should provide 
             a measure of the reliability of each putative PPI. The score will take into account 
             a number of variables related to each of the steps involved in the orthology walk. 
             We can divide the meta-score related components in two broad classes:
             - parameters related to the interaction. These include: Interaction Type, Interaction 
               Detection Method, Interaction coming from a SPOKE-expanded complex, interaction recon-
               firmed through multiple taxa, interaction reconfirmed through multiple detection methods
             - parameters related to the two orthology mappings. These include: orthology type 
               (one-to-one, one-to-many, many-to-one, many-to-many), OPI (percentage identity of the 
               conserved columns - see Bio::SimpleAlign), node to node distance, distance from the 
               first shared ancestor, (under development) dN/dS ratio
             The score computation will also involve a normalisation stage. The subroutine requires 
             five arguments (meanscore_x) representing mean values to be used for normalisation.
             The actually means are computed in get_mean_scores(), which is pre-requisite to 
             compute_scores().
 Returns   : success/failure
 Argument  : -input_path : path to the input tsv file. A suitable input for this subroutine is the 
             final output of the orthology walk pipeline (see doInterologWalk.pl for usage guidelines).
              input file should have .06out extension
             -(OPTIONAL) score_path : path to  text file where scores will be saved one per row 
              (useful for looking at score distributions eg through matlab)
              output textfile has a .scores extension 
             -output_path : where you want the routine to write the data. Data is in TSV format. 
              File extension is .07out
             -term_graph :  a Go::Parser graph object obtained from parse_ontology() containing a 
              network representation of the PSI-MI controlled vocabulary of terms.
             -meanscore_em : mean experimental method score for normalisation
             -meanscore_it : mean interaction type score for normalisation  
             -meanscore_dm : mean detection method score for normalisation   
             -meanscore_me_dm : mean 'multiple detection methods' score for normalisation
             -meanscore_me_taxa : mean 'multiple taxa' score for normalisation
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output during test. 
              Default is 0.
 Throws    : -
 Comment   : -

See Also : http://search.cpan.org/~cjfields/BioPerl-1.6.1/Bio/SimpleAlign.pm#overall_percentage_identity, "get_mean_scores", doScores.pl for sample usage

compute_multiple_taxa_mean

 Usage     : $m_mtaxa = Bio::Homology::InterologWalk::Scores::compute_multiple_taxa_mean(
                                                  ds_size    => 500,   
                                                  ds_number  => 3,    
                                                  datadir    => $path     
                                                  );
 Purpose   : Suppose you want to run an interolog walk starting from initial interactor X. 
             You get a final putative interactor Y.
             Now suppose the putative interactor is the output of more than an interolog walk,
             each one based on a interaction annotated in a different organism.
             When building the score, we would like to account for the fact that a putative 
             interaction obtained through interactions in multiple species is more reliable than 
             one obtained through only one species. In order to weight the Multiple_Taxa Score, 
             however, a long procedure is required. It is not possible to use a mean taken from 
             the direct Intact interactions data file, for obvious reasons. 
             My solution can be currently summarised as follows:
             1. choose n<7 random taxa from a vector containing 7 well-supported NCBI taxa id;
             2. choose m (~500) random genes for each of the n taxa
             3. run the full orthology walk using the methods from Bio::Homology::InterologWalk 
                on each of the n datasets
             4. compute a mean_multiple taxa score for each of them
             5. Final Mean_Multiple_Taxa_Score = mean(M_1,M_2,M_3) 
             The procedure is LONG and SLOW and might lead to Ensembl refusing connections 
             in some instances.
 Returns   : the mean multiple taxa score, i.e. the global mean of the multiple taxa scores 
             obtained for each random dataset
 Argument  : ds_size : number of ids per dataset, eg 500
             ds_number : a number between 1 and 7, equal to the number of taxa to randomly 
                         pick from
             datadir :  work directory
 Throws    : -
 Comment   : #TODO should randomised data be saved and reused?
             Lots of hard coded stuff in here. Intact url is hard coded, etc. 
             Need to review.

See Also :

do_network

 Usage     : $RC = Bio::Homology::InterologWalk::Networks::do_network(
                                                      registry       => $registry,
                                                      input_path     => $in_path,
                                                      output_path    => $out_path,
                                                      source_org     => $sourceorg,
                                                      orthology_type => $orthtype,
                                                      expand_taxa    => 1,
                                                      ensembl_db     => $ensembl_db,
                                                      no_output      => 0
                                                      );
 Purpose   : This function  writes a .SIF file according to this cytoscape specification in:
             http://cytoscape.org/cgi-bin/moin.cgi/Cytoscape_User_Manual/Network_Formats.
             For each input data row, the subroutine will extract the initial id, the 
             (putative) PPI found, taxon information(optional), score (if present). 
             It will look up the ID pair on the Ensemble API and obtain gene names. 
             It will output a TSV file with .sif extension.
             The routine can expand taxa information: it might be useful in some cases to 
             know the taxon from which the putative PPI has been mapped
             Eg.: instead of  A--B (default behaviour) one can decide to get A-mouse-B, 
             A-human-B, A-fly-B, etc.
             The routine should work both with a putative PPIs data file and with a direct
             interactions data file 
             (it will look at the input file header to decide what it is dealing with).
 Returns   : success/ failure
 Argument  : -registry: ensembl registry object to connect to. Needed to retrieve up-to-date
              human readable gene names from Ensembl for the IDs in the input data file
             -input_path : path to input file. Input file for this subroutine is a tsv file 
              containing at least the fields INIT_ID and INTERACTOR.
              (output of get_backward_orthologies() or get_direct_interactions will work, 
              although output of remove_duplicate_rows() is recommended). 
              Optionally, if interaction scores are desired, input fill will have to be the 
              output of the scoring pipeline (see example file doScores.pl)
             -output_path : where you want the routine to write the data. Data is a TSV .sif 
              cytoscape file.
             -source organism name (eg: "Mus musculus")
             -(OPTIONAL) orthology_type: can be set to 'onetoone': if so, only entries 
              obtained through "one to one" orthology projections will be retained in the output. 
              Default: all orthologies retained.
             -(OPTIONAL) expand_taxa: if true, information related to the species from which 
              the putative PPI has been projected will be retained.
              if this is set to true, ensemb_db MUST also be set
             -(OPTIONAL) ensembl_db: only required if expand_taxa is set. For allowed values, 
              see get_forward_orthologies()
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output during
              test. Default is 0.
 Throws    : -
 Comment   : in order to account for the fact that the edges of the network are undirected 
             (eg A-B = B-A), I can either
             1) cache both couples (eg (A,B) and (B,A)) so I'll find them both when I look up
             2) do a lexicographic sorting before caching and before looking-up the cache
             I use the second option at the moment.

See Also : "remove_duplicate_rows", "compute_scores", "get_forward_orthologies"

do_attributes

 Usage     : $RC = Bio::Homology::InterologWalk::Networks::do_attributes(
                                                         registry      => $registry,
                                                         input_path    => $in_path,
                                                         output_path   => $out_path,
                                                         source_org    => $sourceorg,
                                                         label_type    => 'extname',
                                                         no_output      => 0
                                                         );
 Purpose   : This is needed to create a node attribute file to go with the .sif 
             network created by do_networks(). 
             For a definition of node attribute file, see
             http://cytoscape.wodaklab.org/wiki/Cytoscape_User_Manual#Node_and_Edge_Attributes 
             The routine associates, for each stable id in the sif file, a human-readable 
             gene name/description obtained from Ensembl
 Returns   : a boolean value for success/failure
 Argument  : -registry: ensembl registry object to connect to. Needed to retrieve up-to-date
              human readable gene names from Ensembl for the IDs in the input data file
             -input_path : path to input file. Input file for this subroutine is a tsv file 
              containing at least the fields INIT_ID and INTERACTOR.
              (output of get_backward_orthologies() or get_direct_interactions will work, 
              although output of remove_duplicate_rows() is recommended). 
             -output_path : where you want the routine to write the data. Data is a .noa 
              cytoscape file.
             -source_org : source organism name (eg: "Mus musculus")
             -(OPTIONAL) label_type: what kind of human readable string to employ. Options 
              are 'extname' (external name) and 'description'. Default is 'extname'
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output 
              during test. Default is 0.
 Throws    : -
 Comment   : -

See Also : "do_network"

get_direct_interactions

 Usage     : $RC = Bio::Homology::InterologWalk::Direct::get_direct_interactions(
                                                                 registry        => $registry,
                                                                 source_org      => $sourceorg,
                                                                 input_path      => $in_path,
                                                                 output_path     => $out_path,
                                                                 url             => $url,
                                                                 check_ids       => 1,   
                                                                 no_spoke        => 1, 
                                                                 exp_only        => 1, 
                                                                 physical_only   => 1, 
                                                                 no_output       => 0 
                                                                 );
 Purpose   : this methods allows  to query the Intact database using the REST interface. 
             IntAct is the Molecular Interaction database at the European Bioinformatics 
             Institute (UK). The Intact project offers programmatic access to their data 
             through the PSICQUIC specification (see 
             http://code.google.com/p/psicquic/wiki/PsicquicSpecification).
             This routine is different and more complex than get_interactions() from the 
             main module. This one is meant to query intact directly with the ids provided 
             by the user: no intermediate orthologues from ensembl are collected.
             The bulk of the script is used for the following reason: each query to intact 
             through psicquic returns a data entry including a binary protein interaction, 
             and the the two ids returned are uniprotkb or other protein ids. 
             We need to
                a- convert both to a format recognised by ensembl
                b- identify which of the two corresponds to our initial id
                c- convert the other one to ensembl and store it in the file
             This conversion is not trivial as the possibility of ambiguities/errors/wrong 
             matches between ensembl gene representations and uniprot protein representations 
             is high.
 Returns   : return code for error/success 
 Argument  : -registry: registry object to connect to Ensembl
             -source_org : source organism name (eg: "Mus musculus")
             -input_path : path to input file. Input file MUST be a text file with one entry 
              per row, each entry containing an up-to-date
              gene ID recognised by the Ensembl consortium (http://www.ensembl.org/) followed 
              by a new line char.
             -output_path : where you want the routine to write the data. Data is in TSV format.
             -url : url for the REST service to query (currently only EBI Intact PSICQUIC Rest)
             -(OPTIONAL) check_ids : if true, every interactor id found in intact data will 
              be double checked against ensembl.
              this is useful because intact dbs sometimes contain obsolete versions of some 
              ids. However chosing true will significantly slow down the processing
             -(OPTIONAL) no_spoke: if set, interactions obtained from the expansion of 
               complexes through the SPOKE method 
              (see http://nar.oxfordjournals.org/cgi/content/full/38/suppl_1/D525)
              will be ignored
             -(OPTIONAL) exp_only: if set, only interactions whose MITAB25 field 
              "Interaction Detection Method" 
              (MI:0001 in the PSI-MI controlled vocabulary) is at least "experimental 
              interaction detection" 
              (MI:0045 in the PSI-MI controlled vocabulary) will be retained. I.e. if set, 
              this flag only allows 
              experimentally detected interactions to be retained and stored in the data file
             -(OPTIONAL) physical_only: if set, only interactions whose MITAB25 field 
              "Interaction Type" 
              (MI:0190 in the PSI-MI controlled vocabulary) is at least "physical association" 
              (MI:0915 in the PSI-MI controlled vocabulary) will be retained. I.e. 
              if set, this flag only allows 
              physically associated PPIs to be retained and stored in the data file: 
              colocalizations and genetic interactions will be discarded
             -(OPTIONAL) no_output :  suppresses screen output. Used for clearer output 
              during test. Default is 0.
 Throws    : -
 Comment   : -

See Also : "get_interactions"

BUGS AND LIMITATIONS

This is ALPHA software. There will be bugs. The interface may change. Please be careful. Do not rely on it for anything mission-critical.

Please report any bugs you find, bug reports and any other feedback are most welcome.

-Currently only the EBI Intact DB is available for PPI retrieval. This will be expanded to account for all available PSICQUIC-compliant PPI dbs transparently. This includes MINT, STRING, BioGrid and many more. For a full list of compliant DBs and for the status of the PSICQUIC service, check

http://www.ebi.ac.uk/Tools/webservices/psicquic/registry/registry?action=STATUS

AUTHOR

Giuseppe Gallone <ggallone@cpan.org>

CPAN ID: GGALLONE

University of Edinburgh

LICENSE AND COPYRIGHT

Bio::Homology::InterologWalk is Copyright (c) 2010 Giuseppe Gallone All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.10.1 or, at your option, any later version of Perl 5 you may have available.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ACKNOWLEDGEMENTS

The author would like to thank the following individuals and organisations for their invaluable support and priceless suggestions.

Andrew Jarman, Ian Simpson, Douglas Armstrong and all the Jarman Lab, University of Edinburgh

Javier Herrero, Albert Vilella, Andy Yates, Glenn Proctor, Michael Han, Gautier Koscielny and all the Ensembl Team

Samuel Kerrien & Bruno Aranda and all the EBI-Intact Team

Dave Messina, BioPerl list