Data::Babel - Translator for biological identifiers
Version 1.13
use Data::Babel; use Data::Babel::Config; use Class::AutoDB; use DBI; # open database containing Babel metadata my $autodb=new Class::AutoDB(database=>'test'); # try to get existing Babel from database my $babel=old Data::Babel(name=>'test',autodb=>$autodb); unless ($babel) { # Babel does not yet exist, so we'll create it # idtypes, masters, maptables are names of configuration files that define # the Babel's component objects $babel=new Data::Babel (name=>'test',idtypes=>'examples/idtype.ini',masters=>'examples/master.ini', maptables=>'examples/maptable.ini'); } # open database containing real data my $dbh=DBI->connect("dbi:mysql:database=test",undef,undef); # CAUTION: rest of SYNOPSIS assumes you've loaded the real database somehow # translate several Entrez Gene ids to other types my $table=$babel->translate (input_idtype=>'gene_entrez', input_ids=>[1,2,3], output_idtypes=>[qw(gene_symbol gene_ensembl chip_affy probe_affy)]); # print a few columns from each row of result for my $row (@$table) { print "Entrez gene=$row->[0]\tsymbol=$row->[1]\tEnsembl gene=$row->[2]\n"; } # same translation but limit results to Affy hgu133a my $table=$babel->translate (input_idtype=>'gene_entrez', input_ids=>[1,2,3], filters=>{chip_affy=>'hgu133a'}, output_idtypes=>[qw(gene_symbol gene_ensembl chip_affy probe_affy)]); # generate a table mapping all Entrez Gene ids to UniProt ids my $table=$babel->translate (input_idtype=>'gene_entrez', output_idtypes=>[qw(protein_uniprot)]); # convert to HASH for easy programmatic lookups my %gene2uniprot=map {$_[0]=>$_[1]} @$table; # count number of Entrez Gene ids represented on Affy hgu133a my $count=$babel->count (input_idtype=>'gene_entrez',filters=>{chip_affy=>'hgu133a'}); # tell which input ids are valid my $table=$babel->validate (input_idtype=>'gene_entrez', input_ids=>[1,2,3]); # print validity status of each for my $row (@$table) { my($input_id,$valid,$current_id)=@$row; print "Entrez gene $input_id is ", ($valid? "valid with current value $current_id": 'invalid'),"\n"; }
Data::Babel translates biological identifiers based on information contained in a database. Each Data::Babel object provides a unique mapping over a set of identifier types. The system as a whole can contain multiple Data::Babel objects; these may share some or all identifier types, and may provide the same or different mappings over the shared types.
The principal method is "translate" which converts identifiers of one type into identifiers of one or more output types. In typical usage, you call "translate" with a list of input ids to convert. You can also call it without any input ids (or with the special option "input_ids_all" set) to generate a complete mapping of the input type to the output types. This is convenient if you want to hang onto the mapping for repeated use. You can also filter the output based on values of other identifier types.
Comparisons are done in a case insensitive manner. This includes input ids, filters, and internal comparisons used to join database tables. For example, when translating the gene symbol 'HTT' (the human Huntington Disease gene), you will also get information on gene symbol 'Htt' (the mouse and rat ortholog of the human gene) assuming, of course, this information is in the database.
CAVEAT: Some features of Data::Babel are overly specific to the procedure we use to construct the underlying Babel database. We note such cases when they arise in the documentation below.
The main components of a Data::Babel object are
a master list of valid values for the type, and
optionally, a history mapping old values to current ones
One typically defines these components using configuration files whose basic format is defined in Config::IniFiles. See examples in "Configuration files" and the examples directory of the distribution.
Each MapTable represents a relational table stored in the database and provides a mapping over a subset of the Babel's IdTypes; the ensemble of MapTables must, of course, cover all the IdTypes. The ensemble of MapTables must also be non-redundant as explained in "Technical details".
MapTables must always contain current identifiers, even for IdTypes that have histories (more precisely, for IdTypes whose Masters have histories). The query or program that loads the database is responsible for mapping old identifiers to current ones (presumably via the history).
"translate" checks the input IdType to see if its Master has history information. If so, "translate" automatically applies the history to all input ids. It does the same for filters.
You need not explicitly define Masters for all IdTypes; Babel will create 'implicit' Masters for any IdTypes lacking explicit ones. An implicit Master has a list of valid identifiers but no history and could be implemented as a view over all MapTables containing the IdType. In the current implementation, we use views for IdTypes contained in single MapTables but construct actual tables for IdTypes contained in multiple MapTables.
Our configuration files use 'ini' format as described in Config::IniFiles: 'ini' format files consist of a number of sections, each preceded with the section name in square brackets, followed by parameter names and their values.
There are separate config files for IdTypes, Masters, and MapTables. There are complete example files in the distribution. Here are some excerpts:
IdType
[chip_affy] display_name=Affymetrix array referent=array defdb=affy meta=name format=/^[a-z]+\d+/ sql_type=VARCHAR(32)
The section name is the IdType name. The parameters (all optional) are
display_name. human readable name for this type
referent. the type of things to which this type of identifier refers
defdb. the database, if any, responsible for assigning this type of identifier
meta. some identifiers are purely synthetic (eg, Entrez gene IDs) while others have some mnemonic content; legal values are
eid (meaning synthetic)
symbol
name
description
format. Perl format of valid identifiers
sql_type. SQL data type
As of version 1.11, it is also possible to specify "history" for an IdType. Previously, you could only specify "history" for the IdType's Master.
Master
[gene_entrez_master] history=1
The section name is the Master name; the name of the IdType is the same but without the '_master'. If there is no history, the section can be empty, eg,
[probe_id_master]
A Master without history is implemented as a one column table whose column has the same name as the IdType.
A Master with history is implemented as a two column table: one column has the same name as the IdType and the other has the prefix '_X_' prepended to the IdType. The '_X_' column contains ids that were valid in the past or are valid now. Each row maps the '_X_' id to its current value, if any, or NULL. For ids that are valid now, the table contains a row in which the '_X_' and current versions are the same.
MapTable
[gene_entrez_information] idtypes=gene_entrez gene_symbol gene_description organism_name_common
The section name is the MapTable name. The idtypes parameter tells the names of the IdTypes that the MapTable contains.
By default, the "translate" method does not return any output for input identifiers that do not connect to any identifiers of the desired output types; these are output rows in which the output columns are all NULL. You can instruct "translate" to include these rows in the result by setting the "validate" option.
An input identifier can fail to connect for two reasons:
If you set the "validate" option, the output will contain at least one row for each input identifier, and an additional column that indicates whether the input identifier is valid.
If no output IdTypes are specified, "translate" returns a row containing one element, namely, the input identifier, for each input id that exists in the corresponding Master table. If the "validate" option is set, the output will contain one row for each input identifier; this is essentially a (possibly re-ordered) copy of the input list with duplicates removed.
A partial duplicate is a row that contains less information than another row and is therefore redundant. More precisely, a row is a partial duplicate of another row if for all fields (1) the rows are identical or, (2) the field in the partial duplicate is NULL. In the example below, the second row is a partial duplicate of the first.
gene_symbol organism_name gene_entrez probe_id HTT human 3064 A_23_P212749 HTT human 3064
By default, "translate" removes partial duplicates. The algorithm for removing partial duplicates may be slow for queries with a large number of output columns in cases where a given input id matches a large number of output ids. To retain partial duplicates, you can specify the 'keep_pdups' option to "translate".
A basic Babel property is that translations are stable. You can add output types to a query without changing the answer for the types you had before, you can remove output types from the query without changing the answer for the ones that remain, and if you "reverse direction" and swap the input type with one of the outputs, you get everything that was in the original answer.
We accomplish this by requiring that the database of MapTables satisfies the universal relation property (a well-known concept in relational database theory), and that "translate" retrieves a sub-table of the universal relational. Concretely, the universal relational is the natural full outer join of all the MapTables. "translate" performs natural left out joins starting with the Master table for the input IdType and then including enough tables to connect the input, output, and filter IdTypes. Left outer joins suffice, because "translate" starts with the Master.
We further require that the database of MapTables be non-redundant. The basic idea is that a given IdType may not be present in multiple MapTables, unless it is being used as join column. More technically, we require that the MapTables form a tree schema (another well-known concept in relational database theory), and any pair of MapTables have at most one IdType in common. As a consequence, there is essentially a single path between any pair of IdTypes.
To represent the connections between IdTypes and MapTables we use an undirected graph whose nodes represent IdTypes and MapTables, and whose edges go between each MapTable and the IdTypes it contains. In this representation, a non-redundant schema is a tree.
"translate" uses this graph to find the MapTables it must join to connect the input, output, and filter IdTypes. The algorithms is simple: start at the leaves and recursively prune back branches that do not contain the input or output IdTypes.
Title : new Usage : $babel=new Data::Babel name=>$name, idtypes=>$idtypes,masters=>$masters,maptables=>$maptables Function: Create new Data::Babel object or fetch existing object from database and update its components. Store the new or updated object. Returns : Data::Babel object Args : name eg, 'test' idtypes, masters, maptables define component objects; see below old existing Data::Babel object in case program already fetched it (typically via 'old') autodb Class::AutoDB object for database containing Babel. class method often set before running "new" Notes : "name" is required. All other args are optional
The component object parameters can be any of the following:
Title : old Usage : $babel=old Data::Babel($name) -- OR -- $babel=old Data::Babel(name=>$name) Function: Fetch existing Data::Babel object from database Returns : Data::Babel object or undef Args : name of Data::Babel object, eg, 'test' if keyword form used, can also specify autodb to set the corresponding class attribute
The available object attributes are
name eg, 'test' id name prefixed with 'babel', eg, 'babel:test'. not really used. exists for compatibility with component objects idtypes ARRAY of this Babel's Data::Babel::IdType objects masters ARRAY of this Babel's Data::Babel::Master objects maptables ARRAY of this Babel's Data::Babel::MapTable objects
The available class attributes are
autodb Class::AutoDB object for database containing Babel
Title : translate Usage : $table=$babel->translate (input_idtype=>'gene_entrez', input_ids=>[1,2,3], filters=>{chip_affy=>'hgu133a'}, output_idtypes=>[qw(transcript_refseq transcript_ensembl)], limit=>100) Function: Translate the input ids to ids of the output types Returns : table represented as an ARRAY of ARRAYS. Each inner ARRAY is one row of the result. The first element of each row is an input id. If the validate option is set, the second element of each row indicates whether the input id is valid. The rest are outputs in the same order as output_idtypes Args : input_idtype name of Data::Babel::IdType object or object input_ids id or ARRAY of ids to be translated. If absent or undef, all ids of the input type are translated. If an empty array, ie, [], no ids are translated and the result will be empty. input_ids_all boolean. If true, all ids of the input type are translated. Same as omitting input_ids or setting it to undef but more explicit. output_idtypes ARRAY of names of Data::Babel::IdType objects or objects filters specification of conditions limiting the output; see below. validate boolean. If true, the output will contain at least one row for each input id and an additional column indicating whether the input id is valid. limit maximum number of rows to retrieve count boolean. If true, return number of output rows rather than the rows themselves. Equivalent to "count" method. keep_pdups boolean. If true, partial duplicates are not removed from the result.
"translate" retains duplicate output columns.
The order of output rows is arbitrary.
If input_ids is absent or undef, it translates all ids of the input type.
Duplicate input_ids are ignored.
If input_ids is an empty ARRAY, ie, [], the result will be empty.
It is an error to set both input_ids and input_ids_all.
It is legal to specify a filter on the input idtype. This constrains the input ids to ones that pass the filter and may be especially useful when processing all input ids,
Input and filter ids can be old (valid in the past) or current (valid now). Output ids are always current.
By default, "translate" does not return rows in which the output columns are all NULL. Setting "validate" changes this and ensures that every input id will appear in the output.
If "count" and "limit" both set, the result is the number of output rows after the limit is applied and will always be <= the limit.
If "validate" and "limit" both set, the result may not contain all input ids if to do so would produce more rows than the limit. This defeats one of the purposes of "validate", namely to ensure that all input ids appear in the output.
If "count" and "validate" both set, the result is the number of output rows including ones added by "validate", ie, rows with in which all output columns are NULL.
If "validate" and 'filters' both set, the result may contain input ids excluded by the filter. These rows will have NULLs in all output columns.
If no output idtypes are specified, the output will contain one row for each valid input id (by default) or one row for each id whether valid or not (if "validate" is set).
Comparisons are case insensitive. This includes input ids, filters, and internal comparisons used to join database tables. For example, when translating the gene symbol 'HTT' (the human Huntington Disease gene), you will also get information on gene symbol 'Htt' (the mouse and rat ortholog of the human gene) assuming, of course, this information is in the database.
The "filters" argument is typically a HASH or ARRAY of idtypes and conditions on those idtypes. See "Summary of filters argument" for a full description of what can be used as the "filters" argument. The idtypes can be names of IdType objects or objects themselves. The conditions can be ids or SQL fragments. We process this information to create SQL expressions that can be used in the WHERE clause of the query generated by "translate".
Here are some examples.
filters=>{chip_affy=>'hgu133a'} filters=>{chip_affy=>['hgu133a','hgu133plus2']} filters=>{chip_affy=>[\"LIKE 'hgu%'",'mgu74a'],pathway_kegg_id=>4610} filters=>{chip_affy=>[\"LIKE 'hgu%'",\"LIKE 'mgu74a'"],pathway_kegg_id=>undef} filters=>{chip_affy=>'hgu133a',pathway_kegg_id=>[undef,4610]}
If the argument is an ARRAY, it is possible for the same idtype to appear multiple times. If so, the conditions are combined. For example,
filters=>[chip_affy=>'hgu133a',chip_affy=>'hgu133plus2']
is equivalent to
filters=>{chip_affy=>['hgu133a','hgu133plus2']}
If the filter condition is an empty ARRAY, ie, [], the result will be empty.
If the condition contains multiple ids, we combine them into a single SQL IN clause. For example, this idtype=>conditions pair
chip_affy=>['hgu133a','hgu133plus2']
generates this SQL
chip_affy IN ('hgu133a','hgu133plus2')
As noted in "Notes on translate", comparisons are case insensitive.
If a filter condition is undef, all ids of the given type are acceptable. This limits the output to rows for which the filter type is not NULL. This usage is analogous to what it means for "input_ids" to be undef. For example,
$babel->translate(input_idtype=>'gene_entrez', filters=>{pathway_kegg_id=>undef}, output_idtypes=>[qw(gene_symbol)])
generates a table of all Entrez Gene ids and gene symbols which appear in any KEGG pathway.
Including undef in an ARRAY has the opposite meaning: it lets the output contain rows for which the filter type is NULL. For example,
$babel->translate(input_idtype=>'gene_entrez', filters=>{pathway_kegg_id=>[undef,4610]}, output_idtypes=>[qw(gene_symbol)])
generates a table of all Entrez Gene ids and gene symbols which either appear in KEGG pathway 4610 or appear in no KEGG pathway.
It may seem strange for undef to have opposite meanings depending on context, but it is "the right thing" in practice.
An empty SQL fragment, ie, \"", means FALSE. If that's the only condition for a given type, the result will be empty. If there are other conditions, eg, we have an ARRAY of conditions, the empty SQL fragment has no effect, because an ARRAY represents the OR of its elements and ORing FALSE to anything is a nop.
Each idtype=>condition pair generates a Data::Babel::Filter object. See "Details on conditions". In brief, a condition can be
a single id, eg, chip_affy=>'hgu133a'
a single SQL fragment, eg, chip_affy=>\"LIKE 'hgu133a'". The '\' before the first quote generates a reference to the string, which is what tells the software you want a SQL fragment instead of an id.
Data::Babel::Filter object. Not terribly useful in practice but included for completeness.
undef, eg, pathway_kegg_id=>undef. This means that all ids of the filter idtype are acceptable and only excludes rows for which the filter idtype is NULL.
ARRAY of the above. The general effect is to OR the elements of the ARRAY. The exception is undef: in an ARRAY, undef means that NULL is acceptable. For example
chip_affy=>[\"LIKE 'hgu%'",'mgu74a']
means
chip_affy LIKE 'hgu%' OR chip_affy = 'mgu74a'
and
pathway_kegg_id=>[undef,4610]
pathway_kegg_id IS NOT NULL OR pathway_kegg_id = 4610
The "filters" argument as a whole generally contains multiple idtype=>condition pairs. Each generates a Data::Babel::Filter object and the ensemble generates an ARRAY of these objects. The semantics is to AND these together. For example,
chip_affy=>[\"LIKE 'hgu%'",'mgu74a'], pathway_kegg_id=>[undef,4610]
(chip_affy LIKE 'hgu%' OR chip_affy = 'mgu74a') AND (pathway_kegg_id IS NOT NULL OR pathway_kegg_id = 4610)
In succinct terms: we OR the conditions for each individual idtype and AND the conditions across different idtypes.
Filter conditions can contain arbitrarily complex SQL fragments, although we expect most cases to be simple. Simple cases, like the examples above, use the filter's idtype in a single condition. The next step in complexity is to use the idtype in multiple conditions. Here's an example.
chip_affy=>\"LIKE 'hgu%' AND : != 'hgu133b'"
The ':' after the 'AND' is called an embedded idtype marker and tells the code to insert the filter idtype at that point in the query. The ':' is optional at the beginning of the SQL fragment, and we've omitted it in all examples so far. It is also possible to spell out the idtype name after the ':' as in this example
chip_affy=>\"LIKE 'hgu%' AND :chip_affy != 'hgu133b'"
The next jump in complexity is to use multiple idtypes in the same SQL fragment. For example,
chip_affy=>\"(:chip_affy LIKE 'hgu%' AND :chip_affy != 'hgu133b') AND (:pathway_kegg_id IS NOT NULL OR :pathway_kegg_id = 4610)"
Looking at this example, you might wonder why we need to bother with the filter idtype ('chip_affy=>'). Indeed, once you decide to spell out the idtypes in the SQL, there is no need to specify the filter idtype. This leads to two special cases:
''=>\"(:chip_affy LIKE 'hgu%' AND :chip_affy != 'hgu133b') AND (:pathway_kegg_id IS NOT NULL OR :pathway_kegg_id = 4610)"
Bear in mind that you can only have one of these! Also remember that using 'undef' as the key may not work as expected, because Perl automatically quotes the word on the left hand side of the '=>' operator.
filters=>"(:chip_affy LIKE 'hgu%' AND :chip_affy != 'hgu133b') AND (:pathway_kegg_id IS NOT NULL OR :pathway_kegg_id = 4610)"
An empty string or a reference to an empty string means "no filter". This is the same as any other "empty" argument to "filters". Bear in mind that an empty SQL fragment, ie, \"", in any other context means FALSE.
It should be clear that all the filter syntax we've presented up to this point is mere sugar coating for this case. If you're happy writing this sort of SQL, you can skip the rest.
CAUTION: We do not parse the SQL! Syntax errors will be caught by the DBMS and will generate error messages that may not be entirely intuitive. Sorry. If you include an idtype without the ':' mark, we won't see it and may not generate enough joins to connect the idtype to the rest of the query.
The "filters" argument can be
HASH with elements of the form
idtype=>conditions ''=>conditions
For example,
filters=>{ chip_affy=>['hgu133a','hgu133plus2',\"LIKE 'mgu%'","!= 'hgu133b'], pathway_kegg_id=>[4610,undef], ''=>\":gene_symbol LIKE 'casp%' OR :gene_description LIKE '%apoptosis%'"}
It is probably an error for the HASH to contain duplicate idtypes, because the last one is the only one that sticks.
ARRAY with elements of the form
filters=>[ chip_affy=>['hgu133a','hgu133plus2'], chip_affy=>\"LIKE 'mgu%'", chip_affy=>["!= 'hgu133b'], pathway_kegg_id=>[4610,undef], ''=>\":gene_symbol LIKE 'casp%' OR :gene_description LIKE '%apoptosis%'"]
It is fine for the ARRAY to contain duplicate idtypes, in which case their conditions are merged.
string or reference to string, for example
"translate" automatically applies histories, when they exist, to input and filter ids. In other words, input and filter ids can be ones that were valid in the past but are not valid now. Output ids, however, are always current.
CAUTION: If the input type is also used as an output, the result can contain rows in which the output id does not equal the input id. This will occur if the input id is old and is mapped to a different current value. Likewise, if a filter type is used as an output, the result can contain rows in which the output id does not match the filter.
Title : count Usage : $number=$babel->count (input_idtype=>'gene_entrez', input_ids=>[1,2,3], filters=>{chip_affy=>'hgu133a'}, output_idtypes=>[qw(transcript_refseq transcript_ensembl)]) Function: Count number of output rows that would be generated by "translate" Returns : number Args : same as "translate"
"count" is a wrapper for "translate" that sets the "count" argument to a true value.
Title : validate Usage : $table=$babel->validate (input_idtype=>'gene_entrez', input_ids=>[1,2,3]) Function: Tell which input ids are valid now or in the past, and the mapping from old to current values Returns : table represented as an ARRAY of ARRAYS. Each inner ARRAY is one row of the result. If output_idtypes is omiited (the usual case), the elements of each row are 0) input id as given 1) validity status. 1 for valid; 0 for invalid 2) current value of the id or undef if it has no current value; may be the same as the original id If output_idtypes is set, the result is ther same as "translate" with the "validate" option set Args : same as "translate"
"validate" looks up the given input ids in the Master tables for the given input type and returns a table indicating which ids are valid. For types with history information, the method also indicates the current value of the id. For types that have no history, the current value will always equal the given id if the id is valid.
"validate" can also retrieve a complete table of valid ids (along with history information) for the type.
"validate" is a wrapper for "translate" that (1) sets the "validate" argument to a true value and (2) sets the output_idtypes argument to the input_idtype unless the user explicitly set it. All other "translate" arguments (filters, count) are legal here and work but are of dubious value.
For rows whose validity status is 1 (valid), the given id and current value indicate the history: if the elements are equal, the given id is current; else if the current value is defined, the given id has been replaced by the new one; else the given id was valid in the past but has no current value.
For types that have no history, all valid ids are current. If the given id is valid, the given id and current value will be equal; else the current value will be undef.
For rows whose status is 0 (invalid), the current value will always be undef.
The "translate" arguments 'filters' and "count" are legal here and work but are of dubious value.
Title : show Usage : $babel->show Function: Print object in readable form Returns : nothing useful Args : none
Title : show_schema_graph Usage : $babel->show_schema_graph('schema.sif','sif') Function: Emit schema graph in text or sif format Returns : nothing useful Args : file output filename. default: standard out format 'sif' or 'txt'. default: 'sif'
The current version is a minimal implementation, little more than a proof of concept.
At present, a website that can display the sif output is http://cytoscapeweb.cytoscape.org/demo.
Title : check_schema Usage : @errstrs=$babel->check_schema -- OR -- $ok=$babel->check_schema Function: Validate schema. Presently checks that schema graph is tree and all IdTypes contained in some MapTable Returns : in array context, list of errors in scalar context, true if schema is good, false if schema is bad Args : none
Title : check_contents Usage : @errstrs=$babel->check_contents -- OR -- $ok=$babel->check_contents Function: Validate contents of Babel database. Checks consistency of explicit Masters and MapTables Returns : boolean Args : none
Title : load_implicit_masters Usage : $babel->load_implicit_masters Function: Creates database structures for implicit Masters. Returns : nothing useful Args : none
Babel creates 'implicit' Masters for any IdTypes lacking explicit ones. An implicit Master has a list of valid identifiers and could be implemented as a view over all MapTables containing the IdType. In the current implementation, we use views for IdTypes contained in single MapTables but construct actual tables for IdTypes contained in multiple MapTables.
This method must be called after the real database is loaded.
Objects have names and ids: names are strings like 'gene_entrez' and are unique for a given class of object; ids have a short form of the type prepended to the name, eg, 'idtype:gene_entrez', and are unique across all classes. We use ids as nodes in schema and query graphs. In most cases, applications should should use names.
The methods in this section map names or ids to component objects, or (as a trivial convenience), convert ids to names.
Title : name2idtype Usage : $idtype=$babel->name2idtype('gene_entrez') Function: Get the IdType object given its name Returns : Data::Babel::IdType object or undef Args : name of object Notes : only looks at this Babel's component objects
Title : name2master Usage : $master=$babel->name2master('gene_entrez_master') Function: Get the Master object given its name Returns : Data::Babel::Master object or undef Args : name of object Notes : only looks at this Babel's component objects
Title : name2maptable Usage : $maptable=$babel->name2maptable('maptable_012') Function: Get the MapTable object given its name Returns : Data::Babel::MapTable object or undef Args : name of object Notes : only looks at this Babel's component objects
Title : id2object Usage : $object=$babel->id2object('idtype:gene_entrez') Function: Get object given its id Returns : Data::Babel::IdType, Data::Babel::Master, Data::Babel::MapTable object or undef Args : id of object Notes : only looks at this Babel's component objects
Title : id2name Usage : $name=$babel->id2name('idtype:gene_entrez') -- OR -- $name=Data::Babel->id2name('idtype:gene_entrez') Function: Convert object id to name Returns : string Args : id of object Notes : trival convenience method
A Data::Babel::IdType object represents a type of identifier.
Title : new Usage : $idtype=new Data::Babel::IdType name=>$name,... Function: Create new Data::Babel::IdType object or fetch existing object from database and update its components. Store the new or updated object. Returns : Data::Babel::IdType object Args : any attributes listed in the attributes section below, except 'id' (because it is computed from name) old existing Data::Babel object in case program already fetched it (typically via 'old') autodb Class::AutoDB object for database containing Babel. class method often set before running "new" Notes : 'name' is required. All other args are optional
Title : old Usage : $idtype=old Data::Babel::IdType($name) -- OR -- $babel=old Data::Babel::IdType(name=>$name) Function: Fetch existing Data::Babel::IdType object from database Returns : Data::Babel::IdType object or undef Args : name of Data::Babel::IdType object, eg, 'gene_entrez' if keyword form used, can also specify autodb to set the corresponding class attribute
Title : degree Usage : $number=$idtype->degree Function: Tell how many Data::Babel::MapTables contain this IdType Returns : number Args : none
name eg, 'gene_entrez' id name prefixed with 'idtype', eg, 'idtype:::gene_entrez' master Data::Babel::Master object for this IdType maptables ARRAY of Data::Babel::MapTable objects containing this IdType external boolean indicating whether this is a regular external ID or one intended for internal use internal opposite of external history boolean indicating whether this IdType's Master contains history information tablename name of this IdType's Master's table display_name human readable name, eg, 'Entrez Gene ID'; for internal identifiers, a warning is appended to the end referent the type of things to which this type of identifier refers defdb the database, if any, which assigns identifiers meta meta-type: eid (meaning synthetic), symbol, name, description format Perl format of valid identifiers, eg, /^\d+$/ perl_format synonym for format sql_type SQL data type, eg, INT(11)
A Data::Babel::Master object represents the database table underlying a Data::Babel::IdType. The table contains
Title : new Usage : $master=new Data::Babel::Master name=>$name,idtype=>$idtype,... Function: Create new Data::Babel::Master object or fetch existing object from database and update its components. Store the new or updated object. Returns : Data::Babel::Master object Args : any attributes listed in the attributes section below, except 'id' (because it is computed from name) old existing Data::Babel object in case program already fetched it (typically via 'old') autodb Class::AutoDB object for database containing Babel. class method often set before running "new" Notes : 'name' is required. All other args are optional
Title : old Usage : $master=old Data::Babel::Master($name) -- OR -- $babel=old Data::Babel::Master(name=>$name) Function: Fetch existing Data::Babel::Master object from database Returns : Data::Babel::Master object or undef Args : name of Data::Babel::Master object, eg, 'gene_entrez' if keyword form used, can also specify autodb to set the corresponding class attribute
Title : degree Usage : $number=$master->degree Function: Tell how many Data::Babel::MapTables contain this Master's IdType Returns : number Args : none
name eg, 'gene_entrez_master' id name prefixed with 'master::', eg, 'master:::gene_entrez_master' idtype Data::Babel::IdType object for which this is the Master implicit boolean indicating whether Master is implicit explicit opposite of implicit view boolean indicating whether Master is implemented as a view history boolean indicating whether Master contains history information. tablename synonym for name inputs, namespace, query DEPRECATED - intended for use by our database construction procedure but not actually used
Data::Babel::MapTable objects represent the database tables that implement the mapping.
Title : new Usage : $maptable=new Data::Babel::MapTable name=>$name,idtypes=>$idtypes,... Function: Create new Data::Babel::MapTable object or fetch existing object from database and update its components. Store the new or updated object. Returns : Data::Babel::MapTable object Args : any attributes listed in the attributes section below, except 'id' (because it is computed from name) old existing Data::Babel object in case program already fetched it (typically via 'old') autodb Class::AutoDB object for database containing Babel. class method often set before running "new" Notes : 'name' is required. All other args are optional
Title : old Usage : $maptable=old Data::Babel::MapTable($name) -- OR -- $babel=old Data::Babel::MapTable(name=>$name) Function: Fetch existing Data::Babel::MapTable object from database Returns : Data::Babel::MapTable object or undef Args : name of Data::Babel::MapTable object, eg, 'gene_entrez' if keyword form used, can also specify autodb to set the corresponding class attribute
name eg, 'gene_entrez_master' id name prefixed with 'maptable', eg, 'maptable:::gene_entrez_master' idtypes ARRAY of Data::Babel::IdType objects contained by this MapTable inputs, namespace, query DEPRECATED - intended for use by our database construction procedure but not actually used
A Data::Babel::Filter object represents a condition limiting the output of a Data::Babel "translate", "validate", or "count" query. (Hereafter, we will refer only to "translate", but everything applies to the other methods, too). In typical usage, code in Data:Babel generates Filter objects automatically based on the "filters" argument to "translate". Application code rarely needs to create these objects directly.
Recall that the "filters" argument to "translate" typically consists of idtype=>conditions pairs. The code generates a Filter object for each of these pairs.
A mature Data::Babel::Filter object has two attributes of interest:
Title : new Usage : $filter=new Data::Babel::Filter babel=>$babel, filter_idtype=>'gene_symbol', conditions=>["Htt",\"LIKE 'casp%'"] Function: Create new Data::Babel::Filter object representing the coditions. Returns : Data::Babel::Filter object Args : babel Data::Babel object connected to database condtions see below filter_idtype default IdType for conditions. Can be name of Data::Babel::IdType object or object. filter_idtypes ARRAY of IdTypes used by conditions. Can be names of Data::Babel::IdType objects or objects allow_sql boolean. Controls whether it is legal to pass SQL fragments to 'new'. default: true embedded_idtype_marker perfix marking an embedded IdType name. default: ':' treat_string_as indicator telling how to interpret strings in "conditions" Choices: 'id', 'sql', undef (restores default). default: 'id' treat_stringref_as indicator telling how to interpret references to strings in "conditions" Choices: 'sql', 'id', undef (restores default). default: 'sql' prepend_idtype indicator telling when to prepend the "filter_idtype" arg to SQL fragments contained in "conditions". Choices: 'auto', other true value, any false value. See below. default: 'auto' Notes : babel and conditons required. filter_idtype also usually specified. Other args rarely used.
All arguments to "new" plus
sql the SQL expression generated by "new"
In the mature object - ie, upon completion of "new" - the important attributes are "sql" and "filter_idtypes". At this point, the attributes will contain the following:
In typical usage, "conditions" is the right hand side of an "idtype=>conditions" pair in the "filters" argument to "translate". It can also be the complete value of the "filters" argument when it is set to a string or string reference. "conditions' can contain ids that are combined for use in SQL IN clauses or fragments of actual SQL.
SQL fragments may contain 'embedded IdTypes'. These are IdTypes names prefixed by a marker, typically ':', for example ':gene_symbol' (without the quotes!!). An example of such a fragment is
:gene_symbol LIKE 'casp%' OR :gene_symbol = 'Htt'
If "filter_idtype" is set (it usually is), you can use the marker without the name to denote the "filter_idtype". For example, if "filter_idtye" is "gene_symbol", you could write the previous SQL fragment as
: LIKE 'casp%' OR : = 'Htt'
By default, we treat string conditions as ids, and references to strings as SQL fragments. You can change this via the "treat_string_as" and "treat_stringref_as" arguments.
By default, we prepend the "filter_idtype" argument to SQL fragments unless an embedded IdType (with or without a name) is the first thing in the fragment. For example, if "filter_idtye" is "gene_symbol", you can express the SQL clause
gene_symbol LIKE 'casp%'
with any of these fragments
LIKE 'casp%' : LIKE 'casp%' :gene_symbol LIKE 'casp%'
You can change the prepending behavior via the "prepend_idtype" argument. If set to any true value except 'auto' we always prepend, and if false we never prepend.
The "conditions" argument can contain arbitrarily complex SQL, but we expect most cases to be simple. Simple cases, like the example above, use the "filter_idtype" in a single condition. In such cases, you don't need to use embedded IdTypes. You can also express ORs of simple conditions without resorting to embedded IdTypes by putting the conditions in an ARRAY; see "Details on conditions". For example, if "filter_idtye" is "gene_symbol", you can express the SQL clause
gene_symbol LIKE 'casp%' OR gene_symbol LIKE 'il%'
with this Perl construct
[\"LIKE 'casp%'", \"LIKE 'il%'"]
The "conditions" argument may be one or an ARRAY of the following. An ARRAY represents the OR of its elements.
string. Id or SQL fragment depending on the value of the "treat_string_as" argument. The default is "id".
reference to string. SQL fragment or id depending on the value of the "treat_stringref_as" argument. The default is "SQL".
Data::Babel::Filter object. This has little utility by itself, but in an ARRAY it causes the old object to be ORed onto the rest of the conditions.
undef. When used standalone, it is equivalent to the SQL fragment "IS NOT NULL". In typical cases this means that all ids of "filter_idtype" are acceptable, similar to what it means for "input_ids" to be undef in "translate". In an ARRAY it has the opposite meaning, which in typical usage lets the output contain rows for which "filter_idtype" is NULL. It may seem strange for undef to have opposite meanings depending on context, but is natural in practice.
If "conditions" contains multiple ids, we combine them into a single SQL IN clause. For example, if "filter_idtye" is "gene_symbol" and "conditions" is ['Htt','Casp6','Ins2'], we generate
gene_symbol IN ('Htt','Casp6','Ins2')
The "conditions" argument can be "empty" in several ways.
empty id. This is completely normal and generates SQL to match an empty string.
empty SQL fragment, typically encoded as a reference to an empty string, ie, \''. This generates the SQL FALSE. When used in an ARRAY, this has no effect, because an ARRAY represents the OR of its elements and ORing FALSE to anything is a nop.
empty ARRAY. Generates FALSE.
To process complex SQL conditions, we need to identify the IdTypes used by the condition for two purposes. (1) "translate" needs these to to find the MapTables it must join to connect the IdTypes, and (2) for IdTypes with histories, we have to prepend the IdType name with '_X_' whenever the IdType is compared to a constant so that the history mapping will be applied.
To do this without embedded IdType markers, we would need to find or develop a SQL parser that creates a parse tree that we can examine to find the IdTypes, modify to handle histories, and convert back to SQL after being modified. Because SQL parsing is technically challenging, maintainability is a crucial concern: it would be very unpleasant to incorporate a module that works for our purposes today but ceases to do so in a future release.
We investigated several CPAN modules that do SQL parsing.
SQL::Statement seems to be under active development. The SQL dialect it supports is incomplete but probably adequate for our needs. The parse tree it produces is easy to work with but is not documented and presumably might change in future releases. The showstopper is that it has no method for converting the parse tree back to SQL.
DBIx::MyParsePP does most of what we want but hasn't been updated in years, which raises worries about long term maintainability. It implements the MySQL 5.0 SQL dialect, which is fine for our purposes. It is slow to load because the grammar is big,
DBIx::MyParse is a C implementation of DBIx::MyParsePP. We didn't test it, because it requires access to MySQL source, which you have to patch (!!). It seems unlikely that this could be incorporated into the normal CPAN installation process.
SQL::Abstract::Parser looks pretty good, but the documentation cautions against relying on the structure of the parse tree at this point. This might offer a future solution when the developers declare the parse tree format to be stable.
We considered developing our own parser for a mini-SQL dialect limited to clauses separated by AND, OR, and NOT. Even this is hard because of SQL constructs like 'BETWEEN n1 AND n2'.
Our conclusion is that while it is inelegant to require embedded IdType markers, this is the only practical solution at present.
I'm not aware of anything.
Nat Goodman, <natg at shore.net>
<natg at shore.net>
Please report any bugs or feature requests to bug-data-babel at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Data-Babel. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
bug-data-babel at rt.cpan.org
You can find documentation for this module with the perldoc command.
perldoc Data::Babel
You can also look for information at:
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Data-Babel
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Data-Babel
CPAN Ratings
http://cpanratings.perl.org/d/Data-Babel
Search CPAN
http://search.cpan.org/dist/Data-Babel/
This module extends a version developed by Victor Cassen.
Copyright 2012 Institute for Systems Biology
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
To install Data::Babel, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Data::Babel
CPAN shell
perl -MCPAN -e shell install Data::Babel
For more information on module installation, please visit the detailed CPAN module installation guide.