umls-allwords-senserelate.pl - This program performs all-words word sense disambiguation and assigns senses from the UMLS to each ambiguous term in a running text using semantic similarity or relatedness measures from the UMLS::Similarity package.
Usage: umls-allwords-senserelate.pl [OPTIONS] INPUTFILE
The output files will be stored in the directory "log" or the directory defined by the --log option.
The input file must be in all-words XML format, selected with the --awxml option (which is also the default).
The input format is all-words XML, similar to the format used in the SemEval all-words disambiguation task. This is the default format.
In this format each line of the text file contains a single word, and the words to be disambiguated are identified by:
And the context is encapsulated in text tags <text id="id"> ... </text>
<text id="d000"> That <head id="d000.s000.t001">is</head> what the <head id="d000.s000.t004">man</head> had <head id="d000.s000.t006">said</head> . Haney <head id="d000.s001.t001">peered</head> at his <head id="d000.s001.t005">drinking</head> <head id="d000.s001.t006">companion</head> doubtfully . </text>
Please note that the following id format is required:
d[0-9]+ refers to the document id
s[0-9]+ refers to the sentence number in the document
t[0-9]+ refers to the term number in the sentence
The padding of zeros is optional.
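The id format above can be checked with a short regular expression. The following is a minimal sketch in Python, not part of the package; the function name parse_head_id is hypothetical, and the "." separator between the three parts is taken from the example ids shown above.

```python
import re

# Ids look like "d000.s001.t006"; zero padding is optional, so "d0.s1.t6" is also accepted.
ID_PATTERN = re.compile(r"^d([0-9]+)\.s([0-9]+)\.t([0-9]+)$")

def parse_head_id(head_id):
    """Return (document, sentence, term) as integers, or None if the id is malformed."""
    match = ID_PATTERN.match(head_id)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())

if __name__ == "__main__":
    print(parse_head_id("d000.s001.t006"))
    print(parse_head_id("d1.s2.t3"))
```

Ids that do not follow the pattern return None, which makes it easy to report malformed input before running the disambiguation.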
Directory in which the output files will be stored. Default: log
Use the compounds in the input text. For the plain and sval2 format these are indicated by an underscore. For example:
Stores the gold standard information in the <target word>.key file to be used in the evaluation programs. This file is stored in the log directory.
To use this option, the input text must contain the key information in the following format:
<head id="id" sense="sense">target word</head>
<head id="d000.s001.t006" sense="C0000000">companion</head>
The candidate information is embedded in the inputfile in the following format:
<head id="id" candidates="sense1,sense2,sense3">target word</head>
<head id="d001.s001.t001" candidates="C1280500,C2348382">effect</head>
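To illustrate how the sense and candidates attributes sit inside the head tags, here is a minimal Python sketch that pulls them out of a line of input. It is an illustration only, not the parser used by the program; the function name extract_heads and the returned dictionary layout are assumptions.

```python
import re

# Matches <head ...>word</head> and captures the attribute text and the word.
HEAD_TAG = re.compile(r"<head\s+([^>]*)>(.*?)</head>")
# Matches name="value" pairs inside the attribute text.
ATTRIBUTE = re.compile(r'(\w+)="([^"]*)"')

def extract_heads(text):
    """Return one dictionary per <head> tag with its id, word, sense and candidates."""
    heads = []
    for attr_text, word in HEAD_TAG.findall(text):
        attrs = dict(ATTRIBUTE.findall(attr_text))
        candidates = attrs.get("candidates")
        heads.append({
            "id": attrs.get("id"),
            "word": word,
            "sense": attrs.get("sense"),  # None when no key information is given
            "candidates": candidates.split(",") if candidates else [],
        })
    return heads

if __name__ == "__main__":
    line = '<head id="d001.s001.t001" candidates="C1280500,C2348382">effect</head>'
    print(extract_heads(line))
```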
The window in which to obtain the context surrounding the ambiguous term.
Use the MEASURE module to calculate the semantic similarity. The available measures are:
1. Leacock and Chodorow (1998), referred to as lch
2. Wu and Palmer (1994), referred to as wup
3. The basic path measure, referred to as path
4. Rada et al. (1989), referred to as cdist
5. Nguyen and Al-Mubaid (2006), referred to as nam
6. Resnik (1996), referred to as res
7. Lin (1998), referred to as lin
8. Jiang and Conrath (1997), referred to as jcn
9. The vector measure, referred to as vector
Weight the scores based on the distance the content term is from the target word. This option can currently only be used with the --window option.
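The text does not say how the window is applied or how the distance weighting is computed, so the following Python sketch is only one plausible reading. It assumes a window of N tokens on each side of the target and an inverse-distance weight (score divided by distance); both choices, and the function names, are assumptions made for illustration.

```python
def context_window(tokens, target_index, window):
    """Return (distance, token) pairs for up to `window` tokens on each side of the target."""
    start = max(0, target_index - window)
    end = min(len(tokens), target_index + window + 1)
    return [(abs(i - target_index), tokens[i])
            for i in range(start, end) if i != target_index]

def weighted_score(scores_by_token, context):
    """Sum the per-token scores, dividing each by its distance from the target (assumed scheme)."""
    return sum(scores_by_token.get(token, 0.0) / distance
               for distance, token in context)

if __name__ == "__main__":
    tokens = ["Haney", "peered", "at", "his", "drinking", "companion", "doubtfully"]
    context = context_window(tokens, 5, 2)
    print(context)
    # The similarity scores below are made-up placeholder values.
    print(weighted_score({"his": 1.0, "drinking": 0.5}, context))
```

With this scheme, a context term right next to the target contributes its full score, while a term two positions away contributes half.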
A file containing a list of words to be excluded. This is used in the UMLS::SenseRelate::TargetWord module as well as the vector and lesk measures in the UMLS::Similarity package. The required format is one stopword per line, with each word written as a regular expression. For example:
/\b[a-zA-Z]\b/
/\b[aA]board\b/
/\b[aA]bout\b/
/\b[aA]bove\b/
/\b[aA]cross\b/
/\b[aA]fter\b/
/\b[aA]gain\b/
The sample file, stoplist-nsp.regex, is under the samples directory. We might change this to require two different stoplists in the future; one for the senserelate program and the other for the relatedness measures.
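As an illustration of the stoplist format, the Python sketch below reads one /regex/ entry per line and filters a token list against the compiled patterns. It is not the package's own loader; the function names are hypothetical, and it assumes the expressions use only syntax shared by Perl and Python regular expressions, as the sample entries do.

```python
import re

def load_stoplist(lines):
    """Compile one regular expression per non-empty line, dropping the / delimiters."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if len(line) >= 2 and line.startswith("/") and line.endswith("/"):
            line = line[1:-1]
        patterns.append(re.compile(line))
    return patterns

def remove_stopwords(tokens, patterns):
    """Keep only the tokens that match none of the stoplist patterns."""
    return [token for token in tokens
            if not any(pattern.search(token) for pattern in patterns)]

if __name__ == "__main__":
    stoplist = load_stoplist([r"/\b[a-zA-Z]\b/", r"/\b[aA]bout\b/"])
    print(remove_stopwords(["About", "a", "heart", "attack"], stoplist))
```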
This stores the trace information in FILE for debugging purposes.
Displays the version information.
Displays the help information.
This is the configuration file. There are six configuration options that can be used depending on which measure you are using. The path, wup, lch, lin, jcn and res measures require the SAB and REL options to be set while the vector and lesk measures require the SABDEF and RELDEF options.
The SAB and REL options are used to determine which sources and relations the path information is to be obtained from. The format of the configuration file is as follows:
SAB :: <include|exclude> <source1, source2, ... sourceN>
REL :: <include|exclude> <relation1, relation2, ... relationN>
For example, if we wanted to use the MSH vocabulary with only the RB/RN relations, the configuration file would be:
SAB :: include MSH
REL :: include RB, RN

SAB :: include MSH
REL :: exclude PAR, CHD
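The configuration format described above (an option name, "::", then include or exclude, then a comma-separated list) can be read with a few lines of code. This Python sketch is illustrative only and is not how UMLS::Interface reads the file; the function name parse_config and the returned structure are assumptions.

```python
def parse_config(lines):
    """Parse 'NAME :: include|exclude item1, item2' lines into a dictionary."""
    config = {}
    for line in lines:
        line = line.strip()
        if not line or "::" not in line:
            continue
        key, value = (part.strip() for part in line.split("::", 1))
        parts = value.split(None, 1)
        if not parts or parts[0] not in ("include", "exclude"):
            raise ValueError("expected include or exclude in: " + line)
        items = parts[1] if len(parts) > 1 else ""
        config[key] = {
            "mode": parts[0],
            "items": [item.strip() for item in items.split(",") if item.strip()],
        }
    return config

if __name__ == "__main__":
    print(parse_config(["SAB :: include MSH", "REL :: include RB, RN"]))
```

The same function handles SABDEF and RELDEF lines, since they follow the same layout.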
The SABDEF and RELDEF options are used to determine the sources and relations the extended definition is to be obtained from. We call the definition used by the measure the extended definition because it may include definitions from related concepts.
The format of the configuration file is as follows:
SABDEF :: <include|exclude> <source1, source2, ... sourceN>
RELDEF :: <include|exclude> <relation1, relation2, ... relationN>
The possible relations that can be included in RELDEF are:
1. all of the possible relations in MRREL such as PAR, CHD, ...
2. CUI, which refers to the concept's definition
3. ST, which refers to the definition of the concept's semantic types
4. TERM, which refers to the concept's associated terms
For example, if we wanted to use the definitions from the MSH vocabulary and we only wanted the definition of the CUI and the definitions of the CUI's SIB relations, the configuration file would be:
SABDEF :: include MSH
RELDEF :: include CUI, SIB
Note: RELDEF takes any of MRREL relations and two special 'relations':
1. CUI, which refers to the CUI's definition
2. TERM, which refers to the terms associated with the CUI
If you go to the configuration file directory, there will be example configuration files for the different runs that you have performed.
For more information about the configuration options (including the RELA and RELADEF options) please see the README.
With this option, the program does not create a database of the path information for all of the concepts in the set of sources and relations specified in the config file; it obtains the information only for the input concept.
This option will bypass any command prompts such as asking if you would like to continue with the index creation.
Sets the UMLS-Interface debug flag on for testing.
Username required to access the UMLS database in MySQL.
Password required to access the UMLS database in MySQL.
Hostname where MySQL is located. DEFAULT: localhost
Database containing the UMLS. DEFAULT: umls
FILE containing the propagation counts of the CUIs. This file must be in the following format:
where probability is the probability of the concept occurring.
See create-icpropagation.pl for more information.
This is the matrix file that contains the vector information to use with the vector measure.
If you do not want to use the default, you can generate this file with the vector-input.pl program. An example of this file can be found in the samples/ directory and is called matrix.
This is the index file that contains the vector information to use with the vector measure.
If you do not want to use the default, you can generate this file with the vector-input.pl program. An example of this file can be found in the samples/ directory and is called index.
This prints the vector information to file, FILE, for debugging purposes.
A file containing a list of words to be excluded from the vector measure calculation. This is the same format as the --stopword option.
--leskstoplist FILE
A file containing a list of words to be excluded from the lesk measure calculation. This is the same format as the --stopword option.
This is a dictionary file for the vector or lesk measure. It contains the 'definitions' of a concept or term, which are used instead of the definitions from the UMLS. If you would like to use the dictfile to augment the UMLS definitions, use the --config option in conjunction with the --dictfile option.
The expected format for the --dictfile file is:
CUI: <definition>
CUI: <definition>
TERM: <definition>
TERM: <definition>
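To make the layout concrete, the Python sketch below reads lines of this form into a dictionary keyed by CUI or term. It is an illustration, not the package's reader; the function name parse_dictfile is hypothetical, the sample entries (including the CUI C0000001) are placeholders, and the sketch assumes that the first colon on a line separates the key from the definition.

```python
def parse_dictfile(lines):
    """Map each CUI or term to its definition, splitting at the first colon."""
    definitions = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue
        key, definition = line.split(":", 1)
        definitions[key.strip()] = definition.strip()
    return definitions

if __name__ == "__main__":
    # Placeholder entries for illustration; not taken from samples/dictfile.
    sample = [
        "C0000001: the distal part of the arm",
        "foot: the part of the leg below the ankle",
    ]
    print(parse_dictfile(sample))
```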
There are three different option configurations that you have with the --dictfile.
1. No --dictfile - which will use the UMLS definitions
umls-allwords-senserelate.pl --measure lesk hand foot
2. --dictfile - which will just use the dictfile definitions
umls-allwords-senserelate.pl --measure lesk --dictfile samples/dictfile hand foot
3. --dictfile + --config - which will use both the UMLS and dictfile definitions
umls-allwords-senserelate.pl --measure lesk --dictfile samples/dictfile --config configuration hand foot
Keep in mind that when using this file with the --config option, if one of the CUIs or terms that you are obtaining the similarity for does not exist in the file, its vector will be empty, which will lead to strange similarity scores.
An example of this file can be found in the samples/ directory and is called dictfile.
This is a flag for the vector measures. By default, the definitions used are 'cleaned'. If the --defraw flag is set, they will not be cleaned.
This is a flag for the vector and lesk method. If the --stem flag is set, definition words are stemmed using the Lingua::Stem::En module.
If you have any trouble installing and using UMLS-Similarity, please contact us via the users mailing list: firstname.lastname@example.org
You can join this group by going to: http://tech.groups.yahoo.com/group/umls-similarity/
You may also contact us directly if you prefer:
Bridget T. McInnes: bthomson at umn.edu
Ted Pedersen: tpederse at d.umn.edu
Bridget T. McInnes, University of Minnesota
Copyright (c) 2010-2012
Bridget T. McInnes, University of Minnesota Twin Cities bthomson at umn.edu
Ted Pedersen, University of Minnesota Duluth tpederse at d.umn.edu
Serguei Pakhomov, University of Minnesota Twin Cities pakh0002 at umn.edu
Ying Liu, University of Minnesota Twin Cities liux0395 at umn.edu
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to:
The Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.