NAME
    CUICollectorMapReduce README

  SYNOPSIS
      CUICollectorMapReduce is a collection of three Java classes designed 
      to parse MetaMap output files to identify CUI bigrams for use by
      the UMLS::Association package.  The Hadoop implementation improves 
      upon the Perl CUICollector.pl module through parallelization of 
      counting CUI bigram frequencies, removal of memory constraints, 
      and identification of CUI bigrams that span utterances having the 
      same PubMed ID.

      The three Java classes are ArticleCollector.java, ArticleSplitter.java,
      and CUICollector.java.
      ArticleCollector parses MetaMap files and concatenates all 
      utterances that have the same PubMed ID into a single record.  
      This single record is then processed by CUICollector.java when in 
      "article" mode. CUICollector.java has two modes: "cui" and "article".
  
      In "cui" mode the CUICollector class parses MetaMap output 
      directly and reproduces the original output of the Perl 
      CUICollector.pl algorithm by identifying CUI bigrams in each 
      individual utterance.  In "article" mode the CUICollector requires 
      that ArticleCollector.java be run first, then it parses the output 
      of the ArticleCollector.  
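
      As a rough illustration of the record concatenation described above,
      the sketch below groups utterances by PubMed ID in memory. It is not
      the ArticleCollector source: the Utterance class, the groupByPmid
      method, the PubMed IDs, and the CUI values are all hypothetical
      stand-ins, and the real classes run as Hadoop jobs over MetaMap
      output rather than over an in-memory list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ArticleGrouping {
    // Hypothetical stand-in for one parsed utterance:
    // a PubMed ID plus the CUIs found in that utterance, in order.
    static final class Utterance {
        final String pmid;
        final List<String> cuis;

        Utterance(String pmid, String... cuis) {
            this.pmid = pmid;
            this.cuis = Arrays.asList(cuis);
        }
    }

    // Concatenates the CUIs of all utterances that share a PubMed ID,
    // preserving the order in which the utterances were seen.
    static Map<String, List<String>> groupByPmid(List<Utterance> utterances) {
        Map<String, List<String>> articles = new LinkedHashMap<>();
        for (Utterance u : utterances) {
            articles.computeIfAbsent(u.pmid, k -> new ArrayList<>()).addAll(u.cuis);
        }
        return articles;
    }

    public static void main(String[] args) {
        // Made-up IDs and CUIs, for illustration only.
        List<Utterance> input = Arrays.asList(
            new Utterance("111", "C0000001", "C0000002"),
            new Utterance("111", "C0000003"),
            new Utterance("222", "C0000004"));

        // prints {111=[C0000001, C0000002, C0000003], 222=[C0000004]}
        System.out.println(groupByPmid(input));
    }
}
```

      In "cui" mode the two utterances for ID 111 would be handled
      separately, so no bigram could pair C0000002 with C0000003. After
      grouping, the two CUIs sit next to each other in one record, which
      is what lets "article" mode find bigrams that span utterances.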
  
      ArticleSplitter is meant to be run on MetaMap output once.  It
      does the same record processing as ArticleCollector, except
      instead of saving everything in one large file it saves each
      collection of utterances in one file per PubMed ID.  These small
      individual files can then be processed by CUICollector in article
      mode.  

      The following sections provide installation instructions and give
      examples of how to use CUICollectorMapReduce.

  REQUIREMENTS
    To install and run CUICollectorMapReduce you must have the following and
    all their dependencies installed:

      Java version 1.8.0

      Apache Hadoop 2.7.3 (install binary from http://hadoop.apache.org/releases.html)

      Apache Maven 3.3.9 (only required if compiling from source)

    A Hadoop cluster with an HDFS is not required to run this software as
    Hadoop can be installed on a single computer by following the directions
    for a Single Node Setup.

    To install Java, Maven, and the supporting packages on a Debian-based
    Linux system (Hadoop itself is downloaded separately), use:

      >> sudo apt-get install maven git ssh rsync default-jdk default-jre openssh-server

    Once Hadoop is installed, edit line 25 of the file
    hadoop-2.7.3/etc/hadoop/hadoop-env.sh to point to your default-java
    location (on Linux it is /usr/lib/jvm/default-java).

    Make sure to add the directory containing the Hadoop executable to
    your $PATH environment variable.

    This code has only been tested on a Linux platform.

  INSTALL
    Once downloaded, you can either use the JAR file directly or recompile.
    To recompile you must have Maven installed. Change to the source
    directory that contains the pom.xml file and run the following:

      >> mvn install

    This will compile the CUICollectorMapReduce class files and save the
    executable jar file under a new directory named "target".

  USAGE
    To see a list of required options and their descriptions, run the
    command without any options. For example:

      >> hadoop jar <path to jar file> Hadoop.CUICollectorMapReduce.ArticleCollector

   ArticleCollector
    To run the ArticleCollector class on MetaMap machine-readable output
    files (MMO files), use the following (>> indicates the command line
    prompt and should not be typed):

       >> hadoop jar <path to jar file> Hadoop.CUICollectorMapReduce.ArticleCollector -i <path to metamap directory OR individual file> -o <path and name of output directory>

    For example from within the Hadoop folder:

       >> hadoop jar ./target/CUICollectorMapReduce-0.0.1-SNAPSHOT.jar Hadoop.CUICollectorMapReduce.ArticleCollector -i ./metamap/ -o articleOut

   CUICollector
    To run CUICollector in "cui" mode, which processes MMO files directly,
    use the following:

       >> hadoop jar <path to jar file> Hadoop.CUICollectorMapReduce.CUICollector -i <path to metamap directory OR individual file> -o <path and name of output directory> -m cui -w <window step size>

    For example:

       >> hadoop jar ./target/CUICollectorMapReduce-0.0.1-SNAPSHOT.jar Hadoop.CUICollectorMapReduce.CUICollector -i ./metamap/ -o cuiOut -m cui -w 2

    To run CUICollector in "article" mode, point the input to the
    ArticleCollector output and replace "cui" with "article".

    For example:

       >> hadoop jar ./target/CUICollectorMapReduce-0.0.1-SNAPSHOT.jar Hadoop.CUICollectorMapReduce.CUICollector -i ./articleOut/part-r-00000 -o cuiOut -m article -w 2

   CUICollector window step size:
    This option is required. A step size of one (1) retrieves all
    consecutive CUI bigrams. A step size of two (2) retrieves all
    consecutive bigrams plus all bigrams that skip one position. For
    example, the CUI sequence "CUI1 CUI2 CUI3" returns the bigrams
    CUI1-CUI2 and CUI2-CUI3 for a step size of 1, and returns the
    additional bigram CUI1-CUI3 for a step size of 2.
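
    The step size rule can be sketched in a few lines of Java. This is
    only an illustration of the rule as described here, not the
    CUICollector source; the class name WindowBigrams, the method name
    bigrams, and the "-" separator between CUIs are assumptions made for
    the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WindowBigrams {
    // Pairs each CUI with the CUIs that follow it, up to `step` positions
    // ahead. step = 1 gives consecutive bigrams only; step = 2 also gives
    // the bigrams that skip one position.
    static List<String> bigrams(List<String> cuis, int step) {
        if (step < 1) {
            throw new IllegalArgumentException("step size must be at least 1");
        }
        List<String> out = new ArrayList<>();
        for (int i = 0; i < cuis.size(); i++) {
            for (int j = i + 1; j <= i + step && j < cuis.size(); j++) {
                out.add(cuis.get(i) + "-" + cuis.get(j));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> seq = Arrays.asList("CUI1", "CUI2", "CUI3");
        System.out.println(bigrams(seq, 1)); // prints [CUI1-CUI2, CUI2-CUI3]
        System.out.println(bigrams(seq, 2)); // prints [CUI1-CUI2, CUI1-CUI3, CUI2-CUI3]
    }
}
```

    The order of the bigrams in the list is a side effect of the loop
    order in this sketch and says nothing about the order in which
    CUICollector writes its output.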

  OUTPUT
    CUICollectorMapReduce outputs all CUI bigrams and frequencies into a
    flat text file; thus, installation of MySQL is not required for this
    part of the package. This text file can be loaded into a MySQL database
    for use with the UMLS::Association package.
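
    The frequency attached to each bigram is a count of how often that
    bigram was seen. A minimal in-memory sketch of that counting step
    follows; it is not the MapReduce code, and the class name
    BigramCounts, the method name count, and the tab-separated printing
    are assumptions for illustration, so the printed layout should not
    be read as the actual output file format.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BigramCounts {
    // Tallies how many times each bigram string occurs.
    // A TreeMap keeps the keys sorted so the result is deterministic.
    static Map<String, Integer> count(List<String> bigrams) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String bigram : bigrams) {
            freq.merge(bigram, 1, Integer::sum);
        }
        return freq;
    }

    public static void main(String[] args) {
        // Made-up bigrams, for illustration only.
        List<String> bigrams = Arrays.asList("CUI1-CUI2", "CUI2-CUI3", "CUI1-CUI2");
        for (Map.Entry<String, Integer> e : count(bigrams).entrySet()) {
            // Assumed layout: bigram, a tab, then its count.
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

    In the Hadoop implementation the same tally is spread across many
    workers, which is how the package avoids holding every count in the
    memory of a single process.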

  REFERENCING
    If you write a paper that has used UMLS-Association in some way, we'd
    certainly be grateful if you sent us a copy.

  CONTACT US
    If you have any trouble installing or using CUICollectorMapReduce,
    please contact us via the users mailing list:

    umls-association@yahoogroups.com

    You can join this group by going to:

    <http://tech.groups.yahoo.com/group/umls-association/>

    You may also contact us directly if you prefer:

        Bridget T. McInnes: btmcinnes at vcu.edu 
        Amy L. Olex: alolex at vcu.edu

  SOFTWARE COPYRIGHT AND LICENSE
    Copyright (C) 2017 Bridget T McInnes and Amy L. Olex

    This suite of programs is free software; you can redistribute it and/or
    modify it under the terms of the GNU General Public License as published
    by the Free Software Foundation; either version 2 of the License, or (at
    your option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to the Free Software Foundation, Inc.,
    59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

    Note: The text of the GNU General Public License is provided in the file
    'GPL.txt' that you should have received with this distribution.