NAME
    FAQ - Frequently Asked Questions about SenseClusters

SYNOPSIS
    Answers to some frequently asked questions about SenseClusters

DESCRIPTION
  Functionality-related Questions:
   What are contexts ?
    Contexts are units of text whose categorization you are interested in.
    For example, if you have a collection of emails that you would like to
    categorize, then each email becomes a context, and SenseClusters will
    try to cluster these contexts (emails) into separate clusters.

   How small or large can a context be ?
    A context can be as small as a phrase/sentence or as large as a complete
    document.

   What applications can SenseClusters be used for ?
    SenseClusters can be applied to any problem where you have a set of
    text units (contexts) that you would like to cluster into separate
    groups based on their (dis)similarity. A few such applications are
    listed below:

     * Word Sense Discrimination
     * Proper Name Discrimination
     * Email Clustering
     * Word Clustering
     * Document Clustering

  Package-related Questions:
   Can I use SenseClusters without installing all the other packages like NSP, SVDPACK, CLUTO ?
    Some packages are required while others are optional. You will always
    need NSP, Math::SparseVector, Algorithm::Munkres, PDL and CLUTO. If you
    plan to carry out Singular Value Decomposition to reduce the feature
    dimensions, SVDPACK is required.

    You will also need the Bit::Vector and Set::Scalar modules if you decide
    to select the --binary option in the wrappers or run bitsimat.pl.

    Technically speaking, then: NSP, PDL, Math::SparseVector, CLUTO and
    Algorithm::Munkres are required, SVDPACK is recommended, and Bit::Vector
    and Set::Scalar are optional. However, we strongly recommend that all of
    these be installed to take full advantage of the package.
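
    As a rough illustration (assuming a configured cpan client, and noting
    that NSP is distributed on CPAN as Text::NSP), the Perl module
    dependencies above can usually be installed with a command like:

     cpan Text::NSP Math::SparseVector Algorithm::Munkres PDL Bit::Vector Set::Scalar

    SVDPACK and CLUTO are not CPAN modules and are installed separately from
    their own distributions; see the INSTALL file for details.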

  Questions on Data Formatting:
   Is there any specific reason why you prefer Senseval-2 format ?
    Yes. There are various data formats, and it gets very complicated and
    confusing if we don't specify one as our standard. The Senseval-2 format
    is very simple and does a nice job of putting together information such
    as instance and sense ids along with the actual context data. An example
    of a Senseval-2 formatted file follows:

     <?xml version="1.0" encoding="iso-8859-1" ?>
     <!DOCTYPE corpus SYSTEM  "lexical-sample.dtd">
     <corpus lang='english' tagged="NO">
     <lexelt item="art.n">
     <instance id="art.40025" docsrc="bnc_ASY_548">
     <answer instance="art.40025" senseid="arts%1:09:00::"/>
     <context>
     i would therefore argue that one of the chief tasks of education perhaps its overriding task is the education and encouragement of a child's imagination so that he may not be a slave to a perception confined solely to the present a perception that is little more than blindness the teaching of history is one part of such an education the encouragement of creativity is another and there are others still after primary school the encouragement of the imagination in children and the cultivation of specifically creative activities has often been thought an optional part of the curriculum a luxury that may have to be dispensed with left in if at all for the less able pupils deemed incapable of serious learning or for that minority determined to reject understanding imagination has been associated especially with the <head>arts</head> and thus in recent years has been increasingly downgraded 
     </context>
     </instance>
     <instance id="art.40028" docsrc="bnc_A0P_1561">
     <answer instance="art.40028" senseid="arts%1:09:00::"/>
     <context>
     it was not all going to be wine and roses and leonard again felt the sharp problem of the canadian writer at that time having a small home market not wishing to become artistically part of the of america and yet having nowhere else to go as layton said still regret that we got no encouragement from the cbc because i think that we would have gone on to write plays so it was that they went on to do other things but separately a working relationship was thereby broken up and two highly creative thinkers had their play writing ambitions stillborn this was not in fact the end of leonard's ambitions in that regard in flowers for hitler pp ff he published his new a ballet drama in one act and an involvement in film making would help to sublimate it as we shall see leonard did however manage to get a grant from the canadian <head>arts</head> council 
     </context>
     </instance>
     <instance id="art.40031" docsrc="bnc_CAF_1966">
     <answer instance="art.40031" senseid="art%1:06:00::"/>
     <context>
     lush life is your perennial lost soul an individual living on the marginal precipice of society and veering quickly towards a grim finale he is in short the perfect nik cohn character from the outset cohn makes it very clear that he is bent on taking a walk on the wild side to chronicle the lives of the losers he meets along this small strip of the great american nowhere and so the heart of the world is ultimately a collection of encounters of stories stories like that of the wall street broker who is currently sweeping the streets as part of his sentence for possessing controlled substances or the black broadway actor who always fears ending up in the gutter or the refugee from china who develped a passion for a stone woman from county clare or the iranian <head>art</head> merchant who once defaced picasso's guernica with a spray can 
     </context>
     </instance>
     <instance id="art.40034" docsrc="bnc_A5J_102">
     <answer instance="art.40034" senseid="art%1:06:00::"/>
     <context>
     kate foster of the halkin arcade belgravia the scholarly specialist in continental porcelain found the opposite suppose we've got to get used to rich people coming along with their she said met new clients which is what i went for in the oriental field bluett's failed to find buyers for a group of chinese warring states bronze vessels of extraordinary scholarly interest while colnaghi oriental found spectacularly decorative seventeenth century bronzes easy to move american buyers depend on advisers whether decorators or art specialists to an extent unparalleled in europe they read publications that keep them abreast of market trends personalities lawsuits and prices rather than delving into <head>art</head> history as do their counterparts in europe 
     </context>
     </instance>
     </lexelt>
     </corpus>

   How should I get my data in Senseval-2 format ?
    We provide a preprocessing program, text2sval.pl, in
    Toolkit/preprocess/plain that converts data in plain text format into
    Senseval-2 format. Data in any other format has to be converted to
    Senseval-2 format before it can be used with SenseClusters.

   What if I am using SenseClusters for an application like email sorting where my data instances are emails and don't have any target words ?
    In version 0.53, we call this global mode, in which the training and/or
    test data can be generic and need not consist of instances of a specific
    target word. The modified wrapper discriminate.pl handles this case
    automatically. It first checks whether a target.regex file is provided
    by the user or already exists in the current directory. If not, it tries
    to create target.regex automatically by searching for <head> tags in the
    test data. If no <head> tags are found in the test data, it assumes that
    the test data is global and treats it differently from local
    (target-specific) data. For example, co-occurrence features are not
    supported in global mode, the --scope option will not work if the
    corresponding train/test file is global, and so on.
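
    For illustration only (the instance id, docsrc value and senseid below
    are hypothetical placeholders), a global-mode instance is simply a
    Senseval-2 instance whose context contains no <head> tag:

     <instance id="email.001" docsrc="example">
     <answer instance="email.001" senseid="unknown"/>
     <context>
     hi all just a quick reminder that the project meeting has moved to thursday afternoon please send your status updates before then
     </context>
     </instance>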

   Can I use multiple target words in the context of the same instance ?
    No, SenseClusters allows only a single target word per instance.
    However, you can handle this situation by duplicating the same instance
    with a different target word marked in each copy. This makes the most
    sense when you specify the --scope option, which considers only a few
    words around the target word, so the duplicated instances will
    (hopefully) end up with different effective contexts.
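
    For illustration only (hypothetical instance ids, docsrc values and
    senseids), the same context duplicated with two different target words
    might look like this:

     <instance id="line.001a" docsrc="example">
     <answer instance="line.001a" senseid="unknown"/>
     <context>
     the new telephone <head>line</head> also carries data for the product ordering service
     </context>
     </instance>
     <instance id="line.001b" docsrc="example">
     <answer instance="line.001b" senseid="unknown"/>
     <context>
     the new telephone line also carries data for the <head>product</head> ordering service
     </context>
     </instance>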

   I can't use setup.pl because it splits data into different files according to their <lexelt> values. What if I want to discriminate the whole data set, which has multiple <lexelt> tags ?
    There are two ways to handle this situation.

    1. Use a single <lexelt> tag with any item value, say <lexelt
    item="MULTI">. To retain the lexelt information, append the original
    lexelt value to the instance ids, so that after clustering you will know
    which instances belong to which lexelt item.

    OR

    2. Let setup split the data on <lexelt> values and then concatenate the
    split results into a single data file. Note that concatenating XML-like
    Senseval-2 files is tricky: make sure to remove the header information
    (the <?xml ?> declaration, <!DOCTYPE ...> and <corpus> tags) before each
    <lexelt> tag except the first one, and all footer tags (</corpus>) after
    each </lexelt> tag except the last one. A rough Perl sketch of such a
    merge is shown below.

    We plan to distribute some scripts of our own that combine two
    Senseval-2 formatted XML files into a single valid Senseval-2 file, in
    order to better support multi-lexelt data.
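
    In the meantime, here is a minimal Perl sketch of the merge described in
    option 2. It assumes each input file is a complete Senseval-2 file like
    the example above; the script name is hypothetical and the header/footer
    detection is deliberately simplistic, so adapt it to your data:

     #!/usr/bin/perl -w
     # merge-sval2.pl : naively merge several Senseval-2 files into one
     # usage: perl merge-sval2.pl file1.xml file2.xml ... > merged.xml
     use strict;

     my @files = @ARGV or die "usage: merge-sval2.pl file1.xml file2.xml ...\n";

     my ($header, $footer, @bodies) = ('', '');

     foreach my $file (@files) {
         open my $fh, '<', $file or die "cannot open $file: $!";
         my ($head, $body, $foot) = ('', '', '');
         while (my $line = <$fh>) {
             if    ($line =~ /<\?xml|<!DOCTYPE|<corpus\b/) { $head .= $line; }
             elsif ($line =~ /<\/corpus>/)                 { $foot .= $line; }
             else                                          { $body .= $line; }
         }
         close $fh;
         $header ||= $head;          # keep the header of the first file
         $footer   = $foot if $foot; # keep the footer of the last file
         push @bodies, $body;
     }

     print $header, @bodies, $footer;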

  Questions on Features:
   What are second order co-occurrences ?
    These are the words that co-occur with the co-occurrences of the target
    word. For example, if the given bigram file includes bigrams such as

     telephone<>line<>
     product<>line<>
     market<>product<>
     telephone<>service<>

    then telephone and product directly co-occur with line and hence become
    the (first order) co-occurrences of line, while market and service
    co-occur with product and telephone respectively, and hence become
    second order co-occurrences, since they are only indirectly related to
    line.
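
    The following short Perl sketch illustrates this definition on the four
    bigrams above. It is only an illustration of the idea, not the code
    SenseClusters itself uses to build its features:

     #!/usr/bin/perl -w
     # first vs. second order co-occurrences of a target word
     use strict;

     my $target  = 'line';
     my @bigrams = (                 # NSP-style word1<>word2<> pairs, scores omitted
         'telephone<>line<>',
         'product<>line<>',
         'market<>product<>',
         'telephone<>service<>',
     );

     my (%first, %second);

     # first order: words that appear in a bigram together with the target
     foreach my $bigram (@bigrams) {
         my ($w1, $w2) = split /<>/, $bigram;
         $first{$w2} = 1 if $w1 eq $target;
         $first{$w1} = 1 if $w2 eq $target;
     }

     # second order: words that appear in a bigram with a first order co-occurrence
     foreach my $bigram (@bigrams) {
         my ($w1, $w2) = split /<>/, $bigram;
         $second{$w2} = 1 if $first{$w1} and $w2 ne $target and !$first{$w2};
         $second{$w1} = 1 if $first{$w2} and $w1 ne $target and !$first{$w1};
     }

     print "first order : ", join(', ', sort keys %first),  "\n";  # product, telephone
     print "second order: ", join(', ', sort keys %second), "\n";  # market, service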

   I see target word in the features file. How can I exclude target words from my feature set ?
    By default, SenseClusters doesn't make an attempt to exclude the target
    word, but the programs wordvec.pl and order1vec.pl provide an --extarget
    option that omits the target word from the feature set, and therefore
    avoids using it as a feature when creating word and context vectors.

  Questions on SVD:
   I see various files at <http://netlib.org/svdpack/> - which should I download to use SenseClusters ?
    The last item, svdpackc.tar.gz! This is the C implementation of SVDPACK.

   I get the following error when I do "make las2" ...
            make las2
            gcc -O -c las2.c
            las2.c:1365: conflicting types for "random"
            /usr/include/stdlib.h:397: previous declaration of "random"
            make: *** [las2.o] Error 1

    Any ideas why ?

    Yes. Modify the Makefile distributed with SVDPACKC to use an ANSI C
    compiler. You can fix this by changing the line CC = gcc in the Makefile
    to CC = gcc -ansi.
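
    In other words, the relevant line in the SVDPACKC Makefile changes from

     CC = gcc

    to

     CC = gcc -ansi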

   When I run las2, I get an error message "cannot open file matrix for reading".
    las2 requires that the input matrix is in the same directory where you
    run las2 and is named "matrix". A quick test is:

     gunzip belladit.Z    # belladit is a sample matrix distributed with SVDPACKC
     cp belladit matrix   # copy belladit to the expected name "matrix"
     las2                 # run las2 on matrix

    This should create two output files, lao2 (a text file) and lav2 (a
    binary file). You can take a quick look at lao2 and make sure that there
    are no error messages in it.

   How do I install SVDPACK ?
    The INSTALL file distributed in SenseClusters' main package directory
    includes detailed instructions on installing and running a sample test
    on SVDPACK.

   How do I set values of parameters in file lap2 and las2.h ?
    Detailed help on setting the values of the various parameters in the
    files lap2 and las2.h is provided in the perldoc of the mat2harbo.pl
    program distributed with SenseClusters. Type "man mat2harbo.pl" to see
    this documentation.

   I see some errors when running the test scripts for svdpackout.pl. It looks like svd/las2 is producing "reasonable" results, but they seem to be different from what the test case is expecting. What's happening?
    SVDPACKC produces somewhat different results on different hardware
    platforms, and can sometimes produce different results from run to run
    on the same platform due to certain random choices that the algorithm
    makes.

    Our test cases for svdpackout.pl are constructed using a Linux computer
    that is running an i686/SMP flavor of Linux. If you run the test scripts
    on a different platform, you may get different (but equally good)
    results.

    One option we are considering for future releases is to have a separate
    set of tests for svd/las2 that would generate the expected results for
    the svdpackout.pl test cases. If SVDPACKC is not installed properly this
    could lead to those test cases failing, but if that is the case then
    there are bigger problems anyway.

   I am getting format errors from svdpackout.pl. What's the problem?
    There are several possibilities. One is that you may need to add more
    precision or otherwise adjust the format to store the results from
    SVDPACK.

    Also note that las2 produces output in a binary format that
    svdpackout.pl "decodes". If you ran las2 on a different platform than
    the one where you run svdpackout.pl, you can also see these format
    errors.

  Questions on Clustering:
   I see some segmentation faults while running CLUTO's vcluster and scluster programs. Is there anything going wrong there ?
    A clustering program will not function properly if the vector/similarity
    matrix that we feed it is highly sparse. We especially notice this
    problem while running the agglomerative algorithm in vector space, which
    gives segmentation faults when the input context vectors are extremely
    sparse.

    In our next version, we plan to show a warning message that notifies the
    user when the context vectors going into the clustering program are too
    sparse for it to work normally. This problem occurs when the given
    training data is very small, and so, in turn, is the feature set created
    from it.
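
    As a rough way to gauge how sparse a matrix is, assuming a CLUTO-style
    sparse matrix file whose first line lists the number of rows, columns
    and non-zero entries (check the first line of your matrix file to
    confirm this layout; the file name below is a placeholder), the
    following one-liner prints the fraction of non-zero cells:

     perl -lane 'print $F[2] / ($F[0] * $F[1]); exit' your_matrix_file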

   How do I know how many clusters to create ?
    SenseClusters now supports four different cluster stopping measures,
    each of which tries to predict the appropriate number of clusters for
    the given data. Refer to
    http://www.d.umn.edu/~tpederse/Pubs/naacl06-demo.pdf for more details.

  Questions on Evaluation:
   In Word Sense Disambiguation, I can see how one can evaluate the accuracy by comparing the sense tags attached by a disambiguation algorithm against the manually attached sense tags. It's not clear to me how the evaluation works in discrimination, where the algorithm doesn't attach any sense tags.
    There are various ways to evaluate discrimination performance. Some
    require the manually attached (true) sense tags while others don't.

    1. You can use cluster analysis techniques that report measures such as
    intra-cluster/inter-cluster similarity and standard deviation to
    evaluate performance. Ideally, the result should show high intra-cluster
    similarity and low inter-cluster similarity. This doesn't require
    knowledge of the true sense tags.

    2. You can also look at the most discriminating features, which are
    peculiar to each cluster, and manually judge how well they describe a
    distinct meaning of the target word. This too doesn't require knowledge
    of the true sense tags of the instances.

    3. Finally, you can evaluate discrimination performance by providing
    sense-tagged data for test purposes and requesting explicit evaluation
    by selecting the --evaluate option in the wrapper. This will report the
    accuracy of discrimination in terms of precision and recall. The details
    of evaluating word sense discrimination can be found in our NAACL-03
    student workshop paper.

SEE ALSO
    <http://senseclusters.sourceforge.net/>

AUTHORS
     Ted Pedersen, University of Minnesota, Duluth
     tpederse at d.umn.edu

     Amruta Purandare
     University of Pittsburgh

     Anagha Kulkarni
     Carnegie-Mellon University

COPYRIGHT
    Copyright (c) 2003-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.