NAME
    TODO - List of things TODO for SenseClusters

SYNOPSIS
    Plans for future versions of SenseClusters

DESCRIPTION
  Version 1.05 Todo
    *   Add support for ngram cluster labeling to the web interface?

    *   Resolve the testing error in wordvec.pl, and the deprecated use of
        defined() in keyconvert.pl. These are known issues in 1.03.

    *   Consider using SVDLIBC rather than SVDPACKC for SVD; however, we
        have been having problems compiling SVDLIBC on 64-bit platforms.

    *   Testing of SVDPACK remains problematic, since results can vary from
        platform to platform. At present we are simply testing to see if
        output is created.

    *   Introduce a CPAN style 'make test' option. This can be coded even
        for command line interfaces, but it can be a little messy,
        especially when a program produces not a single value but tables of
        values or formatted text (which is often what we do). It must also
        be done in a way that handles different file systems (probably via
        File::Spec).
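
        A minimal sketch of one such test, assuming a hypothetical
        t/output.t that runs one of the programs and compares its
        formatted output against a stored reference file (the program
        name, arguments, and file names below are only placeholders).
        File::Spec keeps the paths portable across file systems:

          use strict;
          use warnings;
          use File::Spec;
          use Test::More tests => 1;

          # placeholder paths, built portably with File::Spec
          my $script   = File::Spec->catfile( 'Toolkit', 'someprogram.pl' );
          my $expected = File::Spec->catfile( 't', 'expected', 'someprogram.out' );

          # run the program under the current perl ($^X) and capture all output
          my $got = `$^X $script sample.input 2>&1`;

          # slurp the stored reference output
          my $want = do {
              open my $fh, '<', $expected or die "cannot open $expected: $!";
              local $/;
              <$fh>;
          };

          # compare whole blocks of formatted text rather than a single value
          is( $got, $want, 'output matches the stored reference' );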

    *   Improve order 1 efficiency. Rather than matching each context
        against every feature regular expression, match each candidate
        feature found in a context against the feature set. The current
        approach of invoking count.pl on a temporary file created for each
        context suffers from file and process creation overhead. We need to
        investigate the NSP APIs and use them so that features can be
        identified from in-memory contexts instead of having to create
        temporary files.
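
        A rough sketch of the in-memory idea (independent of the NSP
        APIs, which still need to be investigated): store the features
        in a hash and look up each candidate unigram and bigram drawn
        from the context, rather than creating a temporary file and a
        count.pl process per context. The feature list and contexts
        below are only placeholders; "<>" is the NSP bigram separator:

          use strict;
          use warnings;

          # hypothetical feature set, stored in a hash for constant-time lookup
          my %is_feature = map { ( $_ => 1 ) } 'telescope', 'planet<>earth', 'orbit';

          my @contexts = (
              'the telescope was pointed at the orbit of mars',
              'life on planet earth depends on the sun',
          );

          for my $context (@contexts) {
              my @words = split ' ', $context;
              my %found;

              # generate each candidate unigram and bigram from the context
              # and look it up, instead of running every feature regex on it
              for my $i ( 0 .. $#words ) {
                  $found{ $words[$i] }++ if $is_feature{ $words[$i] };
                  next if $i == $#words;
                  my $bigram = "$words[$i]<>$words[$i+1]";
                  $found{$bigram}++ if $is_feature{$bigram};
              }
              print join( ' ', sort keys %found ), "\n";
          }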

    *   Provide simple tools that allow a user to visualize more complicated
        data structures such as the 2nd order context vectors. For example,
        right now a user can see the word vectors that will be averaged
        together (in the .wordvec file), but they are purely numeric. It
        would be nice to see which words are associated with these values.

    *   Make discriminate.pl more modular in its organization, possibly
        through the use of subroutines. Reduce its reliance on system
        calls, which can lead to portability problems.

    *   Fix the --showargs option in discriminate.pl. It has not worked for
        many versions now.

    *   Check whether testing the return values from vcluster and scluster
        in discriminate.pl really accomplishes anything. Do they return
        error and success codes reliably? Is there any chance of false
        positives or false negatives?
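
        One part of that check is making sure the exit status is read
        correctly: in Perl the value returned by system() packs the
        signal and the exit code together. A sketch of a careful check
        (the vcluster arguments are placeholders, and the check is only
        useful if vcluster really does set its exit codes reliably):

          use strict;
          use warnings;

          # placeholder command line; the real one is built by discriminate.pl
          my $status = system( 'vcluster', 'matrix.file', '2' );

          if ( $status == -1 ) {
              die "vcluster could not be run: $!\n";
          }
          elsif ( $status & 127 ) {
              die "vcluster died with signal ", ( $status & 127 ), "\n";
          }
          elsif ( $status >> 8 ) {
              die "vcluster exited with non-zero status ", ( $status >> 8 ), "\n";
          }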

    *   Replace the default stoplist with a program that automatically
        generates stoplists from a given corpus.
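
        A very simple sketch of what such a program could do, treating
        the most frequent word types in the corpus as the stoplist (the
        cutoff, the crude tokenization, and the output format -- one
        /\bword\b/ regex per line, similar to an NSP stoplist -- are
        all assumptions):

          use strict;
          use warnings;

          my $corpus_file = shift @ARGV;   # plain text corpus
          my $top_n       = 100;           # placeholder cutoff

          my %freq;
          open my $fh, '<', $corpus_file or die "cannot open $corpus_file: $!";
          while (<$fh>) {
              $freq{ lc $_ }++ for /(\w+)/g;   # crude tokenization
          }
          close $fh;

          # emit the N most frequent word types as stoplist regexes
          my @top = ( sort { $freq{$b} <=> $freq{$a} } keys %freq )[ 0 .. $top_n - 1 ];
          print "/\\b\Q$_\E\\b/\n" for grep { defined } @top;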

    *   Add version information to the SenseClusters web interface,
        including the version of SenseClusters and of the modules it uses.

    *   There are various small utility programs whose return codes are
        checked by discriminate.pl. These programs seem to always exit with
        a success status, which means those checks can never detect a
        failure. They should return a different value on failure so that
        the discriminate.pl checks are meaningful.
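
        A hypothetical sketch of the intended convention, so that the
        checks in discriminate.pl can actually detect a failure:

          #!/usr/bin/perl -w
          use strict;

          my $input_file = shift @ARGV;

          open my $in, '<', $input_file
              or do { print STDERR "cannot open $input_file: $!\n"; exit 1 };

          # ... the actual work of the utility goes here ...

          exit 0;    # reserve 0 for success, non-zero values for failure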

    *   Reduce the number of regular expressions used in the regex file
        provided to order2vec.pl for feature identification during LSA style
        context clustering. This is required if we adhere to the
        nsp2regex.pl approach for feature identification. Right now regexes
        are generated from the training data based on all the features
        found there. These are given as input to order1vec.pl to identify
        features in the test data. The number of features identified in the
        test data can be less than the number of regexes created from the
        training data (i.e. some regexes may not match anything in the test
        data). Currently this same regex file is given as input to
        order2vec.pl in LSA context clustering mode, so we also need to
        provide a feature file to order2vec.pl specifying which features
        were actually found in the test data (the
        $PREFIX.features_in_testdata file created by discriminate.pl). If
        we instead create a regex file containing just the regexes that
        matched at least once in the test data, then only this new regex
        file needs to be provided to the order2vec.pl --featregex option.
        This change needs to be made in order1vec.pl (just as it currently
        prints clabels selectively for only those features that were found
        in the test data, it can create a new regex file containing just
        the regexes that matched at least once in the test data). An
        additional FEATURE file will then no longer be required by
        order2vec.pl in LSA context clustering mode (the FEATURE file will
        still be required in SC native word or context clustering).
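
        A standalone sketch of the filtering step described above. The
        file names are placeholders, each line of the regex file is
        assumed to be of the form /.../ (the real nsp2regex.pl output
        may carry extra annotation that would need to be preserved),
        and in practice this logic would live inside order1vec.pl,
        which already knows which regexes matched:

          use strict;
          use warnings;

          my ( $regex_file, $test_file, $out_file ) = @ARGV;   # placeholder names

          # slurp the test data once
          my $test_data = do {
              open my $fh, '<', $test_file or die "cannot open $test_file: $!";
              local $/;
              <$fh>;
          };

          open my $in,  '<', $regex_file or die "cannot open $regex_file: $!";
          open my $out, '>', $out_file   or die "cannot open $out_file: $!";

          while ( my $line = <$in> ) {
              chomp $line;
              my ($pattern) = $line =~ m{^/(.*)/};   # text between the outer slashes
              next unless defined $pattern;
              # keep only regexes that match at least once in the test data
              print {$out} "$line\n" if $test_data =~ m{$pattern};
          }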

    *   Wherever possible and appropriate, add the error checks from
        discriminate.pl to the actual programs that require them. For
        example: check for zero features in order1vec.pl, wordvec.pl and
        order2vec.pl.

    *   Check why the criterion function values are different across
        platforms (Linux vs. Solaris). Currently the test cases for
        clusterstopping.pl use platform dependent checking - can it be made
        platform independent?
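
        If the differences turn out to be floating point precision
        rather than a real bug, the test cases could compare criterion
        function values within a tolerance instead of exactly. A sketch
        (the values and the tolerance are placeholders):

          use strict;
          use warnings;
          use Test::More tests => 1;

          my $expected  = 1.234567;   # placeholder reference value
          my $got       = 1.234570;   # placeholder value parsed from program output
          my $tolerance = 1e-4;

          ok( abs( $got - $expected ) <= $tolerance,
              "criterion value within $tolerance of the reference" );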

    *   The idea of global training data is to have one large file of plain
        text that is used as a source of training data for multiple target
        word discrimination problems. maketarget.pl used to produce regexes
        of the form /(line)/, presumably to be used to identify target words
        in plain corpora (where no head tags have been inserted). Does
        SenseClusters support the identification of target word features
        under these circumstances (where there is no head tag)? If so, we
        should adjust maketarget.pl so that it continues to produce target
        regexes of this form. As of 0.95 it only produces them with the
        head tag. Right now discriminate.pl insists that the training data
        have a head tag in it in order to find tco features. It seems like
        you should still be able to try to find tco features if you have
        specified a plain text regex such as the one above.

    *   The gap statistic generates expected values for randomly created
        matrices that have the same marginal totals as the observed data. In
        some cases the expected values (for a criterion function) are
        actually greater than the observed values, which suggests that the
        random data is in fact benefiting more from the clustering than the
        observed data. It isn't clear why the expected values aren't always
        less than the observed ones, since random data, when clustered,
        should not really get better and better criterion function scores as
        the number of clusters increases, and in any case those scores
        should not be better than the observed values.
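
        A tiny sketch of the sanity check implied above: compare the
        observed criterion value against the mean of the expected
        values from the randomly generated matrices, and flag the
        anomalous case. The numbers are placeholders, and this assumes
        a criterion function that grows as the clustering improves:

          use strict;
          use warnings;
          use List::Util qw(sum);

          # placeholder scores: one observed value and the values from
          # clustering B random matrices with the same marginal totals
          my $observed = 0.62;
          my @expected = ( 0.58, 0.60, 0.57, 0.59 );

          my $mean_expected = sum(@expected) / @expected;

          # for a criterion that grows with better clustering, random data
          # is normally expected to score below the observed data
          warn "expected value exceeds observed value\n"
              if $mean_expected > $observed;

          printf "observed %.4f  expected %.4f  gap %.4f\n",
              $observed, $mean_expected, $observed - $mean_expected;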

  Version 0.98 and 1.00
    Reorganize the package, move to CPAN, and make installation easier via
    a Bundle; general clean-up releases. (good start made)

  Version 0.95 ToDo
    1. Introduce LSA support for context discrimination. (Done)

  Version 0.93 ToDo
    1. Introducing feature_by_context/LSA support for word clustering (done)

  Version 0.83 ToDo
    1. Integrating Cluster Stopping (done)

  Version 0.67 ToDo
    1. Integrating gcluto (not done, on hold)

  Version 0.65 ToDo
    1. Taint flag (done)
        Implement the taint mode.

  Version 0.63 ToDo
    *   Web Interface (done)

        1. Time outs (done)
            Currently, if the given data or combination of options leads to
            longer than usual processing time, the web interface just hangs
            and does not give back any links to the results even if the
            process has finished. Investigate whether the problem is a
            request time-out and find a solution for it.

        2. Scope Options (done)
            The current version of the web interface does not support the
            scope_train and scope_test options. Implement them.

        3. Format option (done)
            The format option is also not available. Implement it.

        4. Taint flag (done)
            Understand and implement the taint mode.

  Version 0.61 Todo
    *   Labeling Clusters (done)

        Discovered clusters will be labeled with their most discriminating
        features or with the actual dictionary glosses. This will indicate
        which clusters represent which word senses.

    *   svdpackout (continuing issue)

        There remain problems with svdpackout test cases A1g, A1h, and A2 on
        Solaris. I think these are due to precision issues on 64 bit CPUs
        and related differences.

    *   SVD on Order1 Type vectors (done)

        Omit null columns from the order1 context vectors created by
        order1vec. They cause a problem for mat2harbo.pl, and hence SVD
        fails on order1 type vectors at present.

    *   Order1 Efficiency (continuing issue)

        Order1 in its current form is far too slow even for a few thousand
        contexts and features. Its speed needs to be improved.

    *   Warnings in Demo (done)

        The demo script makedata.sh shows Malformed UTF-8 warnings. Either
        the Senseval-2 data needs to be cleaned, or the programs
        preprocess.pl and prepare_sval2.pl need to be modified to avoid
        these encoding warnings.

    *   Multi-Lexelt (on hold)

        preprocess.pl (and setup.pl, which uses preprocess.pl) currently
        splits the given DATA on lexelts. Allow multiple lexelt
        functionality by modifying preprocess.pl or using some other
        techniques ...

        Amruta has some mixml scripts that could be made distributable to
        handle this multi-lexelt issue. These scripts concatenate two xml
        files into a single xml file. They are useful for combining the
        split lexelt results from setup/preprocess into a single
        multi-lexelt file and also for supporting experiments on multiple
        lexelts as was done in CONLL 2004.

    *   Installation (on hold)

        Update the Makefile to check whether SVDPACK and CLUTO are
        installed. If not, warn users that options like --svd or cluto
        can't be used. Auto-download and install CPAN modules like PDL and
        Bit::Vector.

        One idea would be to bundle and redistribute all the required
        modules and programs from other packages in the SenseClusters
        tarball and install them like regular SenseClusters code.
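
        A sketch of the kind of check Makefile.PL could perform,
        scanning the PATH for the external binaries. The program names
        assume the usual CLUTO and SVDPACKC executables, vcluster and
        las2:

          use strict;
          use warnings;
          use File::Spec;

          # look for an executable anywhere on the user's PATH
          sub find_on_path {
              my ($program) = @_;
              for my $dir ( File::Spec->path() ) {
                  my $candidate = File::Spec->catfile( $dir, $program );
                  return $candidate if -x $candidate;
              }
              return;
          }

          for my $program (qw(vcluster las2)) {
              warn "$program not found on PATH; options that depend on it ",
                   "(such as --svd or CLUTO clustering) will not be usable\n"
                  unless find_on_path($program);
          }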

    *   FAQs (ongoing)

        Include questions that would be interesting to the general user
        community.

  For later versions
    *   Fuzzy Feature matching (still a good idea)

        Perl supports fuzzy (approximate) pattern matching ... it might be
        useful in our feature regex matching!
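
        For example, the String::Approx module on CPAN provides
        agrep-style approximate matching. A small sketch using its
        amatch() function with the default approximateness (the feature
        and the misspelled tokens are placeholders):

          use strict;
          use warnings;
          use String::Approx qw(amatch);

          my @context_words = qw(telescop planett orbit);   # placeholder tokens

          # keep the words that approximately match the feature 'telescope'
          my @hits = amatch( 'telescope', @context_words );
          print "@hits\n";    # 'telescop' should match within the default tolerance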

    *   Technical report for SenseClusters - once the sparse matrix support
        is available. (great idea)

AUTHORS
     Ted Pedersen, University of Minnesota, Duluth
     tpederse at d.umn.edu

     Amruta Purandare 
     University of Pittsburgh

     Anagha Kulkarni 
     Carnegie-Mellon University

COPYRIGHT
    Copyright (C) 2005-2008 Ted Pedersen, Amruta Purandare, Anagha Kulkarni

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.