The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
=head1 NAME

TODO - List of things TODO for SenseClusters

=head1 SYNOPSIS

Plans for future versions of SenseClusters

=head1 DESCRIPTION

=head2 Version 1.03 Todo

=over

=item * Consider the use of SVDLIBC rather than SVDPACKC for SVD, 
however, have been having problems compiling SVDLIBC on 64 bit 
platforms.

=item * Testing of SVDPACK remains problematic, since results can vary 
from platform to platform. At present we are simply testing to see if 
output is created.

=item * Introduction of CPAN style 'make test' option. Can be coded even 
for command line interfaces, but can be a little messy, especially when 
a program is not producing a single value but rather tables of values or 
formatted text (which is often what we do). Must also do this in such a 
way that it handles different file system (via File::Spec probably).

=item * Improve order 1 efficiency. Rather than matching each context 
against every feature (regular expression), match each possible feature in 
context with the features. The simple approach for invoking count.pl for
each context by creating a temporary file for each context suffers from 
file and process creation overhead. Need to investigate NSP APIs and use 
them so that features can be identified from contexts in memory instead 
of having to create temporary files.

=item * Provide simple tools that allow a user to visualize more 
complicated data structures like the 2nd order context vectors. For 
example, right now a user can see the word vectors associated that will be 
averaged together (in the .wordvec file), but they are purely numeric. 
It would be nice to see what words are associated with these values. 

=item * Make discriminate.pl more modular in its organization, 
possibly through the use of subroutines. Reduce reliance on system calls 
which can lead to portability problems. 

=item * Fix --showargs option on discriminate.pl. Has not been working
for many versions now. 

=item * Check to see if checking return values from vcluster and 
scluster in discriminate.pl is really accomplishing anything. 
Do they return error codes and success codes reliably? Any chance for 
false positive or false negative?

=item * Replace default stoplist with a program that automatically 
generates stoplists from a given corpus. 

=item * Add version information to SenseClusters web interface,  
including version of SenseClusters and modules it is using.

=item * There are various small utility programs whose return codes are
checked by discriminate. These programs seem to always return via exit,
suggesting that their return codes are always for success. May want to 
return a different value for failures so the discriminate.pl checks are 
meaningful.

=item * Reduce the number of regular expressions used in the regex file
provided to order2vec.pl for feature identification during LSA style context
clustering. This is required if we adhere to the nsp2regex.pl approach 
for feature identification. Right now regexes are generated from 
training data based on all the features found in the training data. 
These are given as input to order1vec.pl to identify features from
test data. The number of features identified form test data can be less
than the number of regexes created from training data (i.e. some regexes
may not match anything in the test data). Currently this same regex file
is given as input to order2vec.pl in LSA context clustering mode. So we
also need to additionally provide a feature file to order2vec.pl specifying
what features were actually found in test data ($PREFIX.features_in_testdata
file created by discriminate.pl). If we create a regex file corresponding
to just the regexes that matched at least once in the test data, then just
this new regex file can be provided as input to order2vec.pl --featregex
option. This change needs to be done in order1vec.pl (just as currently
it prints clabels selectively for only those features that were found in
the test data, it can create a new regex file containing just the regexes
that matched at least once in the test data). An additional FEATURE file
will then no more be required by order2vec.pl in LSA context clustering
mode (the FEATURE file will still be required in SC native word or 
context clustering).

=item * Wherever possible and appropriate, add the error checks 
from discriminate.pl to the actual programs that require that error 
check. For example: Check for 0 zero features in order1vec.pl, 
wordvec.pl and order2vec.pl

=item * Check why the criterion function values as different across
platform (Linux vs. Solaris). Currently the test cases for
clusterstopping.pl used platform dependent checking - can it
be made platform independent?

=item * The idea of global training data is to have one large file
of plain text that is used as a source of training data for multiple
target word discrimination problems. maketarget.pl used to produce
regexes of the form /(line)/, presumably to be used to identify 
target words in plain corpora (where no head tags have been inserted).
Does SenseClusters support the identification of target word features
under these circumstances (where there is no head tag)? If so, we
should adjust maketarget.pl so that it continues to produce target
regexes of this form. As of 0.95 it only produces them with the head
tag. Right now discriminate.pl insists that training data have a head
tag in it to find tco. It seems like you should still be able to try
and find tco features if you have specified a plain text regex such
as we have above. 

=item * The gap statistic generates expected values for randomly
created matrices that have the same marginal totals as the observed
data. In some cases the expected values (for a criterion function)
are actually greater than the observed, which suggests that the
random data is in fact benefiting more from the clustering than the
observed data. It isn't clear why the expected values aren't always
less than the observed, since random data when clustered should not
really get better and better criterion function scores as the number
of clusters increases, or at least these should not be greater or better
than the observed values. 

=back

=head2 Version 0.98 and 1.00

=over

=item Reorganize package, move to CPAN, make installation easier 
via a Bundle, general clean up releases. (good start made)

=back

=head2 Version 0.95 ToDo

=over 

=item 1. Introduce LSA support for context discrimination. (Done)

=back

=head2 Version 0.93 ToDo

=over 

=item 1. Introducing feature_by_context/LSA support for word clustering (done)

=back

=head2 Version 0.83 ToDo

=over

=item 1. Integrating Cluster Stopping (done)

=back

=head2 Version 0.67 ToDo

=over

=item 1. Integrating gcluto (not done, on hold)

=back

=head2 Version 0.65 ToDo

=over

=item 1. Taint flag (done)

Implement the taint mode.

=back

=head2 Version 0.63 ToDo

=over 

=item * Web Interface (done)

=over

=item 1. Time outs (done)

Currently if the given data or combinations of options leads to longer
than usual processing time then the web-interface just hangs and does 
not give back any links to the results even if the process has finished.
Investigate if the problem is with request time-out and find the solution 
for this problem.

=item 2. Scope Options (done)

The current version of web-interface does not support the scope_train and 
scope_test options. Implement the same.

=item 3. Format option (done)

Also the format option is not available. Implement the same.

=item 4. Taint flag (done)

Understand and implement the taint mode.

=back

=back

=head2 Version 0.61 Todo 

=over 4

=item * Labeling Clusters (done)

Discovered clusters will be labeled with their most discriminating features
or with the actual dictionary glosses. This will indicate which clusters 
represent which word senses.

=item * svdpackout (continuing issue)

There remain problems with svdpackout test cases A1g, A1h, and A2 on 
Solaris. These are due to precision issues with 64 bit CPUs, and related
issues, I think. 

=item * SVD on Order1 Type vectors (done)

Omit null columns from the order1 context vectors as created by order1vec. 
This is causing a problem for mat2harbo.pl. Hence, SVD fails on order1 type 
vectors at present.

=item * Order1 Efficiency (continuing issue)

Order1 in its current form is way too slow even for few thousand contexts
and features. It needs to be improved speed wise.
 
=item * Warnings in Demo (done)

Demo script makedata.sh shows Malformed UTF-8 warnings. Senseval-2 data 
needs to be cleaned or programs preprocess.pl and prepare_sval2.pl need to
be modified to avoid these encoding warnings.

=item * Multi-Lexelt (on hold)

preprocess.pl (and setup.pl that uses preprocess.pl) currently splits 
given  DATA on lexelts. Allow multiple lexelt functionality by modifying 
preprocess.pl or using some other techniques ... 

Amruta has some mixml scripts that can be made distributable to handle this
multi-lexelt issue. These scripts concatenate two xml files and create a 
single xml file. These are useful to combine the split lexelt results from
setup/preprocess into a single multi-lexelt file and also for supporting 
experiments on multiple-lexelts as done in CONLL 2004.

=item * Installation (on hold)

Update Makefile to check if SVDPACK, CLUTO are installed. If not, warn users 
that options like --svd or cluto can't be used. Auto-download and install 
CPAN modules like PDL, Bit::Vector. 

One idea would be to bundle and redistribute all the required modules and 
programs from other packages into SenseClusters' tarball and install them like 
regular SenseClusters' code.

=item * FAQs  (ongoing)

Include questions that would be interesting to general user community.

=back

=head2 For later versions 

=over 4

=item * Fuzzy Feature matching (still a good idea)

Perl supports fuzzy pattern matching ... 
might be useful in our feature regex matching !

=item * 

Technical report for SenseClusters - once the sparse matrix support is 
available. (great idea)

=back

=head1 AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare 
 University of Pittsburgh

 Anagha Kulkarni 
 Carnegie-Mellon University

=head1 COPYRIGHT

Copyright (C) 2005-2008 Ted Pedersen, Amruta Purandare, Anagha Kulkarni

Permission is granted to copy, distribute and/or modify this  document  
under the terms of the GNU Free Documentation License, Version 1.2 or  any  
later version published by the Free Software Foundation; with no  
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web   
at L<http://www.gnu.org/copyleft/fdl.html> and is included in this    
distribution as FDL.txt. 

=cut