The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
=head1 NAME

FAQ Frequently Asked Questions about the Ngram Statistics Package

=head1 SYNOPSIS

Frequently Asked Questions about the Ngram Statistics Package. 

=head1 DESCRIPTION

This FAQ is very much a work in progress, and is written somewhat  
informally. Please take the information contained herein in that light.  
I'd be happy to elaborate on any point raised here that isn't clear. Also,  
if you spot errors or have suggestions I'd be happy to hear about those!

Please send your questions to me (tpederse@d.umn.edu) or the NSP mailing  
list L<http://groups.yahoo.com/group/ngram/>. 

=head2 How should I cite NSP in a paper? 

This is our preferred reference:

 @inproceedings{BanerjeeP03,
        author = {Banerjee, S. and Pedersen, T.},
        title = {The Design, Implementation, and Use of the 
                {N}gram {S}tatistic {P}ackage},     
        booktitle = {Proceedings of the Fourth International 
                Conference on Intelligent Text Processing and
                Computational Linguistics},
        year = {2003},
        month ={February},
	pages = {370--381}, 
        address = {Mexico City}}

You can get a copy of this paper at:

 L<http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf>

If you would also like to cite a URL, the official home page of NSP is :

 L<http://ngram.sourceforge.net>

=head2 Where can I find other papers by users of NSP?

We have established an NSP bibliography that includes some papers by 
users. Please make contributions to this bibliography when you publish a 
paper using NSP!

 L<http://www.d.umn.edu/~tpederse/nsp-bib/>

=head2 General Questions

=head2 What is the class structure of your Measure hierarchy?

The following diagram tries to lay that all out:

L<http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/NSP-Class-diagram.pdf>

In general we've tried to group the measures into families that reflect 
their underlying relationships, and allow us to avoid replicating a lot 
of code. 

=head3 Why do you have 'make test' and then csh ./ALL-TESTS.sh

This reflects the historical development of NSP. Initially NSP was based 
on command line scripts, and was not easy to test using the standard 
Perl methods you see in 'make test'. However, once NSP went object 
oriented, using 'make test' became much easier.

We use ./ALL-TESTS.sh to test our command line programs very 
extensively, so you should still run this and take note of any errors. 
Please make sure you run 'csh ./ALL-TESTS.sh' AFTER running 'make 
install', otherwise some of the tests cases won't find the modules they 
need (since they haven't been installed and since ./ALL-TESTS.sh does 
not look into the build libraries prior to installation. 

=head3 Why do you use the C shell (csh) so much? 

This reflects the historical development of NSP. Initially NSP was 
developed on Solaris systems, which use tcsh as their default shell. 
Since we do not do anything too fancy with the shell, it seemed like a 
reasonable thing to stay with csh even after our site and many others 
moved to Linux, where the bash shell is more common. 

=head3 What is in the future for NSP? 

Many things! See L<TODO.pod> for a fairly complete accounting  
of our future plans. We are always open to suggestions, so please send 
those to the mailing list for consideration and discussion. 

=head3 Why is NSP so slow on large files of text? 

As of version 0.61 (and before) NSP does all counting using a Perl hash. 
In other words, each Ngram is an element in a hash, and a counter is kept 
for that Ngram. This works very nicely, but if you start to deal with 
millions of words of text it can result in a very large hash that consumes 
quite a bit of memory. 

Remember, the crucial constraint is not how many words in the corpus, but  
how many unique Ngrams you have in the text. As N gets larger, the number  
of hash elements grows larger and larger. However, if you are dealing  
with unigrams NSP can process extremely large files (50 million words)  
quite efficiently.  

To start with I would recommend that you gradually increase the value 
of N and the amount of text you try to process with NSP, just to get a  
sense of how long things take on your system. Start with processing 
unigrams in a 1,000,000 word corpus, then try bigrams. Then move up to 
5,000,000 words with unigrams, and so forth. 

=head2 Questions about Significance Testing

=head3 What is Fisher's exact test? 

Fisher's exact test is a significance test that is considered to 
be more appropriate for sparse and skewed samples of data than statistics 
such as the log likelihood ratio (G^2) or Pearson's Chi-Squared test 
(X^2). 

The paper "Fishing for Exactness" gives a fairly detailed description
of Fisher's exact test relative to these other tests. Find it with the
1996 entries at: L<http://www.d.umn.edu/~tpederse/pubs.html>
It is also available at: L<http://xxx.lanl.gov/abs/cmp-lg/9608010>

=head3 What is a left sided, right sided, and two sided Fisher's exact test?

Fisher's exact test is computed by fixing the marginal totals of a 2x2 
table and then determining the probability of each of the possible 
tables that could result in those marginal totals. The different sided 
tests vary in how these individual probabilities are summed into the 
final value produced by the test. 

The left sided test is recommended (see 3) since it results
in a value that is easy to interpret as a measure of association
between words. Rather loosely speaking, it is the probability of 
randomly sampling a 2x2 table where a bigram occurs less frequently 
than was observed in the corpus you are working with. So, a high left 
sided probability indicates that you are very unlikely to observe
the bigram more frequently than you already have (if you took a 
random sample from a similar population). This suggests that the 
bigram is a "special" pair of words that may merit further attention.

The paper "Fishing for Exactness" gives a fairly detailed description
of computing left, right and two sided values. Find it with the 1996 
entries at: L<http://www.d.umn.edu/~tpederse/pubs.html>
It is also available at: L<http://xxx.lanl.gov/abs/cmp-lg/9608010>

=head3 Why do you recommend the use of a left sided Fisher's exact test?

The main advantage of the left sided test it is interpreted much like
the other tests that we provide. The larger the value the stronger the 
association between the words in the bigram. This is the same property as 
mutual information (tmi.pm), the Dice Coefficient (dice.pm), Loglikelihood 
ratio (ll.pm), and Pearson's test (x2.pm).  Rankings that are done by 
these measures can all be compared directly, where rank 1 is assigned to 
the most strongly associated bigram/s.

A right sided test is symmetric with the left sided test, so larger
values indicate weaker association between the words. There is nothing
wrong with this approach, except that you can not directly compare the
rankings of a right sided test with, for example, mutual information. 

A two sided test doesn't seem to have a very natural mapping to
explaining bigram data. 

=head3 Why don't the log likelihood tests (ll.pm) and Pearson's test (x2.pm) report the significance values associated with their raw scores? 

We don't assign significance values to these tests since setting the 
p-values seems to be a fairly arbitrary choice. It isn't clear that p 
should be .01, .05, or something else. This strikes us as an important  
limitation of significance tests. Rather than setting an arbitrary cutoff 
on these values, we recommend looking at the raw scores from these tests 
in order to establish cutoffs. 

Having said all this, it is very likely that future versions of NSP 
will have this capability. 

=head3 How do I decide which of the tests your provide is the one I should use?

The tests provided here can be divided into three classes:

 1) Power Divergence Family:
	Log Likelihood ratio (ll.pm) 
	Pearson's Chi Squared test (x2.pm)

 2) Exact test of statistical significance
	Fisher's exact test (leftFisher.pm and rightFisher.pm)

 3) Information theoretic measures
	Mutual Information (tmi.pm)
	Dice Coefficient (dice.pm)

If you would like to use a significance test with a predefined p-value
then Fisher's exact test is your choice. We have tried to implement this
as efficiently as possible, and it seems to run fairly quickly even with
large sample sizes. 

If you would prefer to have a score that provides you with a measure of
the divergence of the observed from the expected values (given an
assumption of independence between the words in the bigram) then the
power divergence family is a fine choice. Note that Ll.pm and X2.pm should
provide the same results if the data is such that no asymptotic
assumptions are being violated. If you observe difference values for the
same bigram from these tests, then one of them is flawed! You can't be
sure which one it is either. There is some very fine background material
on the issue of the likelihood ratio (G^2) versus Pearson's test (X^2)
in the following:

	@book{ReadC88,
        author={Read, T. and Cressie, N.},
        title={Goodness of fit Statistics for Discrete Multivariate Data},
        year = {1988},
        address = {New York, NY},
        publisher = {Springer-Verlag}}

If you are interested in Mutual information, then you have it, and you
have the closely related alternative measure the Dice Coefficient.                                            

=head3 How do I interpret the values of these test scores?

Carefully. Subjectively. Creatively. 

For all of the tests, a higher score means a stronger measure of
association among the words in the bigram.

In general, we prefer to make comparisons among the tests based on the
different rankings they assign rather than the absolute scores they
return. However, you can directly compare the scores from the likelihood
ratio (Ll.pm) and Pearson's test (X2.pm). Otherwise, you should not 
think of comparing these scores because they are apples and oranges. 
The rank program (rank.pl) ranks the bigrams as scored by any two
given measures and will produce a table and rank correlation coefficient
showing how they compare. Please note however that you should not
include a right sided or two sided Fisher's test in such comparisons 
since they do not follow the convention that a higher score means that
there is greater association between the bigrams. These tests should
not be compared to others with the rank.pl program!

=head3 Fisher's left sided test seems to rank a lot of bigrams first! Why?

As sample sizes get larger, the hypergeometric probabilities associated
with each possible 2x2 table of bigram data (given fixed marginal totals)
tend to approach 1. Consider setting the precision of the test
relatively high (10 or 15 digits) in order to observe this. When the
default setting is used (4 digits) there tends to be quite a lot of
rounding to 1.0000. 

The paper "Fishing for Exactness" shows how these hypergeometric
probabilities are computed. Find it with the 1996  entries at: 
L<http://www.d.umn.edu/~tpederse/pubs.html> It is also available at: 
L<http://xxx.lanl.gov/abs/cmp-lg/9608010>

=head2 Questions about stop lists and removing non-tokens

=head3 How does the stop list option (--stop) work?

[Revised for version 0.53]

A stop list is a list of words that you don't want count.pl to count.
These words (or the Ngrams that they form) are not counted and do not
figure into the overall sample size. 

Rather than specifying a list of words in a stoplist, as of version 0.53
you can specify Perl regular expressions.

There are two modes in which stop lists can be used. In the "AND" mode,
every word that makes up an N-gram must be found in the stoplist for that
N-gram to be eliminated. In the "OR" mode, any N-gram that includes at  
least one word from the stoplist will be eliminated. 

Your stop list should be a plain text file that has one Perl regular 
expression per line. 

For example, suppose your stop list consisted of:

 @stop.mode=OR
 /\bthe\b/
 /\band\b/
 /\bof\b/

[Note that \b indicates a word boundary in a Perl regex.]

Any N-gram that contains 'the', 'and', or 'of' would be excluded. 

A note on counting. Suppose that the bigram "of the" occurred 700 times
in a text and was excluded by an AND mode stop list. Let's suppose for  
this example that it is the only bigram excluded. The total sample size 
reported by count.pl would be 700 less than without the stop list. 
The frequency count for "of" occurring as the first component of a bigram
would be reduced by 700, as would the frequency count for "the" occurring
as the second component of a bigram. Note that the counts for "of" as
the second component and "the" as the first component will not be
affected by the stoplist. Note that without the stop list it will
typically be the case that the first and second position counts for
a word will be the same. 

=head3 Can I have modifiers in my stoplist regular expressions? 

No. :)

The documentation suggests that given a list of regular expressions 
such as:

 /\bis\b/
 /\bthe\b/
 /\ban\b/

NSP actually checks each regular expression one by one. Unfortunately if 
the list of regex's is very long, this becomes too slow computationally, 
and so instead we actually concatenate all the regular expressions to form 
one big regex, which is then used to do the matching. For example given the 
regexes above, they will be combined into a single regex, like so:

/(\bis\b)|(\bthe\b)|(\ban\b)/

and then this regex is used to do the matching. Observe of course that this 
produces exactly the same effect as if we had done the comparisons one 
after another, and runs much faster. However the price we pay is that you 
can't use modifiers (things outside the / /'s) while defining the regex, 
since then we wouldn't be able to concatenate them together like we are 
doing now.

=head3 Why is there a --stop and a --nontoken option for count.pl? 

The stop option allows you to specify a stop list that eliminates Ngrams  
if they are completely made up  of stop words (AND mode) or if one of the  
words in the Ngram is a stop word (OR mode). The effect of the stop   
option is to remove Ngrams from the sample. 

The nontoken option allows you to eliminate words from the text prior to 
the formation of Ngrams. This processing occurs well before the stop 
option, which is carried out after Ngrams have been formed.

=head3 How can I disregard n-grams that cross sentence boundaries or other punctuation marks? 

For example, in :

 I am here today. Where are you?

 I do not want to consider "today Where are" as a 3-gram.

There is a built-in option (--newline) to disregard Ngrams across the  
newline (\n), but there is not one to do the same across punctuation 
marks at this time.  But you can use the existing functionality to  
accomplish the same! 

Replace punctuation marks with new line characters and then use the  
option --newLine. This will disregard Ngrams that cross newline boundaries 
and thereby (in this case) cross punctuation marks. 

For example, assume file "sentence.txt" is as follows:

 this is a sentence. this is a sentence.

Then the following command-line script:

 perl -e "while(<>){s/[.]/\n/g; print}" sentence.txt > sentence.tmp

would create the following in file "sentence.tmp":
 
 this is a sentence
  this is a sentence

(To use more punctuation marks, you'd want to put them inside the [square 
brackets])

And then if you run the following:

 count.pl --token token_file.txt --newLine sentence.cnt sentence.tmp

the following is produced in file "sentence.cnt":

 6
 a<>sentence<>2 2 2
 this<>is<>2 2 2
 is<>a<>2 2 2

Of course, there'd be other ways of replacing punctuation marks with 
newlines.  Programs like sed would probably do the trick too...

There's one problem in this approach though. Assume the following is your 
original "sentence.txt" file:

 this is a sentence. this is
 a sentence.

Replacing '.' with \n would produce the following:

 this is a sentence
 this is
 a sentence

And then if you use --newLine, you'd not be able to catch the second
"is<>a" bigram.

To avoid this I would suggest the following:

 1. First replace all new lines with spaces
 2. Then replace punctuations with new lines.

The following command-line script seems to work:

 perl -e "while(<>){chomp; s/[.]/\n/g; print}" sentence.txt > sentence.tmp

Be warned though that very long lines can lead to very slow processing. 

=head2 Questions about rank.pl

=head3 In rank.pl you use Spearman's rank correlation coefficient. Why not use Kendall's tau or some other correlation measure?

The reason we use Spearman's measure is that it is geared towards 
correlation among ranked items. Kendall's tau can be used with the actual 
values of the ranked items. However, when comparing different kinds of 
tests of association such direct comparisons may not be meaningful. For 
example, the Dice Coefficient and the Log-likelihood ratio produce values 
that are on different scales, and you can't compare them directly. This is 
why we focus on comparing ranked values, which is why we use Spearman's.

However, when your data has numerous ties in the ranks, then you may 
want to employ a method of other than Spearman's as provided via 
rank.pl, which reranks the tied items but continues to use the standard 
Spearman's formula. See the TODO list for more discussion on this point.  

=head2 Questions about kocos.pl

=head3 What is a kth order co-occurrence? 

Two words are 2nd order co-occurrences if they occur with words that occur 
with a particular word. For example, suppose that we observe the   
following bigrams in a corpus of text:

 telephone line
 busy line
 telephone operator

These are first order co-occurrences. 

"telephone" and "busy" are said to be second order co-occurrences of each 
other since they both occur with "line". (Both are first order 
co-occurrences of "line"). Then "operator" is said to be a second order 
co-occurrence of "line" since it occurs with "telephone" which 
is a first order co-occurrence of "line". 

These relationships can be visualized as graphs or chains.

 line -> telephone -> operator

kocos.pl finds such kth order relationships among words. 

=head1 AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu         

 Satanjeev Banerjee, Carnegie-Mellon University

=head1 BUGS

=head1 SEE ALSO

 NSP home : L<http://ngram.sourceforge.net>
 NSP mailing list : L<http://groups.yahoo.com/group/ngram/>

=head1 COPYRIGHT

Copyright (C) 2000-2010 Ted Pedersen

Permission is granted to copy, distribute and/or modify this  document  
under the terms of the GNU Free Documentation License, Version 1.2 or  any  
later version published by the Free Software Foundation; with no  
Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web   
at L<http://www.gnu.org/copyleft/fdl.html> and is included in this    
distribution as FDL.txt.