NAME
    FAQ - Frequently Asked Questions about the Ngram Statistics Package

SYNOPSIS
    Frequently Asked Questions about the Ngram Statistics Package.

DESCRIPTION
    This FAQ is very much a work in progress, and is written somewhat
    informally. Please take the information contained herein in that light.
    I'd be happy to elaborate on any point raised here that isn't clear.
    Also, if you spot errors or have suggestions I'd be happy to hear about
    those!

    Please send your questions to me (tpederse@d.umn.edu) or the NSP mailing
    list <http://groups.yahoo.com/group/ngram/>.

  How should I cite NSP in a paper?
    This is our preferred reference:

     @inproceedings{BanerjeeP03,
            author = {Banerjee, S. and Pedersen, T.},
            title = {The Design, Implementation, and Use of the 
                    {N}gram {S}tatistic {P}ackage},     
            booktitle = {Proceedings of the Fourth International 
                    Conference on Intelligent Text Processing and
                    Computational Linguistics},
            year = {2003},
            month ={February},
            pages = {370--381}, 
            address = {Mexico City}}

    You can get a copy of this paper at:

     <http://www.d.umn.edu/~tpederse/Pubs/cicling2003-2.pdf>

    If you would also like to cite a URL, the official home page of NSP is :

     <http://ngram.sourceforge.net>

  Where can I find other papers by users of NSP?
    We have established an NSP bibliography that includes some papers by
    users. Please make contributions to this bibliography when you publish a
    paper using NSP!

     <http://www.d.umn.edu/~tpederse/nsp-bib/>

  General Questions
  What is the class structure of your Measure hierarchy?
    The following diagram tries to lay that all out:

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/NSP-Class-diagram.pdf>

    In general we've tried to group the measures into families that reflect
    their underlying relationships, and allow us to avoid replicating a lot
    of code.

   Why do you have 'make test' and then 'csh ./ALL-TESTS.sh'?
    This reflects the historical development of NSP. Initially NSP was based
    on command line scripts, and was not easy to test using the standard
    Perl methods you see in 'make test'. However, once NSP went object
    oriented, using 'make test' became much easier.

    We use ./ALL-TESTS.sh to test our command line programs very
    extensively, so you should still run this and take note of any errors.
    Please make sure you run 'csh ./ALL-TESTS.sh' AFTER running 'make
    install', otherwise some of the test cases won't find the modules they
    need (since they haven't been installed, and ./ALL-TESTS.sh does not
    look into the build libraries prior to installation).
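
    For reference, the usual sequence (following the standard Perl module
    build steps) is:

     perl Makefile.PL
     make
     make test            # tests the object oriented modules
     make install         # may require root privileges, or use PREFIX
     csh ./ALL-TESTS.sh   # tests the command line programs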

   Why do you use the C shell (csh) so much?
    This reflects the historical development of NSP. Initially NSP was
    developed on Solaris systems, which use tcsh as their default shell.
    Since we do not do anything too fancy with the shell, it seemed like a
    reasonable thing to stay with csh even after our site and many others
    moved to Linux, where the bash shell is more common.

   What is in the future for NSP?
    Many things! See TODO.pod for a fairly complete accounting of our future
    plans. We are always open to suggestions, so please send those to the
    mailing list for consideration and discussion.

   Why is NSP so slow on large files of text?
    As of version 0.61 (and before) NSP does all counting using a Perl hash.
    In other words, each Ngram is an element in a hash, and a counter is
    kept for that Ngram. This works very nicely, but if you start to deal
    with millions of words of text it can result in a very large hash that
    consumes quite a bit of memory.

    Remember, the crucial constraint is not how many words are in the
    corpus, but how many unique Ngrams you have in the text. As N gets
    larger, the
    number of hash elements grows larger and larger. However, if you are
    dealing with unigrams NSP can process extremely large files (50 million
    words) quite efficiently.

    To start with I would recommend that you gradually increase the value of
    N and the amount of text you try to process with NSP, just to get a
    sense of how long things take on your system. Start with processing
    unigrams in a 1,000,000 word corpus, then try bigrams. Then move up to
    5,000,000 words with unigrams, and so forth.
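
    To see why the number of unique Ngrams, rather than the raw word count,
    drives memory use, here is a minimal sketch of hash based counting in
    the spirit of what count.pl does internally (this is only an
    illustration, not NSP's actual code):

     #!/usr/bin/perl
     # toy illustration of hash based bigram counting
     use strict;
     use warnings;

     my %count;      # one hash entry per *unique* bigram
     my $previous;

     while (my $line = <>) {
         for my $word (split ' ', $line) {
             $count{"$previous<>$word"}++ if defined $previous;
             $previous = $word;
         }
     }

     # memory grows with the number of keys, not with the total word count
     printf "%d unique bigrams\n", scalar keys %count;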

  Questions about Significance Testing
   What is Fisher's exact test?
    Fisher's exact test is a significance test that is considered to be more
    appropriate for sparse and skewed samples of data than statistics such
    as the log likelihood ratio (G^2) or Pearson's Chi-Squared test (X^2).

    The paper "Fishing for Exactness" gives a fairly detailed description of
    Fisher's exact test relative to these other tests. Find it with the 1996
    entries at: <http://www.d.umn.edu/~tpederse/pubs.html> It is also
    available at: <http://xxx.lanl.gov/abs/cmp-lg/9608010>

   What is a left sided, right sided, and two sided Fisher's exact test?
    Fisher's exact test is computed by fixing the marginal totals of a 2x2
    table and then determining the probability of each of the possible
    tables that could result in those marginal totals. The different sided
    tests vary in how these individual probabilities are summed into the
    final value produced by the test.

    The left sided test is recommended (see the next question) since it
    results in a value that is easy to interpret as a measure of
    association between words.
    Rather loosely speaking, it is the probability of randomly sampling a
    2x2 table where a bigram occurs less frequently than was observed in the
    corpus you are working with. So, a high left sided probability indicates
    that you are very unlikely to observe the bigram more frequently than
    you already have (if you took a random sample from a similar
    population). This suggests that the bigram is a "special" pair of words
    that may merit further attention.

    The paper "Fishing for Exactness" gives a fairly detailed description of
    computing left, right and two sided values. Find it with the 1996
    entries at: <http://www.d.umn.edu/~tpederse/pubs.html> It is also
    available at: <http://xxx.lanl.gov/abs/cmp-lg/9608010>
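
    To make the computation concrete, here is a small illustrative sketch
    of a left sided Fisher's value for a single bigram, using the usual NSP
    cell notation (n11 is the joint frequency, n1p and np1 are the marginal
    totals, and npp is the total number of bigrams). This is not the code
    in leftFisher.pm, just the idea:

     #!/usr/bin/perl
     # usage: leftfisher.pl n11 n1p np1 npp
     use strict;
     use warnings;

     my ($n11, $n1p, $np1, $npp) = @ARGV;

     # log factorial; adequate for moderate sample sizes
     sub lnfact { my $n = shift; my $s = 0; $s += log($_) for 2 .. $n; $s }

     # log of the hypergeometric probability of a table with joint count
     # $k, holding the marginal totals fixed
     sub lnhyper {
         my $k = shift;
         return lnfact($n1p) + lnfact($npp - $n1p)
              + lnfact($np1) + lnfact($npp - $np1)
              - lnfact($npp) - lnfact($k) - lnfact($n1p - $k)
              - lnfact($np1 - $k) - lnfact($npp - $n1p - $np1 + $k);
     }

     # sum the probabilities of every table in which the bigram occurs as
     # often as, or less often than, it was actually observed
     my $low = $n1p + $np1 - $npp;
     $low = 0 if $low < 0;
     my $p = 0;
     $p += exp(lnhyper($_)) for $low .. $n11;
     printf "left sided Fisher's value = %.6f\n", $p;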

   Why do you recommend the use of a left sided Fisher's exact test?
    The main advantage of the left sided test is that it is interpreted
    much like the other tests that we provide. The larger the value, the
    stronger the association between the words in the bigram. This is the
    same property as mutual information (tmi.pm), the Dice Coefficient
    (dice.pm), the log likelihood ratio (ll.pm), and Pearson's test
    (x2.pm). Rankings that
    are done by these measures can all be compared directly, where rank 1 is
    assigned to the most strongly associated bigram/s.

    A right sided test is symmetric with the left sided test, so larger
    values indicate weaker association between the words. There is nothing
    wrong with this approach, except that you can not directly compare the
    rankings of a right sided test with, for example, mutual information.

    A two sided test doesn't seem to have a very natural mapping to
    explaining bigram data.

   Why don't the log likelihood ratio (ll.pm) and Pearson's test (x2.pm) report the significance values associated with their raw scores?
    We don't assign significance values to these tests since setting the
    p-values seems to be a fairly arbitrary choice. It isn't clear that p
    should be .01, .05, or something else. This strikes us as an important
    limitation of significance tests. Rather than setting an arbitrary
    cutoff on these values, we recommend looking at the raw scores from
    these tests in order to establish cutoffs.

    Having said all this, it is very likely that future versions of NSP will
    have this capability.

   How do I decide which of the tests you provide is the one I should use?
    The tests provided here can be divided into three classes:

     1) Power Divergence Family:
            Log Likelihood ratio (ll.pm) 
            Pearson's Chi Squared test (x2.pm)

     2) Exact test of statistical significance
            Fisher's exact test (leftFisher.pm and rightFisher.pm)

     3) Information theoretic measures
            Mutual Information (tmi.pm)
            Dice Coefficient (dice.pm)

    If you would like to use a significance test with a predefined p-value
    then Fisher's exact test is your choice. We have tried to implement this
    as efficiently as possible, and it seems to run fairly quickly even with
    large sample sizes.

    If you would prefer to have a score that provides you with a measure of
    the divergence of the observed from the expected values (given an
    assumption of independence between the words in the bigram) then the
    power divergence family is a fine choice. Note that ll.pm and x2.pm
    should provide the same results if the data is such that no asymptotic
    assumptions are being violated. If you observe different values for the
    same bigram from these tests, then one of them is flawed! You can't be
    sure which one it is either. There is some very fine background material
    on the issue of the likelihood ratio (G^2) versus Pearson's test (X^2)
    in the following:

            @book{ReadC88,
            author={Read, T. and Cressie, N.},
            title={Goodness-of-Fit Statistics for Discrete Multivariate Data},
            year = {1988},
            address = {New York, NY},
            publisher = {Springer-Verlag}}

    If you are interested in Mutual Information (tmi.pm), then you have it,
    along with the closely related Dice Coefficient (dice.pm).
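
    Whichever measure you choose, it is handed to statistic.pl along with a
    destination file and a count.pl output file. For example, scoring
    bigram counts with the log likelihood ratio might look like the
    following (file names here are just placeholders; check the
    statistic.pl documentation for the exact usage on your version):

     count.pl corpus.cnt corpus.txt
     statistic.pl ll.pm corpus.ll corpus.cnt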

   How do I interpret the values of these test scores?
    Carefully. Subjectively. Creatively.

    For all of the tests, a higher score means a stronger measure of
    association among the words in the bigram.

    In general, we prefer to make comparisons among the tests based on the
    different rankings they assign rather than the absolute scores they
    return. However, you can directly compare the scores from the likelihood
    ratio (ll.pm) and Pearson's test (x2.pm). Otherwise, you should not
    compare these scores directly because they are apples and oranges. The
    rank program (rank.pl) ranks the bigrams as scored by any two given
    measures and will produce a table and rank correlation coefficient
    showing how they compare. Please note, however, that you should not
    include a right sided or two sided Fisher's test in such comparisons,
    since they do not follow the convention that a higher score means
    greater association between the words in the bigram. These tests should
    not be compared to others with the rank.pl program!
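
    For example, to compare the rankings produced by the log likelihood
    ratio and mutual information, you could score the same counts with each
    measure and then hand both output files to rank.pl (again, file names
    are placeholders; check the rank.pl documentation for the exact usage):

     statistic.pl ll.pm corpus.ll corpus.cnt
     statistic.pl tmi.pm corpus.tmi corpus.cnt
     rank.pl corpus.ll corpus.tmi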

   Fisher's left sided test seems to rank a lot of bigrams first! Why?
    As sample sizes get larger, the hypergeometric probabilities associated
    with each possible 2x2 table of bigram data (given fixed marginal
    totals) tend to approach 1. Consider setting the precision of the test
    relatively high (10 or 15 digits) in order to observe this. When the
    default setting is used (4 digits) there tends to be quite a lot of
    rounding to 1.0000.
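
    Assuming your copy of statistic.pl supports the --precision option
    (check its documentation to confirm), asking for more digits might look
    like this:

     statistic.pl --precision 15 leftFisher.pm corpus.leftFisher corpus.cnt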

    The paper "Fishing for Exactness" shows how these hypergeometric
    probabilities are computed. Find it with the 1996 entries at:
    <http://www.d.umn.edu/~tpederse/pubs.html> It is also available at:
    <http://xxx.lanl.gov/abs/cmp-lg/9608010>

  Questions about stop lists and removing non-tokens
   How does the stop list option (--stop) work?
    [Revised for version 0.53]

    A stop list is a list of words that you don't want count.pl to count.
    These words (or the Ngrams that they form) are not counted and do not
    figure into the overall sample size.

    Rather than specifying a list of words in a stoplist, as of version 0.53
    you can specify Perl regular expressions.

    There are two modes in which stop lists can be used. In the "AND" mode,
    every word that makes up an N-gram must be found in the stoplist for
    that N-gram to be eliminated. In the "OR" mode, any N-gram that includes
    at least one word from the stoplist will be eliminated.

    Your stop list should be a plain text file that has one Perl regular
    expression per line.

    For example, suppose your stop list consisted of:

     @stop.mode=OR
     /\bthe\b/
     /\band\b/
     /\bof\b/

    [Note that \b indicates a word boundary in a Perl regex.]

    Any N-gram that contains 'the', 'and', or 'of' would be excluded.

    A note on counting. Suppose that the bigram "of the" occurred 700 times
    in a text and was excluded by an AND mode stop list. Let's suppose for
    this example that it is the only bigram excluded. The total sample size
    reported by count.pl would be 700 less than without the stop list. The
    frequency count for "of" occurring as the first component of a bigram
    would be reduced by 700, as would the frequency count for "the"
    occurring as the second component of a bigram. Note that the counts for
    "of" as the second component and "the" as the first component will not
    be affected by the stoplist. Note that without the stop list it will
    typically be the case that the first and second position counts for a
    word will be the same.
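
    To make the difference between the two modes concrete, here is a small
    Perl sketch of the semantics (this is only an illustration, not the
    code count.pl actually uses):

     #!/usr/bin/perl
     # toy illustration of AND vs OR stop list modes
     use strict;
     use warnings;

     my @stop  = (qr/\bthe\b/, qr/\band\b/, qr/\bof\b/);  # stop expressions
     my @ngram = ('of', 'the');                           # one bigram

     # OR mode: drop the Ngram if ANY of its words matches a stop expression
     my $drop_or = 0;
     for my $word (@ngram) {
         $drop_or = 1 if grep { $word =~ $_ } @stop;
     }

     # AND mode: drop the Ngram only if EVERY word matches a stop expression
     my $drop_and = 1;
     for my $word (@ngram) {
         $drop_and = 0 unless grep { $word =~ $_ } @stop;
     }

     print "OR mode drops it: $drop_or   AND mode drops it: $drop_and\n";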

   Can I have modifiers in my stoplist regular expressions?
    No. :)

    The documentation suggests that given a list of regular expressions such
    as:

     /\bis\b/
     /\bthe\b/
     /\ban\b/

    NSP actually checks each regular expression one by one. Unfortunately,
    if the list of regexes is very long this becomes computationally too
    slow, so instead we concatenate all the regular expressions into one
    big regex, which is then used to do the matching. For example, the
    regexes above will be combined into a single regex, like so:

     /(\bis\b)|(\bthe\b)|(\ban\b)/

    and then this regex is used to do the matching. Of course, this
    produces exactly the same effect as checking the expressions one after
    another, and runs much faster. The price we pay is that you can't use
    modifiers (anything outside the enclosing slashes) when defining the
    regexes, since then we wouldn't be able to concatenate them together in
    this way.
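
    The combining step itself is straightforward; here is a minimal sketch
    of the idea (not the exact code used by count.pl), reading one
    expression per line from a stop list named on the command line:

     #!/usr/bin/perl
     # usage: combine.pl stoplist.txt
     use strict;
     use warnings;

     my @patterns;
     while (my $line = <>) {
         chomp $line;
         next unless $line =~ m{^/(.*)/$};  # keep what is between the slashes
         push @patterns, "($1)";
     }
     die "no regular expressions found\n" unless @patterns;

     # one big alternation, e.g. (\bis\b)|(\bthe\b)|(\ban\b)
     my $stop_regex = join '|', @patterns;

     # a single match now stands in for checking each expression in turn
     for my $word (qw(the telephone an busy)) {
         printf "%-10s %s\n", $word, ($word =~ /$stop_regex/ ? 'stop' : 'keep');
     }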

   Why is there a --stop and a --nontoken option for count.pl?
    The stop option allows you to specify a stop list that eliminates Ngrams
    if they are completely made up of stop words (AND mode) or if one of the
    words in the Ngram is a stop word (OR mode). The effect of the stop
    option is to remove Ngrams from the sample.

    The nontoken option allows you to eliminate words from the text prior to
    the formation of Ngrams. This processing occurs well before the stop
    option, which is carried out after Ngrams have been formed.
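
    For example, assuming a nontoken file that strips punctuation and the
    stop list shown earlier, the two options might be combined like this
    (file names are placeholders; see the documentation distributed with
    count.pl for the exact file formats):

     count.pl --nontoken nontoken.txt --stop stoplist.txt corpus.cnt corpus.txt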

   How can I disregard n-grams that cross sentence boundaries or other punctuation marks?
    For example, in:

     I am here today. Where are you?

     I do not want to consider "today Where are" as a 3-gram.

    There is a built-in option (--newLine) to disregard Ngrams that cross
    the newline character (\n), but there is not one to do the same across
    punctuation marks at this time. However, you can use the existing
    functionality to accomplish the same thing!

    Replace punctuation marks with new line characters and then use the
    option --newLine. This will disregard Ngrams that cross newline
    boundaries and thereby (in this case) cross punctuation marks.

    For example, assume file "sentence.txt" is as follows:

     this is a sentence. this is a sentence.

    Then the following command-line script:

     perl -e "while(<>){s/[.]/\n/g; print}" sentence.txt > sentence.tmp

    would create the following in file "sentence.tmp":

     this is a sentence
      this is a sentence

    (To use more punctuation marks, you'd want to put them inside the
    [square brackets])

    And then if you run the following:

     count.pl --token token_file.txt --newLine sentence.cnt sentence.tmp

    the following is produced in file "sentence.cnt":

     6
     a<>sentence<>2 2 2
     this<>is<>2 2 2
     is<>a<>2 2 2

    Of course, there'd be other ways of replacing punctuation marks with
    newlines. Programs like sed would probably do the trick too...

    There's one problem in this approach though. Assume the following is
    your original "sentence.txt" file:

     this is a sentence. this is
     a sentence.

    Replacing '.' with \n would produce the following:

     this is a sentence
     this is
     a sentence

    And then if you use --newLine, you'd not be able to catch the second
    "is<>a" bigram.

    To avoid this I would suggest the following:

     1. First replace all new lines with spaces
     2. Then replace punctuations with new lines.

    The following command-line script seems to work:

     perl -e "while(<>){chomp; s/[.]/\n/g; print}" sentence.txt > sentence.tmp

    Be warned though that very long lines can lead to very slow processing.

  Questions about rank.pl
   In rank.pl you use Spearman's rank correlation coefficient. Why not use Kendall's tau or some other correlation measure?
    The reason we use Spearman's measure is that it is geared towards
    correlation among ranked items. Kendall's tau can be used with the
    actual values of the ranked items. However, when comparing different
    kinds of tests of association such direct comparisons may not be
    meaningful. For example, the Dice Coefficient and the Log-likelihood
    ratio produce values that are on different scales, and you can't compare
    them directly. This is why we focus on comparing ranked values, which is
    why we use Spearman's.

    However, when your data has numerous ties in the ranks, you may want to
    employ a method other than the Spearman's provided via rank.pl, which
    re-ranks the tied items but continues to use the standard Spearman's
    formula. See the TODO list for more discussion on this point.
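
    For reference, Spearman's rank correlation is easy to compute once you
    have two lists of ranks; here is a small illustrative sketch using the
    textbook formula (with no correction for ties, which is essentially the
    situation described above):

     #!/usr/bin/perl
     # toy Spearman's rank correlation on two rank lists
     use strict;
     use warnings;

     # ranks assigned to the same five bigrams by two different measures
     my @rank1 = (1, 2, 3, 4, 5);
     my @rank2 = (2, 1, 3, 5, 4);

     my $n = @rank1;
     my $sum_d2 = 0;
     $sum_d2 += ($rank1[$_] - $rank2[$_]) ** 2 for 0 .. $n - 1;

     # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
     my $rho = 1 - 6 * $sum_d2 / ($n * ($n ** 2 - 1));
     printf "Spearman's rho = %.4f\n", $rho;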

  Questions about kocos.pl
   What is a kth order co-occurrence?
    Two words are 2nd order co-occurrences if they occur with words that
    occur with a particular word. For example, suppose that we observe the
    following bigrams in a corpus of text:

     telephone line
     busy line
     telephone operator

    These are first order co-occurrences.

    "telephone" and "busy" are said to be second order co-occurrences of
    each other since they both occur with "line". (Both are first order
    co-occurrences of "line"). Then "operator" is said to be a second order
    co-occurrence of "line" since it occurs with "telephone" which is a
    first order co-occurrence of "line".

    These relationships can be visualized as graphs or chains.

     line -> telephone -> operator

    kocos.pl finds such kth order relationships among words.
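
    To make the idea concrete, here is a small sketch that finds the second
    order co-occurrences of a target word from a list of bigrams; it only
    illustrates the relationship, and is not the algorithm used by
    kocos.pl:

     #!/usr/bin/perl
     # toy second order co-occurrences of a target word
     use strict;
     use warnings;

     my $target  = 'line';
     my @bigrams = (['telephone', 'line'],
                    ['busy', 'line'],
                    ['telephone', 'operator']);

     # first order: every word that occurs in a bigram with the target
     my %first;
     for my $b (@bigrams) {
         my ($w1, $w2) = @$b;
         $first{$w2} = 1 if $w1 eq $target;
         $first{$w1} = 1 if $w2 eq $target;
     }

     # second order: every word that occurs with a first order word
     my %second;
     for my $b (@bigrams) {
         my ($w1, $w2) = @$b;
         $second{$w2} = 1 if $first{$w1} and $w2 ne $target;
         $second{$w1} = 1 if $first{$w2} and $w1 ne $target;
     }

     print "first order  : ", join(', ', sort keys %first),  "\n";
     print "second order : ", join(', ', sort keys %second), "\n";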

AUTHORS
     Ted Pedersen, University of Minnesota, Duluth
     tpederse at d.umn.edu         

     Satanjeev Banerjee, Carnegie-Mellon University

BUGS
SEE ALSO
     NSP home : <http://ngram.sourceforge.net>
     NSP mailing list : <http://groups.yahoo.com/group/ngram/>

COPYRIGHT
    Copyright (C) 2000-2010 Ted Pedersen

    Permission is granted to copy, distribute and/or modify this document
    under the terms of the GNU Free Documentation License, Version 1.2 or
    any later version published by the Free Software Foundation; with no
    Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

    Note: a copy of the GNU Free Documentation License is available on the
    web at <http://www.gnu.org/copyleft/fdl.html> and is included in this
    distribution as FDL.txt.