NAME
    README Introduction to Ngram Statistics Package (Text-NSP)

SYNOPSIS
    This document provides a general introduction to the Ngram Statistics
    Package.

DESCRIPTION
  1. Introduction
    The Ngram Statistics Package (NSP) is a suite of programs that aids in
    analyzing Ngrams in text files. We define an Ngram as a sequence of 'n'
    tokens that occur within a window of at least 'n' tokens in the text;
    what constitutes a "token" can be defined by the user.

    In earlier versions (v0.1, v0.3, v0.4) this package was known as the
    Bigram Statistics Package (BSP). The name change reflects the widening
    scope of the package in moving beyond Bigrams to Ngrams.

    NSP consists of two core programs and a number of utility programs:

    Program count.pl takes flat text files as input and generates a list of
    all the Ngrams that occur in those files. The Ngrams, along with their
    frequencies, are output in descending order of their frequency.

    Program statistic.pl takes as input a list of Ngrams with their
    frequencies (in the format output by count.pl) and runs a user-selected
    statistical measure of association to compute a "score" for each Ngram.
    The Ngrams, along with their scores, are output in descending order of
    this score. The statistical score computed for each Ngram can be used to
    decide whether or not there is enough evidence to reject the null
    hypothesis (that the Ngram is not a collocation) for that Ngram.

    Various utility programs are found in bin/utils/ and take as their input
    the results (output) from count.pl and/or statistic.pl.

    rank.pl takes as input two files output by statistic.pl and computes
    Spearman's rank correlation coefficient on the Ngrams that are common to
    both files. Typically the two files should be produced by applying
    statistic.pl on the same Ngram count file but by using two different
    statistical measures. In such a scenario, the value output by rank.pl
    can be used to measure how similar the two measures are. A value
    close to 1 would indicate that these two measures rank Ngrams in the
    same order, -1 that the two orderings are exactly opposite to each other
    and 0 that they are not related.

    kocos.pl takes as input a file output by count.pl or statistic.pl and
    uses that to identify kth order co-occurrences of a given word. A kth
    order co-occurrence of a target WORD is a word that co-occurs with a
    (k-1)th order co-occurrence of the given target WORD. So A is a 2nd
    order co-occurrence of X if X occurs with B and B occurs with A. Put
    more concretely, in "New York", "New" and "York" co-occur (they are
    1st order co-occurrences). In "New Jack", "New" and "Jack" co-occur.
    Thus, "Jack"
    and "York" are second order co-occurrences because they both co-occur
    with "New".

    combig.pl will take the output of count.pl and find unordered counts of
    bigrams. Normally count.pl treats bigrams like "fine wine" and "wine
    fine" as distinct. combig.pl (combine bigram) will adjust the counts
    such that they do not depend on the order. So one could then go on to
    measure how much the words "fine" and "wine" are associated without
    respect to their order.

    huge-count.pl allows a user to run count.pl on much larger corpora. It
    essentially takes the complete bigram list that count.pl generates
    with the --tokenlist option, splits that list into smaller pieces, and
    then sorts and merges the pieces to produce the final output.
    huge-count.pl also uses bin/utils/huge-split.pl,
    bin/utils/huge-sort.pl, bin/utils/huge-merge.pl and
    bin/utils/huge-delete.pl.

    This README continues with an introduction to the basic definitions of
    tokens, the tokenization process and the Ngram formation process. This
    is followed by a description of the two main programs in this suite
    (count.pl and statistic.pl) and brief notes on how one could typically
    use each of them. The programs rank.pl, kocos.pl, and combig.pl are
    described in separate READMEs in the /utils directory.

  2. Tokens
    We define a token as a contiguous sequence of characters that match one
    of a set of regular expressions. These regular expressions may be
    user-provided, or, if not provided, are assumed to be the following two
    regular expressions:

      \w+        -> this matches a contiguous sequence of alpha-numeric characters

      [\.,;:\?!] -> this matches a single punctuation mark

    For example, assume the following is a line of text:

    "the stock markets fell by 20 points today!"

    Then, using the above regular expressions, we get the following tokens:

        the       stock     markets
        fell      by        20
        points    today     !

    Now assume that the user provides the following lone regular expression:

      [a-zA-Z]+  -> this matches a contiguous sequence of alphabetic characters

    Then, we get the following tokens:

        the       stock     markets
        fell      by        points
        today

  3. The Tokenization Process:
    Given a text file and a set of regular expressions, the text is
    "tokenized", that is, broken up into tokens. To do so, the entire input
    text is considered as one long "input string" with new-line characters
    being replaced by space characters (this is the default behaviour and
    can be modified; see point 4 below). Then, the following is done:

     while the input string is non-empty

        foreach regular expression r
            if r is matched by a sequence of characters starting with the first
            character in the input string...
                quit this for loop
            end if
        end foreach

        if we have a matching regular expression r
            the portion of the input string matched by r is our next token. remove
            this token from the input string.
        else
            remove the first character from the input string
        end if

     end while

   3.1 Notes:
    3.1.1. In looking for a regular expression that yields a successful
    match (in the foreach loop above), we want a regular expression that
    matches the input string starting with the first character of the input
    string. Thus, the regular expression /b/ matches the input string "be
    good" but not the input string " be good".

    3.1.2. If none of the regular expressions give a successful match, then
    the first character in the input string is removed. This character is
    considered a "non-token" and is henceforth ignored.

    3.1.3. Since the matching process (the foreach loop above) stops at the
    first match, the order in which the regular expressions are tested is
    important. The order is exactly the order in which they are provided by
    the user, or if the default regular expressions are used, the order in
    which they are listed above.
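
    For concreteness, here is a minimal Perl sketch of the loop described
    above (an illustration of the process, not the actual count.pl code),
    using the two default token definitions:

     use strict;
     use warnings;

     my @regexes = (qr/\w+/, qr/[\.,;:\?!]/);   # tried in this order
     my $input   = "why's the stock falling?";
     my @tokens;

     while (length $input) {
         my $matched = 0;
         for my $re (@regexes) {
             # \A anchors the match at the first character of the string
             if ($input =~ s/\A($re)//) {
                 push @tokens, $1;
                 $matched = 1;
                 last;
             }
         }
         # no regular expression matched: the first character is a
         # non-token, so remove it and carry on
         substr($input, 0, 1) = '' unless $matched;
     }

     print join(' ', @tokens), "\n";   # why s the stock falling ?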

   3.2 Examples:
   3.2.1 Example 1:
    3.2.1.1. Input text:

        why's the stock falling?

    3.2.1.2. Regular expressions:

        \w+
        [\.,;:\?!]

    3.2.1.3. Resulting tokens:

        why       s         the
        stock     falling   ?

    3.2.1.4. Explanation:

    Initially our input string is the entire input text: "why's the stock
    falling?". The first token found is "why" which matches the regular
    expression /\w+/. This token is removed, and our input string becomes
    "'s the stock falling?".

    Now neither of the regular expressions can match the ' character. Thus
    this character is considered a non-token and is removed, leaving the
    input string like so: "s the stock falling?".

    "s" is now matched by /\w+/, and this forms our next token. Upon
    removing this token, we get the following input string " the stock
    falling?".

    Again, neither of the regular expressions match this input string, and
    the leading space character is removed as a non-token. Similarly the
    rest of the line is tokenized to yield the tokens "the", "stock",
    "falling" and "?".

   3.2.2 Example 2:
    3.2.2.1. Input text:

        why's the stock falling?

    3.2.2.2. Regular expressions:

        /fall/
        /falling/
        /stock/

    3.2.2.3. Resulting tokens:

        stock     fall

    3.2.2.4. Explanation:

    Initially our input string is the entire input text: "why's the stock
    falling?". None of the regular expressions match, and we remove the
    first character to get as input string the following: "hy's the stock
    falling?". Similarly, again the regular expressions don't match, and we
    have to remove the first character. This goes on until our input string
    becomes: "stock falling?".

    Now "stock" matches the regular expression /stock/, and this token is
    removed, leaving " falling?" as the input string. Since the space
    character does not form a token, it is removed. Now we have "falling?"
    as our input string.

    Now observe that we have two regular expressions, /fall/ and /falling/,
    both of which can match the input string. However, since /fall/ appears
    before /falling/ in the list, the token formed is "fall". This leaves
    our input string as: "ing?". None of the regular expressions match this
    or any of the subsequent input strings obtained by removing one by one
    the first characters. Hence we get as tokens "stock" and "fall".

   3.2.3 Example 3:
    3.2.3.1. Input text:

        why's the stock falling?

    3.2.3.2. Regular expressions:

        /falling/
        /fall/
        /stock/

    3.2.3.3. Resulting tokens:

        stock     falling

    3.2.3.4. Explanation:

    Observe that this example differs from the previous one only in the
    order of the regular expressions. The tokenization proceeds exactly as
    in the previous example, until we have as our input string "falling?".
    Here, we have /falling/ as our first regular expression, and so we get
    "falling" as our token.

    Examples 3.2.2 and 3.2.3 demonstrate the importance of the order in
    which the regular expressions are provided to the tokenization process.

   3.2.4. Example 4:
    3.2.4.1. Input text:

        why's the stock falling?

    3.2.4.2. Regular expressions:

        /the stock/
        /\w+/

    3.2.4.3. Resulting tokens:

        why       s       the stock
        falling

    3.2.4.4. Explanation:

    The thing to note here is that one of the regular expressions has an
    embedded space character in it. This causes no problems: our definition
    of a token allows embedded space characters in them! Once our input
    string is "the stock falling?", the regular expression /the stock/ is
    matched, and the string "the stock" forms our next token.

  4. Ngrams:
    An Ngram is a sequence of n tokens. We shall delimit tokens in an Ngram
    by the diamond symbol, i.e. "<>". Thus, "big<>boy<>" is a bigram whose
    tokens are "big" and "boy". Similarly, "stock<>falling<>?<>" is a
    trigram whose tokens are "stock" and "falling" and "?". "the
    stock<>falling<>" is a bigram with tokens "the stock" and "falling".

    Given a piece of text, Ngrams are usually formed of contiguous tokens.
    For instance, lets take example 3.2.1, where our tokens, in the order in
    which they appear in the text, are the following:

        why      s      the      stock      falling      ?

    Then, the following are all the bigrams:

        why<>s<>            s<>the<>        the<>stock<>
        stock<>falling<>    falling<>?<>

    The following are all the trigrams:

        why<>s<>the<>           s<>the<>stock<>
        the<>stock<>falling<>   stock<>falling<>?<>

    The following are all the 4-grams:

        why<>s<>the<>stock<>
        s<>the<>stock<>falling<>
        the<>stock<>falling<>?<>

    Etcetera.

    The Ngrams shown above are all formed from contiguous tokens. Although
    this is the default, we also allow Ngrams to be formed from
    non-contiguous tokens.

    To do so, we first define a "window" of size k to be a sequence of k
    contiguous tokens, where the value of k is greater than or equal to the
    value of n for the Ngrams. An Ngram can be formed from any n tokens as
    long as all the tokens belong to a single window of size k. Further the
    n tokens must occur in the Ngram in exactly the same order as they occur
    in the window.

    Put another way, given a window of k tokens, we drop k-n tokens from the
    window, and what remains is an Ngram!

    Thus for instance, taking example 3.2.1 again, recall that our tokens in
    the order in which they occur in the text are the following:

        why      s      the      stock      falling      ?

    Then, the following are all the bigrams with a window size of 3:

        why<>s<>               why<>the<>         s<>the<>
        s<>stock<>             the<>stock<>       the<>falling<>
        stock<>falling<>       stock<>?<>         falling<>?<>

    The following are all the bigrams with a window size of 4:

        why<>s<>               why<>the<>         why<>stock<>
        s<>the<>               s<>stock<>         s<>falling<>
        the<>stock<>           the<>falling<>     the<>?<>
        stock<>falling<>       stock<>?<>         falling<>?<>

    The following are all the trigrams with a window size of 4:

        why<>s<>the<>          why<>s<>stock<>     why<>the<>stock<>
        s<>the<>stock<>        s<>the<>falling<>   s<>stock<>falling<>
        the<>stock<>falling<>  the<>stock<>?<>     the<>falling<>?<>
        stock<>falling<>?<>

    Etc.
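
    As a sketch of this windowing process (an illustration, not NSP's
    actual implementation), the following Perl snippet reproduces the list
    of bigrams with a window size of 3 shown above:

     use strict;
     use warnings;
     use List::Util qw(min);

     my @tokens = qw(why s the stock falling ?);
     my $k = 3;                        # window size for bigrams (n = 2)
     my %seen;

     for my $i (0 .. $#tokens - 1) {
         # a bigram pairs token i with any later token in the same window
         for my $j ($i + 1 .. min($i + $k - 1, $#tokens)) {
             my $bigram = "$tokens[$i]<>$tokens[$j]<>";
             print "$bigram\n" unless $seen{$bigram}++;
         }
     }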

  5. Program count.pl:
    This program takes as input a flat ASCII text file and outputs all
    Ngrams, or token sequences of length 'n', where the value of 'n' can be
    decided by the user. Non-contiguous Ngrams within a window of size 'k'
    as described above can also be found and output. For every output Ngram,
    its frequency of occurrence as well as the frequencies of all the
    combinations of the tokens it is made up of are output. Details follow.

   5.1. Default Way to Run count.pl:
    The most basic way of running this program is the following:

    Example 5.1: count.pl output.txt input.txt

    where input.txt is the input text file in which to find the Ngrams and
    output.txt is the output file into which count.pl will put all the
    Ngrams with their frequencies.

   5.2. Changing the Length of Ngrams and the Size of the Window:
    Several default values are in use when the program is run this way. For
    example it is assumed that one is counting bigrams, that is the value of
    'n' is 2. This can be changed by using the option --ngram N, where 'N'
    is the number of tokens you want in each Ngram. Thus, to find all
    trigrams in input.txt, run count.pl thus:

    Example 5.2: count.pl --ngram 3 output.txt input.txt

    Another default value in use is the window size. Window size defaults to
    the value of 'n' for Ngrams. Thus, in example 5.1 the window size was 2
    while in example 5.2, because of the --ngram 3 option, the window size
    was 3. This can be changed using the --window N option. Thus, for
    example to find all bigrams within windows of size 3, one would run the
    program like so:

    Example 5.3a: count.pl --window 3 output.txt input.txt

    Similarly, to find all trigrams within a window of size 4:

    Example 5.3b: count.pl --ngram 3 --window 4 output.txt input.txt

   5.3. Using User-Provided Token Definitions:
    In all these examples, the tokenization and Ngram formation proceeds as
    described in sections 3 and 4 above. In these examples, the default
    token definitions are used:

     \w+        -> this matches a contiguous sequence of alpha-numeric characters
     [\.,;:\?!] -> this matches a single punctuation mark

    As mentioned previously, these default token definitions can be
    over-ridden by using the option --token FILE, where FILE is the name of
    the file containing the regular expressions on which the token
    definitions will be based. Each regular expression in this FILE should
    be on a line of its own, and should be delimited by the forward slash
    '/'. Further, these should be valid Perl regular expressions, as defined
    in [1], which means for example that any occurrence of the forward slash
    '/' within the regular expression must be 'escaped'.
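
    For example, a token file that reproduces the two default token
    definitions would contain the following two lines:

     /\w+/
     /[\.,;:\?!]/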

   5.4 Removing character strings via --nontoken option:
    This option allows a user to define regular expressions that will match
    strings that should not be considered as tokens. These strings will be
    removed from the data and not counted or included in Ngrams.

    The --nontoken option is recommended when there are predictable
    sequences of characters that you know should not be included as tokens
    for purposes of counting Ngrams, finding collocations, etc.

    For example, if mark-up symbols like <s>, <p>, [item], [/ptr] exist in
    text being processed, you may want to include those in your list of
    nontoken items so they are discarded. If not, a simple regex such as
    /\w+/ will match 's', 'p', 'item', 'ptr' from these tags, leading
    to confusing results.

    The --nontoken option on the command line should be followed by a file
    name (NON_TOKEN). This file should contain Perl regular expressions
    delimited by forward slashes '/' that define non-tokens. Multiple
    expressions may be placed on separate lines or be separated via the '|'
    (Perl 'or') as in /regex1|regex2|../

    The following are some examples of valid non-token definitions.

     /<\/?[sp]>/ : will remove xml tags like <s>, <p>, </s>, </p>.

     /\[\w+\]/  : will remove all words which appear in square brackets like
             [p], [item], [123] and so on.

    count.pl will first remove any string from the input data that matches
    the non-token regular expression, and only then will match the remaining
    data against the token definitions. Thus, if by chance a string matches
    both the token and nontoken definitions, it will be removed as
    --nontoken has a higher priority than --token or the default token
    definition.

   5.5. The Output Format of count.pl:
    Assume that the following are the contents of the input text file to
    count.pl; let us call the file test.txt:

     first line of text
     second line
     and a third line of text

    Further assume that count.pl is run like so:

     count.pl test.cnt test.txt

    Thus, test.cnt will have all the bigrams found in file test.txt using a
    window size of 2 and using the two default tokens as above. Following
    then are the contents of file test.cnt:

     11
     line<>of<>2 3 2
     of<>text<>2 2 2
     second<>line<>1 1 3
     line<>and<>1 3 1
     and<>a<>1 1 1
     a<>third<>1 1 1
     first<>line<>1 1 3
     third<>line<>1 1 3
     text<>second<>1 1 1

    The number on the first line, 11, indicates that there were a total of
    11 bigrams in the input file.

    From the next line onwards, the various bigrams found are listed. Recall
    that the tokens of the Ngrams are delimited by the diamond signs: <>.
    Thus the bigram on the first line is line<>of<>, made up of the tokens
    "line" and "of" in that order; the bigram on the second line is
    of<>text<>, made up of the tokens "of" and "text", etc.

    After the diamond following the last token there are three numbers. The
    first of these numbers denotes the number of times this Ngram occurs in
    the input text file. Thus bigram line<>of<> occurs 2 times in the input
    file, as does bigram of<>text<>. The second number denotes in how many
    bigrams the token "line" occurs as the left-hand-token. In this case,
    "line" occurs on the left of three bigrams, namely two copies of the
    bigram "line<>of<>" and the bigram "line<>and<>". Similarly, the third
    number denotes the number of bigrams in which the word "of" occurs as
    the right-hand-token. In this case, "of" occurs on the right of two
    bigrams, namely the two copies of the bigram "line<>of<>".

    Similar output is obtained for trigrams. Assume again that the input
    file is as above, and assume that count.pl is run thus:

     count.pl --ngram 3 test.cnt test.txt

    The output test.cnt file is as follows:

     10
     line<>of<>text<>2 3 2 2 2 2 2
     and<>a<>third<>1 1 1 1 1 1 1
     third<>line<>of<>1 1 3 2 1 1 2
     second<>line<>and<>1 1 3 1 1 1 1
     line<>and<>a<>1 3 1 1 1 1 1
     a<>third<>line<>1 1 1 2 1 1 1
     text<>second<>line<>1 1 1 2 1 1 1
     of<>text<>second<>1 1 1 1 1 1 1
     first<>line<>of<>1 1 3 2 1 1 2

    Once again, the number on the first line says that there are 10 trigrams
    in the input text file. The first trigram in the list is
    "line<>of<>text<>" made up of the tokens "line", "of" and "text" in that
    order. Similarly, the next trigram is "and<>a<>third<>" made of the
    tokens "and", "a" and "third".

    Observe that this time there are more numbers after the last token. The
    first number denotes, as before, the number of times this trigram occurs
    in the input text file. Thus, "line<>of<>text" occurs twice in the input
    file while "and<>a<>third" occurs just once. The second, third and
    fourth numbers denote the number of trigrams in which the tokens "line",
    "of" and "text" appear in the first, second and third positions
    respectively. Thus, "line" occurs as the token in the first position in
    3 trigrams, namely 2 copies of "line<>of<>text<>" and one copy of
    "line<>and<>a<>". Similarly, the tokens "of" and "text" appear as the
    second and third tokens respectively of two trigrams, namely the two
    copies of "line<>of<>text<>".

    The fifth number denotes the number of trigrams in which "line" occurs
    as the first token and "of" occurs as the second token. Once again,
    there are only two trigrams in which this happens: the two copies of
    "line<>of<>text<>". The sixth number denotes the number of trigrams in
    which "line" occurs as the token in the first place and "text" occurs
    as the token in the third place. The seventh number denotes the number
    of trigrams in which "of" occurs as the token in the second place and
    "text" occurs as the token in the third place.

    In general, assume we are dealing with Ngrams of size 'n'. Given an
    Ngram, denote its leftmost token as w[0], the next token as w[1], and so
    on until w[n-1]. Further let f(a, b, ..., c) be the number of Ngrams
    that have token w[a] in position a, token w[b] in position b, ... and
    token w[c] in position c, where 0 <= a < b < ... < c < n.

    Then, given an ngram, the first frequency value reported is f(0, 1, ...,
    n-1).

    This is followed by n frequency values, f(0), f(1), ..., f(n-1).

    This is followed by (n choose 2) values, f(0, 1), f(0, 2), ..., f(0,
    n-1), f(1, 2), ..., f(1, n-1), ... f(n-2, n-1).

    This is followed by (n choose 3) values, f(0, 1, 2), f(0, 1, 3), ...,
    f(0, 1, n-1), f(0, 2, 3), ..., f(0, 2, n-1), ..., f(0, n-2, n-1), ...,
    f(1, 2, 3), ..., f(n-3, n-2, n-1).

    And so on, until (n choose n-1), that is n, frequency values f(0, 1,
    ..., n-2), f(0, 1, ..., n-3, n-1), f(0, 1, ..., n-4, n-2, n-1), ...,
    f(1, 2, ..., n-1).

    This gives us a total of 2^n-1 possible frequency values. We call each
    such frequency value a "frequency combination", since it expresses the
    number of Ngrams that has a given combination of one or more tokens in
    one or more fixed positions. By default all such combinations are
    printed, exactly in the order shown above. To see which combinations
    are being printed one could use the option --get_freq_combo FILE. This
    prints to the file the inputs to the imaginary 'f' function defined
    above exactly in the order the frequency values occur in the main
    output. Thus for instance, running the program like so:

     count.pl --get_freq_combo freq_combo.txt test.cnt test.txt

    Assuming that the test.txt file is the one shown above, the following
    output is created in file freq_combo.txt:

     0 1
     0
     1

    and the following output in file test.cnt:

     11
     line<>of<>2 3 2
     of<>text<>2 2 2
     second<>line<>1 1 3
     line<>and<>1 3 1
     and<>a<>1 1 1
     a<>third<>1 1 1
     first<>line<>1 1 3
     third<>line<>1 1 3
     text<>second<>1 1 1

    Recall that since the option --ngram is not being used, the default
    value of n, 2, is being used here. After each bigram in the test.cnt
    file are three numbers; the first number corresponds to f(0, 1), the
    second number corresponds to f(0) and the third to f(1). Observe that
    line 'i' of the output in file freq_combo.txt represents the input
    to the imaginary 'f' function that creates the 'i_th' frequency value on
    each line of the output in file test.cnt.

    Similarly, running the program thus:

     count.pl --ngram 3 --get_freq_combo freq_combo.txt test.cnt test.txt

    produces the following output in freq_combo.txt:

     0 1 2
     0
     1
     2
     0 1
     0 2
     1 2

    and the following output in file test.cnt

     10
     line<>of<>text<>2 3 2 2 2 2 2
     and<>a<>third<>1 1 1 1 1 1 1
     third<>line<>of<>1 1 3 2 1 1 2
     second<>line<>and<>1 1 3 1 1 1 1
     line<>and<>a<>1 3 1 1 1 1 1
     a<>third<>line<>1 1 1 2 1 1 1
     text<>second<>line<>1 1 1 2 1 1 1
     of<>text<>second<>1 1 1 1 1 1 1
     first<>line<>of<>1 1 3 2 1 1 2

    The seven numbers after each trigram in file test.cnt correspond
    respectively to f(0, 1, 2), f(0), f(1), f(2), f(0, 1), f(0, 2) and f(1,
    2), as shown in the file freq_combo.txt.
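
    The default order of these frequency combinations can be generated
    mechanically: the full Ngram first, then all subsets of size 1, 2,
    ..., n-1 in lexicographic order. The following Perl sketch (an
    illustration, not the code count.pl uses) prints the combinations in
    this order for a given Ngram size:

     use strict;
     use warnings;

     my $n = 3;                                  # Ngram size
     print join(' ', 0 .. $n - 1), "\n";         # f(0, 1, ..., n-1) first
     for my $size (1 .. $n - 1) {                # then the smaller subsets
         my @combo = (0 .. $size - 1);
         while (1) {
             print join(' ', @combo), "\n";
             # advance @combo to the next combination in lexicographic order
             my $i = $size - 1;
             $i-- while $i >= 0 && $combo[$i] == $n - $size + $i;
             last if $i < 0;
             $combo[$i]++;
             $combo[$_] = $combo[$_ - 1] + 1 for $i + 1 .. $size - 1;
         }
     }

    With $n set to 3 this prints exactly the seven lines of the
    freq_combo.txt file shown above; with $n set to 2 it prints the three
    lines of the bigram version.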

    It is possible that the user may not require all the frequency values
    output by default, or that the user requires the frequency values in a
    different order. To change the default frequency values output, one may
    provide count.pl with a file containing the inputs to the 'f' function
    using the option --set_freq_combo.

    Thus for instance, if the user wants to create trigrams, and only
    requires the frequencies of the trigrams and the frequency values of the
    three tokens in the trigrams (and not of the pairs of tokens), then he
    may create the following file (say, user_freq_combo.txt):

     0 1 2
     0
     1
     2

    and provide this file to the count.pl program thus:

    count.pl --ngram 3 --set_freq_combo user_freq_combo.txt test.cnt
    test.txt

    this produces the following test.cnt file:

     10
     line<>of<>text<>2 3 2 2
     and<>a<>third<>1 1 1 1
     third<>line<>of<>1 1 3 2
     second<>line<>and<>1 1 3 1
     line<>and<>a<>1 3 1 1
     a<>third<>line<>1 1 1 2
     text<>second<>line<>1 1 1 2
     of<>text<>second<>1 1 1 1
     first<>line<>of<>1 1 3 2

    Observe that the only difference between this output and the default
    output is that instead of reporting 7 frequency values per ngram, only
    the 4 requested are output.

    count2huge.pl converts the output of count.pl to the format produced
    by huge-count.pl. The program sorts the bigrams in alphabetical order
    and generates the same output as huge-count.pl. The reason we sort the
    bigrams is that when the bigram list is used to generate a
    co-occurrence matrix for the vector relatedness measure of
    UMLS-Similarity, bigrams that start with the same term must be grouped
    together; sorting the bigrams when creating the co-occurrence matrix
    improves efficiency.

   5.6. "Stopping" the Ngrams:
    The user may "stop" the Ngrams formed by count.pl by providing a list of
    stop-tokens through the option --stop FILE. Each stop token in FILE
    should be a Perl regular expression that occurs on a line by itself.
    This expression should be delimited by forward slashes, as in /REGEX/.
    All regular expression capabilities in Perl are supported except for
    regular expression modifiers (like the "i" in /REGEX/i).

    The following are a few examples of valid entries in the stop list.

     /^\d+$/
     /\bthe\b/
     /\b[Tt][Hh][Ee]\b/
     /^and$/
     /\bor\b/
     /^be(ing)?$/

    There are two modes in which a stop list can be used, AND and OR. The
    default mode is AND, which means that an Ngram must be made up entirely
    of words from the stoplist before it is eliminated. The OR mode
    eliminates an Ngram if any of the words that make up the Ngram are found
    in the stoplist.

    The mode is specified via an extended option that should appear on the
    first line of the stop file. For example,

     @stop.mode=AND
     /^for$/
     /^the$/
     /^\d+$/

    would eliminate bigrams such as 'for the', 'for 10', etc. (where both
    elements of the bigram are from the stop list), but will not remove
    bigrams like '10 dollars' or 'of the'.

     @stop.mode=OR
     /^for$/
     /^the$/
     /^\d+$/

    would eliminate bigrams such as 'for our', '10 dollars', etc. (where at
    least one element of the bigram is from the stop list).

    If the @stop.mode= option is not specified, the default value is AND.

    In both modes, Ngrams that are eliminated do not add to the various
    Ngram and individual word frequency counts. Ngrams that are "stoplisted"
    are treated as if they never existed and are not counted.
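
    The AND/OR rule can be summarized in a few lines of Perl (a sketch of
    the logic, not the actual count.pl code): in OR mode an Ngram is
    eliminated if at least one of its words matches some stop regex, and
    in AND mode only if every word does.

     # returns true if the Ngram (a list of words) should be eliminated
     sub stoplisted {
         my ($mode, $regexes, @words) = @_;
         my $matches = grep {
             my $w = $_;
             grep { $w =~ $_ } @$regexes;
         } @words;
         return $mode eq 'OR' ? $matches > 0 : $matches == @words;
     }

     # stoplisted('AND', [qr/^for$/, qr/^the$/], 'for', 'the') is true;
     # stoplisted('AND', [qr/^for$/, qr/^the$/], 'of',  'the') is false.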

   5.6.1 Usage Notes for Regular Expressions in Stop Lists:
    (1) In Perl regular expressions, \b specifies word boundary and ^ and $
    specify the start and end of a string (or line of text). These can be
    used in defining your stop list entries, but must be used somewhat
    carefully.

    count.pl examines each token individually, thereby treating each as a
    separate string or line. As a result, you can use either /\bregex\b/ or
    /^regex$/ to exactly match a token made up of alphanumeric characters,
    as in /\bcat\b/ or /^cat$/. However, please note that if a token
    consists of other characters (as in n.b.a.) they can behave
    differently. Suppose for example that your token is www.dot.com. If
    you have a stop list entry /\bwww\b/ it will match the 'www' portion
    of the token, since the '.' is considered to be a word boundary.
    /^www$/ would not have that problem.

    (2) If instead of /^the$/, regex /the/ is used as a stop regex, then
    every token that matches /the/ will be removed. So tokens like 'there',
    'their', 'weather', 'together' will be excluded with the stop regex
    /the/. On the other hand, with the regex /^the$/, only occurrences of
    the word 'the' will be removed.

    (3) You can also use a stop regex /^the/ to remove tokens that begin
    with 'the' like 'their' or 'them' but not 'together'. Similarly, stop
    regex /the$/ will remove all tokens which end in 'the' like 'swathe' or
    'tithe' but not 'together' or 'their'.

    (4) Please note that stoplist handling changed as of version 0.53. If
    you use a stoplist developed for an earlier version of NSP, then it will
    not behave in the same way!!

    In earlier versions when you specified /regex/ as a stoplist item, we
    assumed that you really meant /\bregex\b/ and proceeded accordingly.
    However, since regular expressions are now fully supported we require
    that you specify exactly what you mean. So if you include /is/ as a
    member of your stoplist, we will now assume that you mean any word that
    contains 'is' somewhere within it (like 'this' or 'kiss' or 'isthmus'
    ...). To preserve the functionality of your old stoplists, simply convert
    them from

     /the/
     /is/
     /of/

    to

     /\bthe\b/
     /\bis\b/
     /\bof\b/

    (5) Regex modifiers like i or g which come after the end slash like:

     /regex/i
     /regex/g

    are not supported. See FAQ.txt for an explanation.

    This makes it slightly inconvenient to specify that you would like to
    stop any form of a given word. For example, if you wanted to stop 'THE',
    'The', 'THe', etc. you would have to specify a regex such as

     /[Tt][Hh][Ee]/

   5.6.2. Differences between --nontoken and --stop:
    In theory we can remove "unwanted" words using either the --nontoken
    option or the --stop option. However, these are rather different
    techniques.

    --stop only removes stop words after they are recognized as valid
    tokens. Thus, if you wish to remove some markup tags like [p] or [item]
    from the data using a stop list, you first need to recognize these as
    tokens (via a --token definition like /\[\w+\]/) and then remove them
    with a --stop list.

    In addition, the --stop option operates on an Ngram and does not remove
    individual words. It removes Ngrams (and reduces the count of the number
    of Ngrams in the sample). In other words, the --stop option only comes
    into effect after the Ngrams have been created.

    On the other hand, the --nontoken option eliminates individual
    occurrence of a non-token sequence before finding Ngrams.

    Some examples to clarify the distinction between --stop and --nontoken

    -----------------------------------------------------------------------

    Consider an input file count.input =>

      [ptr] <s> this is a test written for count.pl </s> [/ptr]
      their them together wither tithe

    NontokenFile nontoken.regex =>

      /\[\/?\w+\]/
      /<\/?\w+>/

    case (a) StopFile stopfile.txt => /the/
    ----------------------------------------

    Running count.pl with the command :

     count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

    will first remove all nontokens from the input file. Hence the tokenized
    text from which the bigrams will be created will be =>

      this is a test written for count.pl
      their them together wither tithe

    Since the StopFile contains /the/ all tokens which include 'the' are
    eliminated. Thus, the bigrams:

     their<>them<>
     them<>together<>
     together<>wither<>
     wither<>tithe<>

    will all be removed. This is because each word in each bigram contains
    "the" and the default stop mode is AND. Note that if there was a bigram
    such as "on<>their<>" it would not be removed since both words do not
    match the stoplist. The output file count.out will contain the
    following:

     count.out=>

     9
     test<>written<>1 1 1
     this<>is<>1 1 1
     a<>test<>1 1 1
     is<>a<>1 1 1
     for<>count<>1 1 1
     .<>pl<>1 1 1
     count<>.<>1 1 1
     written<>for<>1 1 1
     pl<>their<>1 1 1

    case (b) StopFile stopfile.txt => /^the/

    ----------------------------------------

    Running count.pl with the command:

     count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

    will first remove all nontokens from the input file. The tokenized text
    will be:

            this is a test written for count.pl
            their them together wither tithe

    Since the StopFile contains /^the/, all tokens which begin with "the"
    are eliminated. Thus, the bigram

     their<>them<>

    will be removed since it consists of two words that begin with "the".
    The output file count.out will contain the 12 bigrams as shown below.

     count.out=>

     12
     test<>written<>1 1 1
     this<>is<>1 1 1
     a<>test<>1 1 1
     is<>a<>1 1 1
     for<>count<>1 1 1
     them<>together<>1 1 1
     .<>pl<>1 1 1
     count<>.<>1 1 1
     written<>for<>1 1 1
     pl<>their<>1 1 1
     wither<>tithe<>1 1 1
     together<>wither<>1 1 1

     case (c) StopFile stopfile.txt => @stop.mode=OR
              /the$/

    ------------------------------------------------

    Running count.pl with the command:

     count.pl --stop stopfile.txt --nontoken nontoken.regex count.out count.input

    will first remove all nontokens from the input file. Hence the tokenized
    text will be:

            this is a test written for count.pl
            their them together wither tithe

    As the StopFile contains /the$/, all tokens which end in 'the' are
    stop words. Thus, in the bigram

     wither<>tithe<>

    only "tithe" matches the stoplist, since it ends with "the".
    Nevertheless, this bigram will be eliminated, since the stop mode is
    OR (meaning that if either word is in the stop list then the bigram is
    eliminated). The output file count.out will contain the 12 bigrams as
    shown below.

     count.out=>

     12
     test<>written<>1 1 1
     this<>is<>1 1 1
     a<>test<>1 1 1
     is<>a<>1 1 1
     for<>count<>1 1 1
     them<>together<>1 1 1
     .<>pl<>1 1 1
     their<>them<>1 1 1
     count<>.<>1 1 1
     written<>for<>1 1 1
     pl<>their<>1 1 1
     together<>wither<>1 1 1

   5.7. Removing and Not Displaying Low Frequency Ngrams:
    We allow the user to either remove or to not display low frequency
    Ngrams. The user can remove low frequency Ngrams by using the option
    --remove N, by which all Ngrams that occur less than N times are removed.
    The Ngram and the individual frequency counts are adjusted accordingly
    upon the removal of these Ngrams.

    The user can choose not to display low frequency Ngrams by using the
    option --frequency N, by which Ngrams that occur less than N times are
    not displayed in the output. Note that this differs from the --remove
    option above in that the various frequency counts are not changed.
    Intuitively, we continue to believe that these Ngrams have occurred in
    the text - we are simply not interested in looking at them. By contrast,
    in the --remove option we want to act as if the Ngrams never occurred
    in the text in the first place, and so we want our numbers to agree
    with that too!

   5.8. Extended Output:
    Observe that one may modify the actual counting process in various ways
    through the various options above. To keep a "record" of which options
    were used and with what values, one can turn the "extended" output on
    with the switch --extended. The extended output records the size of the
    Ngram, the size of the window, the frequency value at which the Ngrams
    were removed and a list of all the source files used to create the count
    output. If a switch was not used, the default value is printed.

   5.9. Histogram Output:
    The user can also generate a "histogram" output by using the --histogram
    FILE option. This histogram output shows how many Ngrams occurred with
    a given frequency. Following is a typical line out of a
    histogram output:

     Number of n-grams that occurred   5 time(s) =    14 (40.94 percent)

    This says that there were 14 distinct Ngrams that occurred 5 times each,
    and between themselves they make up around 41% of the total number of
    Ngrams.

   5.10. Searching for Source Files in Directories, Recursively if Need Be:
    One would usually provide a source file to create Ngrams from. One
    could also provide a directory name, in which case all text files in
    that directory are used. If, along with a directory name, one also
    uses the switch --recurse, all subdirectories inside the source
    directory are searched for text files recursively, and all text files
    so found are used as well.

  6. Program statistic.pl:
    Program statistic.pl takes as input a list of Ngrams with their
    frequencies in the format output by count.pl and runs a user-selected
    statistical measure of association to compute a "score" for each Ngram.
    The Ngrams, along with their scores, are output in descending order of
    this score.

    The statistical measures of association are implemented separately in
    separate Perl packages (files ending with .pm extension). When running
    statistic.pl, the user needs to provide the name of a statistical
    measure (either from among the ones provided as a part of this
    distribution or those written by the user). Say the name of the
    statistic provided by the user is X. Program statistic.pl will then look
    for Perl package X.pm (in the current directory, or, failing that, the
    system path). If found, this Perl package file will be loaded and then
    used to calculate the statistic on the list of Ngrams provided.

    Please remember to include the path of Measures Directory (in the main
    NSP Package directory) in your system path. This will enable the
    statistic.pl program to find the modules provided with this package.

    As a part of this distribution, we provide the following statistical
    packages: dice, log-likelihood (ll), mutual information (mi), the
    chi-squared test (x2), and the left-sided Fisher's exact test
    (leftFisher). All these packages follow a fixed set of rules as
    discussed below. It is hoped that these rules are easy to follow and
    that new packages may be written quickly and easily.

    In a sense, program statistic.pl is a framework. Its job is to take as
    input Ngrams with their frequencies, to provide those frequencies to the
    statistical library and to format the output from that library. The
    heart of the statistical measure - the actual calculation - lies in the
    library that can be plugged in. This framework allows for quickly
    rigging up new measures; to do so one need worry only about the actual
    calculation, and not about the various mundane issues that are taken
    care of by statistic.pl.

    This section follows with details on how to run statistic.pl, and then
    the format of the libraries and tips on how to write them.

   6.1. Default Way to Run statistic.pl:
    The default way to run statistic.pl is like so:

    statistic.pl dice test.dice test.cnt

      where: dice      is the name of the statistic library to be loaded.
             test.dice is the name of the output file in which the results
                       of applying the dice coefficient will be stored.
             test.cnt  is the name of the input file containing the Ngrams
                       and their various frequency values.

    A Perl package with filename dice.pm is searched for in the Perl @INC
    path. Instead of writing just "dice" on the command line, one may also
    write the file name "dice.pm", or the full measure name
    "Text::NSP::Measures::2D::Dice::dice".

    Once such a file is found, it is imported into statistic.pl and tests
    are done to see if this file has the minimum requirements for a
    statistical library (more details below). If these tests fail,
    statistic.pl stops with an error message. Otherwise the library is
    initialized and then for each Ngram in file test.cnt, its frequency
    values are passed to it and its calculated value is noted. Finally, when
    all values have been calculated, the Ngrams are sorted on their
    statistic value and output to file test.dice.

    For example, assume our input test.cnt file is this:

      11
      line<>of<>2 3 2
      of<>text<>2 2 2
      second<>line<>1 1 3
      line<>and<>1 3 1
      and<>a<>1 1 1
      a<>third<>1 1 1
      first<>line<>1 1 3
      third<>line<>1 1 3
      text<>second<>1 1 1

    Thus there are 11 bigrams, the first of which is "line<>of<>", the
    second "of<>text<>" etc.

    Running statistic.pl thus: statistic.pl dice test.dice test.cnt will
    produce the following test.dice file:

      11
      of<>text<>1 1.0000 2 2 2
      and<>a<>1 1.0000 1 1 1
      a<>third<>1 1.0000 1 1 1
      text<>second<>1 1.0000 1 1 1
      line<>of<>2 0.8000 2 3 2
      third<>line<>3 0.5000 1 1 3
      line<>and<>3 0.5000 1 3 1
      second<>line<>3 0.5000 1 1 3
      first<>line<>3 0.5000 1 1 3

    Once again, the first number is the total number of bigrams - 11. On the
    next line is the highest ranked bigram "of<>text<>". The first number
    following this bigram, 1, is its rank. The next number, 1.0000, is its
    value computed using the dice statistic. The final three numbers are
    exactly the numbers associated with this Ngram in the test.cnt file.
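
    For reference, the dice coefficient of a bigram is 2 * n11 / (n1p +
    np1), computed from exactly those three numbers. The following sketch
    (a hand computation, not the Text::NSP code) checks the score of
    "line<>of<>":

     use strict;
     use warnings;

     # "line<>of<>2 3 2" from test.cnt: n11 = 2, n1p = 3, np1 = 2
     my ($w1, $w2, $rest) = split /<>/, "line<>of<>2 3 2";
     my ($n11, $n1p, $np1) = split ' ', $rest;
     printf "%s<>%s<> dice = %.4f\n", $w1, $w2, 2 * $n11 / ($n1p + $np1);
     # prints: line<>of<> dice = 0.8000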

    Observe that three other bigrams also have the same score of 1.0000 and
    so the same rank 1. The bigram with the next highest score of 0.8000,
    "line<>of<>", is ranked 2nd instead of 5th. This is a feature of our
    ranking mechanism; the fact that a bigram has a rank 'r' implies that
    there are r-1 distinct scores greater than the score of this Ngram. It
    does not imply that there are r-1 bigrams with higher scores.

   6.2. Changing the Default Ngram Size:
    By default, the Ngrams in the input file are assumed to be bigrams. This
    can however be changed by using the option --ngram. Given an Ngram size
    (either by default or by using the --ngram option), statistic.pl checks
    if there are exactly the correct number of tokens in each Ngram. If this
    is not true, an error is printed and statistic.pl halts.

   6.3. Defining the Meaning of the Frequency Values:
    The "meaning" of the various frequency values after each Ngram in the
    input file is important in that the statistic calculated depends on
    them. By default, the meanings as defined by count.pl are assumed.

    count.pl and all statistical libraries (.pm modules) provided with this
    package are implemented such that they produce/accept the frequency
    values in the same order. So for an ngram,

                word1<>word2<>...wordn-1<>

    "the first frequency value reported is f(0,1,...n-1); this is the
    frequency of the Ngram itself. This is followed by n frequency values
    f(0), f(1),...f(n-1); these are the frequencies of the individual tokens
    in their specific positions in the given Ngram. This is followed by (n
    choose 2) values, f(0,1), f(0,2), ..., f(0,n-1), f(1,2), ..., f(1,n-1),
    ... f(n-2,n-1). This is followed by (n choose 3) values, f(0,1,2),
    f(0,1,3), ..., f(0,1,n-1), f(0,2,3), ... , f(0,2,n-1), ... f(0,n-2,n-1),
    f(1,2,3), ..., f(n-3,n-2,n-1). And so on, until (n choose n-1), that is
    n, frequency values f(0,1,...n-2), f(0,1,..n-3,n-1), f(0,1,...n-4,n-2,n-1),
    ..., f(1,2,...n-1)"

    (The above explanation is from "The Design, Implementation and Use of
    the Ngram Statistics Package" [2].)

    So the bigram output of count.pl/bigram input to any statistical library
    will be something like -

        word1<>word2<>f(0,1) f(0) f(1)

    Or you can also view this as

          word1<>word2<>n11 n1p np1

    where n1p,np1 represent marginal totals in a 2x2 contingency table.
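
    For instance, the line "line<>of<>2 3 2" from the test.cnt file above
    corresponds to n11 = 2, n1p = 3 and np1 = 2. The remaining cells of
    the 2x2 table follow from these marginals: n12 = n1p - n11, n21 = np1
    - n11, and n22 = npp - n1p - np1 + n11, where npp is the total number
    of bigrams (11 in that example).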

    Similarly, the trigram output of count.pl/trigram input to a trigram
    statistical library (such as Text::NSP::Measures::3D::MI::ll) will be -

        word1<>word2<>word3<>f(0,1,2) f(0) f(1) f(2) f(0,1) f(0,2) f(1,2)

    Or you can also view this as

        word1<>word2<>word3<>n111 n1pp np1p npp1 n11p n1p1 np11

    where n1pp,np1p,npp1,n11p,n1p1,np11 represent marginal frequencies in a
    three-dimensional (2x2x2) contingency table.

    The frequency combinations being used can be output to a file by using
    the option --get_freq_combo.

    If count.pl was run with a set of user-defined frequency combinations
    different from the defaults, then the file containing these frequency
    combinations must be provided to statistic.pl using the option
    --set_freq_combo.

    If the number of frequency values does not match the number expected
    (either through the default frequency combinations or through the user
    defined ones provided through the --set_freq_combo option) then an error
    is reported. Besides checking that the number of frequency values is
    correct, nothing else is checked.

   6.4. Modifying the Output of statistic.pl:
    One may request statistic.pl to ignore all Ngrams which have a frequency
    less than a user-defined threshold by using the --frequency option. To
    be able to do this however, the Ngram frequency should be present among
    the various frequency values in the input Ngram file. It is possible to
    set up a frequency combination file that prevents count.pl from printing
    the actual frequency of each Ngram; if such a file is given to
    statistic.pl, the frequency cut-off requested through option --frequency
    will be ignored and a warning issued to that effect.

    Once the statistical values for the Ngrams are calculated and the Ngrams
    have been ranked according to these values, one may request not to print
    Ngrams below a certain rank. This can be done using the option --rank.
    Unlike the frequency cut-off above, all calculations are done and then
    Ngrams that fall below a certain rank are cut-off. In the frequency
    cut-off, calculations are not performed on the Ngrams that are ignored.

    The values returned by the statistic libraries may be floating point
    numbers; by default 4 places of decimal are shown. This can be changed
    by using the option --precision through which the user can decide how
    many places of decimal he wishes to see. Note that the values returned
    by the library are rounded to the places of decimal requested by the
    user, and THEN the ranking is done. Thus two Ngrams that actually have
    different scores, but whose scores both round to the same number for
    the given precision, will get the same rank!

    The user can also use the statistical score to cut off Ngrams. Thus,
    using the option --score, one may request statistic.pl to not print
    Ngrams that get a score less than the given threshold.

    Similar to count.pl, the user can request statistic.pl to print extended
    information by using the --extended switch. Without this switch, all
    extended information already in the input file will be lost; with it,
    all of it will be preserved and new extended data will be output.

    The output of statistic.pl is not formatted for human eyes - this can be
    done using the switch --format. Columns will be aligned as much as
    possible and the output is (often) neater than the default output.

   6.5. The Measures of Association Provided in This Distribution:
    We provide the following measures of association with this
    distribution: thirteen suitable for use with bigrams, four for
    trigrams, and one for 4-grams.

    The bigram measures are:

    *   Dice Coefficient (Text::NSP::Measures::2D::Dice::dice)

    *   Fisher's exact test - left sided
        (Text::NSP::Measures::2D::Fisher::left)

    *   Fisher's exact test - right sided
        (Text::NSP::Measures::2D::Fisher::right)

    *   Fisher's exact test - twotailed
        (Text::NSP::Measures::2D::Fisher::twotailed)

    *   Jaccard Coefficient (Text::NSP::Measures::2D::Dice::jaccard)

    *   Log-likelihood ratio (Text::NSP::Measures::2D::MI::ll)

    *   Mutual Information (Text::NSP::Measures::2D::MI::tmi)

    *   Odds Ratio (Text::NSP::Measures::2D::odds)

    *   Pointwise Mutual Information (Text::NSP::Measures::2D::MI::pmi)

    *   Phi Coefficient (Text::NSP::Measures::2D::CHI::phi)

    *   Pearson's Chi Squared Test (Text::NSP::Measures::2D::CHI::x2)

    *   Poisson Stirling Measure (Text::NSP::Measures::2D::MI::ps)

    *   T-score (Text::NSP::Measures::2D::CHI::tscore)

    The trigram measures are:

    *   Log-likelihood ratio (Text::NSP::Measures::3D::MI::ll)

    *   Mutual Information (Text::NSP::Measures::3D::MI::tmi)

    *   Pointwise Mutual Information (Text::NSP::Measures::3D::MI::pmi)

    *   Poisson Stirling Measure (Text::NSP::Measures::3D::MI::ps)

    The 4-gram measure is:

    *   Log-likelihood ratio (Text::NSP::Measures::4D::MI::ll)

    Any of these measures can be used as follows:

      statistic.pl XXXX output.txt input.txt

    where XXXX is the name of the measure.
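
    For example, to score the bigrams in test.cnt with the log-likelihood
    ratio:

      statistic.pl Text::NSP::Measures::2D::MI::ll test.ll test.cnt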

    More information on how to write a new statistic library is provided in
    the documentation (perldoc) of Text::NSP::Measures. A few additional
    details about the Measures can be found in their respective perldocs.
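
    The measures can also be called directly from Perl code. The
    following is a minimal sketch based on the functional interface
    documented in the perldocs of the individual measures (consult those
    perldocs for the authoritative details), using the marginal totals of
    "line<>of<>" from the earlier example:

     use Text::NSP::Measures::2D::MI::ll;

     my $score = calculateStatistic( n11 => 2,
                                     n1p => 3,
                                     np1 => 2,
                                     npp => 11 );
     if ( my $errorCode = getErrorCode() ) {
         print STDERR $errorCode, " - ", getErrorMessage(), "\n";
     } else {
         printf "log-likelihood score = %.4f\n", $score;
     }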

  7. Referencing:
    If you write a paper that has used NSP in some way, we'd certainly be
    grateful if you sent us a copy and referenced NSP. We have a published
    paper about NSP that provides a suitable reference:

     @inproceedings{BanerjeeP03,
            author = {Banerjee, S. and Pedersen, T.},
            title = {The Design, Implementation, and Use of the {N}gram {S}tatistic {P}ackage},
            booktitle = {Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics},
            pages = {370-381},
            year = {2003},
            month ={February},
            address = {Mexico City}}

    This paper can be found at:

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/cicling2003.ps>

    or

    <http://cpansearch.perl.org/src/TPEDERSE/Text-NSP-1.13/doc/cicling2003.pdf>

AUTHORS
    Ted Pedersen, University of Minnesota, Duluth tpederse at d.umn.edu

    Satanjeev Banerjee

    Amruta Purandare

    Saiyam Kohli

    Last modified by : $Id: README.pod,v 1.12 2010/04/26 18:30:46 liux0395
    Exp $

BUGS
    Please report to the NSP mailing list

SEE ALSO
    *   NSP Home: <http://ngram.sourceforge.net>

    *   Mailing List : <http://groups.yahoo.com/group/ngram/>

  8. Acknowledgments:
    This work has been partially supported by a National Science Foundation
    Faculty Early CAREER Development award (#0092784) and by a Grant-in-Aid
    of Research, Artistry and Scholarship from the Office of the Vice
    President for Research and the Dean of the Graduate School of the
    University of Minnesota.

COPYRIGHT
    Copyright (C) 2000-2010, Ted Pedersen, Satanjeev Banerjee, Amruta
    Purandare, Bridget Thomson-McInnes, Saiyam Kohli, and Ying Liu

    This program is free software; you can redistribute it and/or modify it
    under the terms of the GNU General Public License as published by the
    Free Software Foundation; either version 2 of the License, or (at your
    option) any later version.

    This program is distributed in the hope that it will be useful, but
    WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General
    Public License for more details.

    You should have received a copy of the GNU General Public License along
    with this program; if not, write to

        The Free Software Foundation, Inc.,
        59 Temple Place - Suite 330,
        Boston, MA  02111-1307, USA.

    Note: a copy of the GNU General Public License is available on the web
    at <http://www.gnu.org/licenses/gpl.txt> and is included in this
    distribution as GPL.txt.