Module Version: 0.03 (latest release: Text-Ngram-0.08)

NAME

Text::Ngram - Basis for n-gram analysis

SYNOPSIS

  use Text::Ngram qw(ngram_counts add_to_counts);
  my $text   = "abcdefghijklmnop";
  my $hash_r = ngram_counts($text, 3); # Window size = 3
  # $hash_r => { abc => 1, bcd => 1, ... }

  my $more_text = "qrstuvwxyz";
  add_to_counts($more_text, 3, $hash_r); # Adds the new 3-gram counts to $hash_r

DESCRIPTION

N-gram analysis is a technique in textual analysis that counts the character sequences seen through a sliding window, in order to aid topic analysis, language determination and so on. The n-gram spectrum of a document can be used to compare and filter documents in multiple languages, prepare word prediction networks, and perform spelling correction.

The neat thing about n-grams, though, is that they're really easy to determine. For n=3, for instance, we compute the n-gram counts like so:

    the cat sat on the mat
    ---                     $counts{"the"}++;
     ---                    $counts{"he "}++;
      ---                   $counts{"e c"}++;
       ...
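
The same sliding window can be sketched in a few lines of pure Perl. This is only an illustration of the counting step, not the module's XS code; naive_ngram_counts is a hypothetical name, and it skips the text preparation described later.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl helper for illustration; not part of Text::Ngram.
sub naive_ngram_counts {
    my ($text, $window) = @_;
    my %counts;
    # Slide a window of $window characters across the text, one step at a time.
    # If the text is shorter than the window, the range is empty.
    for my $i (0 .. length($text) - $window) {
        $counts{ substr($text, $i, $window) }++;
    }
    return \%counts;
}

my $counts = naive_ngram_counts("the cat sat on the mat", 3);
printf "'%s' => %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Each n-gram becomes a hash key, so repeated sequences such as "the" simply bump an existing counter.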

This module provides an efficient XS-based implementation of n-gram spectrum analysis.

There are two functions which can be imported:

    $href = ngram_counts($text[, $window]);

This first function returns a hash reference with the n-gram histogram of the text for the given window size. If the window size is omitted, then 5-grams are used, which seems to be a relatively standard choice.

    add_to_counts($more_text, $window, $href);

This incrementally adds to the supplied hash; if $window is zero or undefined, then the window size is computed from the hash keys.
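
As a rough sketch of the incremental behaviour, here is a pure-Perl stand-in. It assumes that "computed from the hash keys" means reusing the length of a key already in the hash; that reading is a guess, and naive_add_to_counts is a hypothetical name rather than the module's own routine.

```perl
use strict;
use warnings;

# Hypothetical pure-Perl stand-in for add_to_counts; not the module's XS code.
sub naive_add_to_counts {
    my ($text, $window, $counts) = @_;
    # Assumption: with no window given, reuse the length of an existing key.
    unless ($window) {
        my ($any_key) = keys %$counts;
        die "cannot infer window size from an empty hash\n"
            unless defined $any_key;
        $window = length $any_key;
    }
    # Add the new text's n-grams to the existing histogram in place.
    $counts->{ substr($text, $_, $window) }++
        for 0 .. length($text) - $window;
    return $counts;
}

my %counts = (abc => 1, bcd => 1);             # an existing 3-gram histogram
naive_add_to_counts("bcde", undef, \%counts);  # window inferred as 3
```

Because the hash is modified in place, the same reference can be fed one chunk of text after another, which is handy when a document is too large to hold in memory at once.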

Important note on text preparation

Most of the published algorithms for textual n-gram analysis assume that the only characters you're interested in are alphabetic characters and spaces. So before the text is counted, the following preparation is made.

All characters are lowercased (most papers use upper-casing, but that just feels so 1970s); punctuation and numerals are replaced by stop characters flanked by blanks; multiple spaces are compressed into a single space.

After the counts are made, n-grams containing stop characters are dropped from the hash.
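
The preparation steps can be approximated in pure Perl as follows. This is a sketch under stated assumptions: "\xff" is used as the stop character purely for illustration (the module may use a different one), only ASCII letters are treated as alphabetic, and prepare_text and prepared_ngram_counts are hypothetical names.

```perl
use strict;
use warnings;

# Assumption: "\xff" stands in for the stop character; the module may differ.
my $STOP = "\xff";

# Hypothetical helper illustrating the preparation steps described above.
sub prepare_text {
    my ($text) = @_;
    $text = lc $text;                  # lowercase everything
    $text =~ s/[^a-z\s]/ $STOP /g;     # punctuation and numerals -> " STOP "
    $text =~ s/\s+/ /g;                # compress runs of whitespace
    return $text;
}

sub prepared_ngram_counts {
    my ($text, $window) = @_;
    $text = prepare_text($text);
    my %counts;
    $counts{ substr($text, $_, $window) }++
        for 0 .. length($text) - $window;
    # After counting, drop every n-gram that contains a stop character.
    for my $gram (keys %counts) {
        delete $counts{$gram} if index($gram, $STOP) >= 0;
    }
    return \%counts;
}

my $counts = prepared_ngram_counts("The cat, the hat!", 3);
printf "'%s' => %d\n", $_, $counts->{$_} for sort keys %$counts;
```

Flanking each stop character with blanks means that no surviving n-gram straddles a punctuation mark: "cat" and "the" on either side of the comma never combine into a single window.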

If you prefer to do your own text preparation, use the internal routines process_text and process_text_incrementally instead of ngram_counts and add_to_counts respectively.

SEE ALSO

Cavnar, W. B. (1993). N-gram-based text filtering for TREC-2. In D. Harman (Ed.), Proceedings of TREC-2: Text Retrieval Conference 2. Washington, DC: National Bureau of Standards.

Shannon, C. E. (1951). Prediction and entropy of printed English. The Bell System Technical Journal, 30. 50-64.

Ullmann, J. R. (1977). Binary n-gram technique for automatic correction of substitution, deletion, insertion and reversal errors in words. Computer Journal, 20. 141-147.

SUPPORT

Beep... beep... this is a recorded announcement:

I've released this software because I find it useful, and I hope you might too. But I am a being of finite time and I'd like to spend more of it writing cool modules like this and less of it answering email, so please excuse me if the support isn't as great as you'd like.

Nevertheless, there is a general discussion list for users of all my modules, to be found at http://lists.netthink.co.uk/listinfo/module-mayhem

If you have a problem with this module, someone there will probably have it too.

AUTHOR

Simon Cozens, simon@cpan.org

COPYRIGHT AND LICENSE

Copyright 2003 by Simon Cozens

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
