NAME

Text::Ngram - Basis for n-gram analysis

SYNOPSIS

  use Text::Ngram qw(ngram_counts add_to_counts);
  my $text   = "abcdefghijklmnop";
  my $hash_r = ngram_counts($text, 3); # Window size = 3
  # $hash_r => { abc => 1, bcd => 1, ... }

  add_to_counts($more_text, 3, $hash_r);

DESCRIPTION

n-Gram analysis is a field in textual analysis which uses sliding window character sequences in order to aid topic analysis, language determination and so on. The n-gram spectrum of a document can be used to compare and filter documents in multiple languages, prepare word prediction networks, and perform spelling correction.

The neat thing about n-grams, though, is that they're really easy to determine. For n=3, for instance, we compute the n-gram counts like so:

    the cat sat on the mat
    ---                     $counts{"the"}++;
     ---                    $counts{"he "}++;
      ---                   $counts{"e c"}++;
       ...

This module provides an efficient XS-based implementation of n-gram spectrum analysis.

There are two functions which can be imported:

    $href = ngram_counts($text[, $window]);

This first function returns a hash reference with the n-gram histogram of the text for the given window size. If the window size is omitted, then 5-grams are used. This seems relatively standard.

    add_to_counts($more_text, $window, $href)

This incrementally adds to the supplied hash; if $window is zero or undefined, then the window size is computed from the hash keys.

Important note on text preparation

Most of the published algorithms for textual n-gram analysis assume that the only characters you're interested in are alphabetic characters and spaces. So before the text is counted, the following preparation is made.

All characters are lowercased; (most papers use upper-casing, but that just feels so 1970s) punctuation and numerals are replaced by stop characters flanked by blanks; multiple spaces are compressed into a single space.

After the counts are made, n-grams containing stop characters are dropped from the hash.

If you prefer to do your own text preparation, use the internal routines process_text and process_text_incrementally instead of count_ngrams and add_to_counts respectively.

SUPPORT

Beep... beep... this is a recorded announcement:

I've released this software because I find it useful, and I hope you might too. But I am a being of finite time and I'd like to spend more of it writing cool modules like this and less of it answering email, so please excuse me if the support isn't as great as you'd like.

Nevertheless, there is a general discussion list for users of all my modules, to be found at http://lists.netthink.co.uk/listinfo/module-mayhem

If you have a problem with this module, someone there will probably have it too.

AUTHOR

Simon Cozens, simon@cpan.org

COPYRIGHT AND LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

To install Text::Ngram, copy and paste the appropriate command in to your terminal.

cpanm

cpanm Text::Ngram

CPAN shell

perl -MCPAN -e shell
install Text::Ngram

For more information on module installation, please visit the detailed CPAN module installation guide.

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)