NAME

String::Markov - A Moo-based, text-oriented Markov Chain module

VERSION

version 0.009

SYNOPSIS

  # Character-based chain (the defaults), built from files
  my $mc = String::Markov->new();

  $mc->add_files(@ARGV);

  print $mc->generate_sample . "\n" for (1..20);


  # Alternatively: an order-1, word-based chain built from strings
  my $mc = String::Markov->new(order => 1, sep => ' ');

  for my $stanza (@The_Rime_of_the_Ancient_Mariner) {
        $mc->add_sample($stanza);
  }

  print $mc->generate_sample;

DESCRIPTION

String::Markov is a Moo-based Markov Chain module, designed to easily consume and produce text.

ATTRIBUTES

order

The order of the chain, i.e. how much past state is used to determine the next state. The default of 2 is reasonable for constructing new names/words when splitting into characters, or for long-ish works when splitting into words.

split_sep

How samples are split into states. This value (or sep; see "new()") is passed directly as the first argument to "split" in perlfunc, so ' ' has its special meaning there: the sample is split on runs of whitespace, and leading whitespace is ignored. Regular expressions also work, but be aware that any characters they match are discarded.
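The effect of different split_sep values can be tried with plain split, without the module. The following is a minimal sketch; states_for() is a hypothetical helper introduced here for illustration and is not part of String::Markov.

```perl
use strict;
use warnings;

# Hypothetical helper: split a sample the way a given split_sep would.
sub states_for {
    my ($split_sep, $sample) = @_;
    return split($split_sep, $sample);
}

# Empty string: one state per character.
my @chars = states_for('', 'Ann');

# Single space: split on runs of whitespace, ignoring leading whitespace.
# (A ' ' held in a variable is treated this way from Perl 5.18 on.)
my @words = states_for(' ', '  the  sun ');

# Regular expression: the matched comma and spaces are discarded.
my @parts = states_for(qr/,\s*/, 'a, b,c');

print join('|', @chars), "\n";
print join('|', @words), "\n";
print join('|', @parts), "\n";
```

Character splitting suits name or word generation, while ' ' suits longer texts split into words.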

join_sep

How states are joined. This value (or sep; see "new()") is passed as the first argument to "join" in perlfunc. It is also used to build keys for internal hashes, which can cause collisions: if splitting produces the sequences ('ae', 'io'), ('a', 'ei', 'o'), and ('ae', 'i', 'o'), all three turn into the key 'aeio' with the default of ''. If join_sep is '*' instead, then three unique keys result: 'ae*io', 'a*ei*o', and 'ae*i*o'. See "add_sample()".
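The key collision described above can be reproduced with join alone. This is a small sketch; state_key() is a hypothetical helper, not a method of the module.

```perl
use strict;
use warnings;

# Hypothetical helper: build a hash key from a list of states.
sub state_key {
    my ($join_sep, @states) = @_;
    return join($join_sep, @states);
}

my @sequences = (['ae', 'io'], ['a', 'ei', 'o'], ['ae', 'i', 'o']);

for my $join_sep ('', '*') {
    my %seen;
    $seen{ state_key($join_sep, @$_) }++ for @sequences;
    printf "join_sep '%s': %d distinct key(s)\n",
        $join_sep, scalar keys %seen;
}
# prints:
#   join_sep '': 1 distinct key(s)
#   join_sep '*': 3 distinct key(s)
```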

null

What is used to mark the beginning and end of a sample internally. The default of "\0" should work for UTF-8 text, but may cause problems with UTF-16 or other encodings.

stable

Whether or not to always produce the same results from the same internal state. If stable is true, then the same random seed (see "srand" in perlfunc) will produce identical results for chains created from the same inputs.

normalize

Whether to normalize Unicode strings. This value, if true, is passed as the first argument to Unicode::Normalize::normalize. The default 'C' should do what most people expect, but it may be the case that 'D' is what you want. If you're not using Unicode, set this to undef.
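To see why normalization matters for a character-based chain, the sketch below compares forms 'C' and 'D' using Unicode::Normalize directly. normalized_length() is a hypothetical helper added for illustration.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize);

# Hypothetical helper: number of code points after normalizing to $form.
sub normalized_length {
    my ($form, $string) = @_;
    return length(normalize($form, $string));
}

my $composed   = "\x{E9}";    # U+00E9, a single code point
my $decomposed = "e\x{301}";  # 'e' plus U+0301 COMBINING ACUTE ACCENT

# Form C composes: both spellings become the same one-character string,
# so a character-based chain sees a single state.
print "C: same\n"
    if normalize('C', $composed) eq normalize('C', $decomposed);

# Form D decomposes: both spellings become two code points,
# so a character-based chain sees two states.
printf "D: %d code points\n", normalized_length('D', $composed);
```

With form 'C' an accented letter is one state; with form 'D' the base letter and the accent are separate states that the chain may recombine.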

do_chomp

Whether to chomp lines when reading files (see "chomp" in perlfunc). See "add_files()".

METHODS

new()

  # Defaults
  my $mc = String::Markov->new(
        order     => 2,
        sep       => '',
        split_sep => undef,
        join_sep  => undef,
        null      => "\0",
        stable    => 1,
        normalize => 'C',
        do_chomp  => 1,
  );

The sep argument doesn't correspond to an attribute; it is used to initialize split_sep and/or join_sep when either is undefined.

See "ATTRIBUTES".
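The way sep fills in split_sep and join_sep can be sketched as follows. resolve_separators() is a hypothetical helper that mirrors the rule as documented; it is not the module's own constructor code.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the documented rule: sep only supplies
# a value for whichever of split_sep / join_sep was left undefined.
sub resolve_separators {
    my (%arg) = @_;
    my $sep = delete $arg{sep};
    $sep //= '';                 # documented default for sep
    $arg{split_sep} //= $sep;
    $arg{join_sep}  //= $sep;
    return %arg;
}

# Both separators come from sep.
my %words = resolve_separators(sep => ' ');

# An explicit split_sep is kept; only join_sep comes from sep.
my %mixed = resolve_separators(sep => ' ', split_sep => qr/\s+/);

print "words: join_sep is '$words{join_sep}'\n";
print "mixed: join_sep is '$mixed{join_sep}'\n";
```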

split_line()

This is the method "add_sample()" calls when it is passed a non-ref argument. It returns a list of states (usually individual characters or words) that are used to build the Markov Chain model.

The default implementation is equivalent to:

  sub split_line {
        my ($self, $sample) = @_;
        $sample = normalize($self->normalize, $sample) if $self->normalize;
        return split($self->split_sep, $sample);
  }

This method can be overridden to deal with unusual data.

add_sample()

This method adds samples to build the Markov Chain model. It takes a single argument, which can be either a string or an array reference. If the argument is an array reference, its elements are directly used to update the Markov Chain. If it is a string, add_sample() uses the split_line() method to create an array of states, and then updates the Markov Chain.

Note that this method generates hash keys for the transition matrix. The keys are built according to the order, null, and join_sep attributes, so if an instance is created with:

  my $mc = String::Markov->new(null => '!', order => 2, join_sep => '*');
  $mc->add_sample($_) for (@sample_lines);

Then the internal transition matrix might look like:

  {
    '!*!' => { 'A' => 5, 'B' => 7, ... }, # Initial state
    '!*A' => { ... },
    '!*B' => { ... },
    ...
    'x*y' => { '!' => 4 },                # always end after 'xy'
    'y*z' => { '!' => 3, 'q' => 2 },      # sometimes end after 'yz'
    ...
  }
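A matrix of this shape can be built with a sliding window over the states. The code below is a minimal sketch of that bookkeeping, assuming the layout shown above; count_transitions() is a hypothetical stand-in and not the module's actual implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the bookkeeping add_sample() performs.
sub count_transitions {
    my ($matrix, $states, %opt) = @_;
    my $order    = $opt{order}    // 2;
    my $null     = $opt{null}     // "\0";
    my $join_sep = $opt{join_sep} // '';

    # The initial state is a window of $order null markers.
    my @window = ($null) x $order;

    # A trailing null records the end of the sample as a transition.
    for my $next (@$states, $null) {
        my $key = join($join_sep, @window);
        $matrix->{$key}{$next}++;
        shift @window;
        push @window, $next;
    }
    return $matrix;
}

my %matrix;
count_transitions(\%matrix, [split //, $_], null => '!', join_sep => '*')
    for qw(AB AC);

# %matrix now holds:
#   '!*!' => { 'A' => 2 },
#   '!*A' => { 'B' => 1, 'C' => 1 },
#   'A*B' => { '!' => 1 },
#   'A*C' => { '!' => 1 },
```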

add_files()

This is a simple convenience method, designed to replace code like:

  while(<>) { chomp; $mc->add_sample($_) }

It takes a list of file names as arguments and adds each line of each file as a sample. Lines are chomped first if do_chomp is true.
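For readers who need different file handling, the loop can be written out by hand. This is a sketch under the assumption that each line is one sample; each_line_of_files() is a hypothetical helper, and how the module itself opens files is not described here.

```perl
use strict;
use warnings;

# Hypothetical helper: call $add->($line) for every line of every file.
sub each_line_of_files {
    my ($add, $do_chomp, @files) = @_;
    for my $file (@files) {
        open(my $fh, '<', $file) or die "Cannot open $file: $!";
        while (my $line = <$fh>) {
            chomp $line if $do_chomp;
            $add->($line);
        }
        close $fh;
    }
    return;
}

# With a String::Markov object in $mc, usage might look like:
#   each_line_of_files(sub { $mc->add_sample($_[0]) }, 1, @ARGV);
```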

generate_sample()

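This method returns a sequence of states, generated from the Markov Chain using the Monte Carlo method: each next state is chosen at random, weighted by how often it followed the current state in the samples.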

If called in scalar context, the states are joined with join_sep before being returned.
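The generation step can be sketched as a weighted random walk over a transition matrix like the one shown in "add_sample()". Both weighted_pick() and walk_chain() are hypothetical helpers written for illustration, and returning the unjoined states in list context is an assumption; this is not the module's actual code.

```perl
use strict;
use warnings;

# Hypothetical helper: pick a key with probability proportional to its count.
sub weighted_pick {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    my $roll = rand($total);
    my @states = sort keys %$counts;   # sorted so a fixed seed repeats
    for my $state (@states) {
        $roll -= $counts->{$state};
        return $state if $roll < 0;
    }
    return $states[-1];                # guard against rounding
}

# Hypothetical walk: start from the all-null state, stop on a null.
# Assumes every state it reaches has an entry in the matrix.
sub walk_chain {
    my ($matrix, %opt) = @_;
    my ($order, $null, $join_sep) = @opt{qw(order null join_sep)};
    my @window = ($null) x $order;
    my @out;
    while (1) {
        my $next = weighted_pick($matrix->{ join($join_sep, @window) });
        last if $next eq $null;
        push @out, $next;
        shift @window;
        push @window, $next;
    }
    # Assumption: list context returns the states unjoined.
    return wantarray ? @out : join($join_sep, @out);
}

my %matrix = (
    '!*!' => { 'A' => 1 },
    '!*A' => { 'B' => 1 },
    'A*B' => { '!' => 1 },
);
my %opt = (order => 2, null => '!', join_sep => '*');

my $joined = walk_chain(\%matrix, %opt);   # 'A*B'
my @states = walk_chain(\%matrix, %opt);   # ('A', 'B')
```

The example matrix has only one successor per state, so its output does not depend on the random seed.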

SEE ALSO

Algorithm::MarkovChain

AUTHOR

Grant Mathews <gmathews@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2014 by Grant Mathews.

This is free software, licensed under:

  The Artistic License 2.0 (GPL Compatible)