The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Lingua::Anagrams - pure Perl anagram finder

VERSION

version 0.019

SYNOPSIS

  use v5.10;
  use Lingua::Anagrams;
  
  open my $fh, '<', 'words.txt' or die "Aargh! $!";         # some 100,000 words
  my @words = map { ( my $w = $_ ) =~ s/\W+//g; $w } <$fh>;
  close $fh;
  
  my @enormous = grep { length($_) > 6 } @words;
  my @huge     = grep { length($_) == 6 } @words;
  my @big      = grep { length($_) == 5 } @words;
  my @medium   = grep { length($_) == 4 } @words;
  my @small    = grep { length($_) == 3 } @words;
  my @tiny     = grep { length($_) < 3 } @words;
  
  my $anagramizer = Lingua::Anagrams->new(
      [ \@enormous, \@huge, \@big, \@medium, \@small, \@tiny ],
      limit => 30 );
  
  my $t1 = time;
  my @anagrams =
    $anagramizer->anagrams( 'Ada Hyacinth Melton-Houghton', sorted => 1, min => 100 );
  my $t2 = time;
  
  say join ' ', @$_ for @anagrams;
  say '';
  say scalar(@anagrams) . ' anagrams';
  say 'it took ' . ( $t2 - $t1 ) . ' seconds';
  
  say "\nnow for a random sample\n";
  my $i = $anagramizer->iterator( 'Ada Hyacinth Melton-Houghton', random => 1 );
  say join ' ', @{ $i->() };

Giving you

  ...
  manned ohioan thatch toughly
  menial noonday thatch though
  monthly ohioan thatch unaged
  moolah nighty notched utahan
  moolah tannin though yachted
  moolah thatch toeing unhandy
  
  1582 anagrams
  it took 129 seconds
  
  now for a random sample
  
  noumenal acanthi doth thy hog

DESCRIPTION

Lingua::Anagrams constructs tries out of a lists of words you give it. It then uses these tries to find all the anagrams of a phrase you give to its anagrams method. A dynamic programming algorithm is used to accelerate the search at the cost of memory. See new for how one may modify this algorithm.

Be aware that the anagram algorithm has been golfed down pretty far to squeeze more speed out of it. It isn't the prettiest.

METHODS

new

  CLASS->new( $word_list, %params )

Construct a new anagram engine from a word list, or a list of word lists. If you provide multiple word lists, each successive list will be understood as an augmentation of those preceding it.* If you search for the anagrams of a phrase, the algorithm will abandon one list and try the next if it is unable to find sufficient anagrams with the current list. You can use cascading word lists like this to find interesting anagrams of long phrases as well as short ones in a reasonable amount of time. If on the other hand you use only one comprehensive list you will find that long phrases have many millions of anagrams the calculation of which take vast amounts of memory and time. In particular you will want to limit the number of short words in the earlier lists as these multiply the possible anagrams much more quickly.

The optional construction parameters may be provided either as a list of key-value pairs or as a hash reference. The understood parameters are:

limit

The character count limit used by the dynamic programming algorithm to throttle memory consumption somewhat. If you wish to find the anagrams of a very long phrase you may find the caching in the dynamic programming algorithm consumes too much memory. Set this limit lower to protect yourself from memory exhaustion (and slow things down).

The default limit is set by the global $LIMIT variable. It will be 20 unless you tinker with it.

clean

A code reference specifying how text is to be cleaned of extraneous characters and normalized. The default cleaning function is

  sub _clean {
      $_[0] =~ s/\W+//g;
      $_[0] = lc $_[0];
  }

Note that this function, like _clean, must modify its argument directly.

sorted

A boolean. If true, the anagram list will be returned sorted.

min

The minimum number of anagrams to look for. This value is only consulted if the anagram engine has more than one word list. If the first word list returns too few anagrams, the second is applied. If no minimum is provided the effective minimum is one.

* If you provide multiple word lists, note that later lists will be discarded if they do not actually augment what came before. Thus the number of lists that anagramizer considers you to be using may be different from the number you think you are using. See lists.

anagrams

  $self->anagrams( $phrase, %opts )

Returns a list of array references, each reference containing a list of words which together constitute an anagram of the phrase.

Options may be passed in as a list of key value pairs or as a hash reference. The following options are supported at this time:

sorted

As with the constructor option, this determines whether the anagrams are sorted internally and with respect to each other. It overrides the constructor parameter, which provides the default.

min

The minimum number of anagrams to look for. This value is only consulted if the anagram engine has more than one word list. This overrides any value from the constructor parameter min.

start_list

Index of first word list to try. This will be 0 by default. Set it to -1 to use only the largest word list. The bigger the word list you start with, the smaller the words you are likely to get in any particular anagram but also the faster you will fail when no anagrams are possible.

iterator

  $self->iterator( $phrase, %opts )

Generates a code reference one can use to iterate over all the anagrams of a phrase. This iterator will be considerably slower than the anagrams method if you want to fetch all the anagrams of a phrase but considerably faster if your phrase is large and you just want a sample of anagrams. And if your phrase is sufficiently large that there is not sufficient memory and/or time to create the complete anagram list, an iterator is your only option. Iterators are much more memory efficient.

If the anagram engine holds multiple word lists, longer lists are consulted only as necessary.

As with the other methods, the optional %opts may be provided as either a list of key-value pairs or as a hash reference. The understood options are

sorted

If true, anagrams will be internally sorted, though not necessarily relative to each other. In fact, because of how anagrams are gathered, they will tend to be returned in sorted order unless the random parameter is set to true.

random

If true, the anagrams are returned in relatively random order. The order is only relatively random because it will still be the case that longer word lists are only consulted as a last resort.

Note that random iterators, though only slightly slower than non-random iterators, will come to use considerably more memory. This is because they will come to be holding many incompletely used sub-iterators.

start_list

Index of first word list to try. This will be 0 by default. This is the same option as with the anagrams method. It is particularly useful in conjunction with the random option. It will speed up both success and failure.

key

  $self->key($phrase)

Converts a phrase into a key suitable for use in caching anagram lists.

  say $ag->key('box');   # 98:1(11)1(7)1
  say $ag->key('book');  # 98:1(7)1(2)2
  say $ag->key('bag');   # 97:1.1(3)1
  say $ag->key('gab');   # 97:1.1(3)1

A key is just a compressed representation of the character counts in the phrase.

lists

  $self->lists

Returns the number of word lists being used by the anagramizer.

SOME CLEVER BITS

One trick I use to speed things up is to convert all characters to integers immediately. If you're using integers, you can treat arrays as really fast hashes.

The natural way to walk the trie is with recursion, but I use a stack and a loop to speed things up.

I use a jump table to keep track of the actual characters under consideration so when walking the trie I only consider characters that might be in anagrams.

A particular step of anagram generation consists of pulling out all words that can be formed with the current character counts. If a particular character count is not decremented in the formation of any word in a given step we know we've reached a dead end and we should give up.

Similarly, if we do touch every character in a particular step we can collect all the words extracted which touch that character and descend only into the remaining possibilities for those character counts because the other words one might extract are necessarily contained in the remaining character counts.

The dynamic programming bit consists of memoizing the anagram lists keyed to the character counts so we never extract the anagrams for a particular set of counts twice (of course, we have to calculate this key many times, which is not free).

I localize a bunch of variables on the first method call so that thereafter these values can be treated as global. This saves a lot of copying.

After the initial method calls I use functions, which saves a lot of lookup time.

In stack operations I use push and pop in lieu of unshift and shift. The former are more efficient, especially with short arrays.

AUTHOR

David F. Houghton <dfhoughton@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by David F. Houghton.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.