The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Unicode::SetAutomaton - UTF-8 based DFAs and Regexps from Unicode sets

SYNOPSIS

  use Unicode::SetAutomaton;
  use Set::IntSpan;

  my $set = Set::IntSpan->new([[0x000000, 0x10FFFF]]);
  my $dfa = Unicode::SetAutomaton->new(classes => [$set]);
  print $dfa->as_expressions;

DESCRIPTION

This module takes sets of Unicode characters and turns them into UTF-8 based minimal deterministic finite automata and regular expressions. Applications include making byte-oriented regular expression and finite automata tools compatible with Unicode input, and possibly performance optimizations.

The author's motivation in writing this module was a search for a fast method to associate characters in UTF-8 encoded strings with character classes. Ignoring memory access performance, automata produced by this module offer a simple, general, near-optimal solution to this problem.

METHODS

Unicode::SetAutomaton->new(classes => [$set1, $set2, ...])

Creates a new automaton from the Set::IntSpan objects. Classes must by in the Unicode range U+0000 .. U+10FFFF. Surrogate code points may be included but have no effect on the output as the definition of UTF-8 disallows their encoding. Sets should not be empty.

The data of the automaton can be accessed through the returned hash reference. The following keys are available:

start_state

The automaton's start state.

transitions

An array of four-tuples representing a transition. The tuples are encoded as array reference. The items are: source state, first byte, last byte, destination state. The automaton moves from the source state to the destination state iff $input_byte >= $first and $input_byte <= $last.

The set of transitions omits illegal moves, if a character is not in any of the input sets, or if the input is not a valid UTF-8 sequence, then that is indicated by the absence of corresponding transitions.

Accepting states do not have outgoing transitions, meaning the automaton will never accept more than a single code point. Copying the transitions of the start state to each accepting state yields an automaton for strings.

For instance, the transitions for the automaton that accepts any valid single-character UTF-8 sequence, as generated by the code in the synopsis, might look like this:

  [ 0, 0x00, 0x7F, 1 ], [ 0, 0xC2, 0xDF, 2 ],
  [ 0, 0xE0, 0xE0, 3 ], [ 0, 0xE1, 0xEC, 4 ],
  [ 0, 0xED, 0xED, 5 ], [ 0, 0xEE, 0xEF, 4 ],
  [ 0, 0xF0, 0xF0, 6 ], [ 0, 0xF1, 0xF3, 7 ],
  [ 0, 0xF4, 0xF4, 8 ], [ 2, 0x80, 0xBF, 1 ],
  [ 3, 0xA0, 0xBF, 2 ], [ 4, 0x80, 0xBF, 2 ],
  [ 5, 0x80, 0x9F, 2 ], [ 6, 0x90, 0xBF, 4 ],
  [ 7, 0x80, 0xBF, 4 ], [ 8, 0x80, 0x8F, 4 ],
input_classes

An array reference of Set::IntSpan objects, exactly as they have been passed to the constructor.

disjoint_classes

An array reference of Set::IntSpan objects. This is the smallest set of classes such that each character in the input classes belongs to exactly one class. Each object is a subset of an input class.

state_to_disjoint

A hash reference that maps a state to one of the disjoint character classes. If there is such mapping, the state is an accepting state, otherwise it is not in an accepting state.

disjoint_to_state

An array reference that maps a disjoint class to a state. Mapping between disjoint classes and states is bijective, so this holds the same information as the state_to_disjoint hash reference.

disjoint_to_input

An array reference that maps a disjoint class to one or more input classes. For example:

  if (exists $dfa->{state_to_disjoint}->{$state}) {
    my $dset = $dfa->{state_to_disjoint}->{$state};
    my $sets = $dfa->{disjoint_to_input}->[$dset];
    print "The matched code point is in these sets: ",
      join ',', @$sets;
  }

Practical applications will likely have to encode the automaton in a different form. For example, using a binary search as transition function the list of transitions would have to be sorted by source state and range, and it may be beneficial to rename accepting states so that the state numbers corresponds to the number of the disjoint classes.

Applications that encode the transition information in arrays should note that if the input is known to be valid UTF-8, perhaps by using a separate DFA that ensures just that in parrallel, only the start state has transitions from most bytes. Other states move only over bytes in the range 0x80 .. 0xBF. So to save space, two arrays could be used.

as_expressions()

Returns a list of regular expressions, one for each disjoint class, in the order of the disjoint classes. For instance, the code in the synopsis might print (wrapped to meet length restrictions):

  [\x00-\x7f]|([\xc2-\xdf]|\xed[\x80-\x9f]|
  ([\xe1-\xec\xee-\xef]|\xf4[\x80-\x8f]|
  [\xf1-\xf3][\x80-\xbf]|\xf0[\x90-\xbf]
  )[\x80-\xbf]|\xe0[\xa0-\xbf])[\x80-\xbf]

AUTHOR / COPYRIGHT / LICENSE

  Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>.
  This module is licensed under the same terms as Perl itself.