NAME

Devel::Tokenizer::C - Generate C source for fast keyword tokenizer

SYNOPSIS

  use Devel::Tokenizer::C;
  
  my $t = Devel::Tokenizer::C->new( TokenFunc => sub { "return \U$_[0];\n" } );
  
  $t->add_tokens( qw( bar baz ) )->add_tokens( ['for'] );
  $t->add_tokens( [qw( foo )], 'defined DIRECTIVE' );
  
  print $t->generate;

DESCRIPTION

The Devel::Tokenizer::C module provides a small class for creating the essential ANSI C source code for a fast keyword tokenizer.

The generated code is optimized for speed. On the ANSI C keyword set, it is 2-3 times faster than equivalent code generated with the gperf utility.

The above example would print the following C source code:

  switch( tokstr[0] )
  {
    case 'b':
      switch( tokstr[1] )
      {
        case 'a':
          switch( tokstr[2] )
          {
            case 'r':
              if( tokstr[3] == '\0' )
              {                                     /* bar        */
                return BAR;
              }
  
              goto unknown;
  
            case 'z':
              if( tokstr[3] == '\0' )
              {                                     /* baz        */
                return BAZ;
              }
  
              goto unknown;
  
            default:
              goto unknown;
          }
  
        default:
          goto unknown;
      }
  
    case 'f':
      switch( tokstr[1] )
      {
        case 'o':
          switch( tokstr[2] )
          {
  #if defined DIRECTIVE
            case 'o':
              if( tokstr[3] == '\0' )
              {                                     /* foo        */
                return FOO;
              }
  
              goto unknown;
  #endif /* defined DIRECTIVE */
  
            case 'r':
              if( tokstr[3] == '\0' )
              {                                     /* for        */
                return FOR;
              }
  
              goto unknown;
  
            default:
              goto unknown;
          }
  
        default:
          goto unknown;
      }
  
    default:
      goto unknown;
  }

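The generated code therefore contains only the tokenizer's main switch statement; the enclosing function, the token string variable, and the label jumped to for unknown tokens are left to you. Most of the generated code can be configured to fit your application.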
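
Because only the switch is emitted, it has to be placed inside code of your own before it compiles. The following is a minimal sketch of such a wrapper, not part of the module: the names enum token, UNKNOWN and tokenize are assumptions made for illustration, and the switch body is just the bar/baz branch of the output above, pasted in by hand.

```c
/* Hypothetical token codes. BAR and BAZ match what the SYNOPSIS TokenFunc
 * emits ("return BAR;" etc.); UNKNOWN is an assumed fallback value. */
enum token { UNKNOWN = 0, BAR, BAZ };

/* Assumed wrapper: supplies the 'tokstr' variable and the 'unknown' label
 * that the generated switch refers to by default. */
static enum token tokenize(const char *tokstr)
{
  /* ---- begin generated code (bar/baz branch only) ---- */
  switch( tokstr[0] )
  {
    case 'b':
      switch( tokstr[1] )
      {
        case 'a':
          switch( tokstr[2] )
          {
            case 'r':
              if( tokstr[3] == '\0' )
              {                                     /* bar        */
                return BAR;
              }

              goto unknown;

            case 'z':
              if( tokstr[3] == '\0' )
              {                                     /* baz        */
                return BAZ;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }
  /* ---- end generated code ---- */

unknown:
  return UNKNOWN;
}
```

With this wrapper, tokenize("baz") yields BAZ, while tokenize("ba") and tokenize("bazz") both end up at the unknown label. In a real build the switch would more likely be written to a file and pulled in with #include than pasted by hand.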

CONFIGURATION

TokenFunc => SUBROUTINE

A reference to a subroutine that returns the C code to emit for each token match. The subroutine receives the token string as its only argument.

This is the default subroutine:

  TokenFunc => sub { "return $_[0];\n" }

TokenString => STRING

Identifier of the C character array that contains the token string. The default is tokstr.

UnknownLabel => STRING

Label that should be jumped to via goto if there's no keyword matching the token. The default is unknown.

TokenEnd => STRING

Character that defines the end of each token. The default is the null character '\0'.
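
To illustrate how TokenString, UnknownLabel and TokenEnd fit together, here is a hand-written C sketch of a tokenizer for the keys of key=value lines. It assumes a configuration along the lines of TokenString => 'p', UnknownLabel => 'not_keyword' and TokenEnd => "'='", plus a TokenFunc that emits "return KEY_<NAME>;". All names (enum key, classify_key, the KEY_ codes) are hypothetical, and the code was not produced by the module, so the generator's actual layout may differ.

```c
/* Hypothetical token codes for two configuration keys. */
enum key { KEY_NONE = 0, KEY_HOST, KEY_PORT };

/*
 * Hand-written sketch, NOT actual generator output.
 * 'p' plays the role of TokenString, 'not_keyword' the role of
 * UnknownLabel, and '=' the role of TokenEnd.
 */
static enum key classify_key(const char *p)
{
  switch( p[0] )
  {
    case 'h':
      /* && short-circuits, so nothing past a '\0' is ever read */
      if( p[1] == 'o' && p[2] == 's' && p[3] == 't' && p[4] == '=' )
      {                                     /* host       */
        return KEY_HOST;
      }

      goto not_keyword;

    case 'p':
      if( p[1] == 'o' && p[2] == 'r' && p[3] == 't' && p[4] == '=' )
      {                                     /* port       */
        return KEY_PORT;
      }

      goto not_keyword;

    default:
      goto not_keyword;
  }

not_keyword:
  return KEY_NONE;
}
```

Here a key only matches when it is immediately followed by '=', so "port=8080" is recognized but "port" on its own and "ports=1" are not.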

CaseSensitive => 0 | 1

Boolean defining whether the generated tokenizer should be case sensitive or not. This will only affect the letters A-Z. The default is 1, so the generated tokenizer is case sensitive.
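
One plausible shape for a case-insensitive tokenizer is to accept both the upper-case and the lower-case form of each letter at every position. The sketch below is hand-written under that assumption for the two tokens on and off; it is not output of the module, and the names enum onoff, parse_switch and the SW_ codes are hypothetical.

```c
/* Hypothetical token codes. */
enum onoff { SW_UNKNOWN = 0, SW_ON, SW_OFF };

/* Hand-written sketch of a case-insensitive matcher for "on" and "off".
 * The real generator may arrange the case labels differently. */
static enum onoff parse_switch(const char *tokstr)
{
  switch( tokstr[0] )
  {
    case 'O':
    case 'o':
      switch( tokstr[1] )
      {
        case 'F':
        case 'f':
          if( ( tokstr[2] == 'F' || tokstr[2] == 'f' ) &&
              tokstr[3] == '\0' )
          {                                 /* off        */
            return SW_OFF;
          }

          goto unknown;

        case 'N':
        case 'n':
          if( tokstr[2] == '\0' )
          {                                 /* on         */
            return SW_ON;
          }

          goto unknown;

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }

unknown:
  return SW_UNKNOWN;
}
```

With this matcher, "on", "ON" and "On" all map to SW_ON and "oFf" maps to SW_OFF, while "of" and "onn" fall through to the unknown label.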

ADDING TOKENS

You can add tokens using the add_tokens method.

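The method takes either a list of token strings, or a reference to an array of token strings optionally followed by a preprocessor directive string. When a directive is given, the code for those tokens is wrapped in a matching #if ... #endif block, as for foo in the example above.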

Calls to add_tokens can be chained together, as the method returns a reference to its object.
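
The effect of the directive string shows up only when the generated code is compiled. The sketch below embeds the 'f' branch of the example output in a hypothetical wrapper (enum f_token, F_UNKNOWN and tokenize_f are assumed names) and defines DIRECTIVE at the top of the file so that the foo case is compiled in.

```c
/* Remove this line (and do not pass -DDIRECTIVE) to compile "foo" out. */
#define DIRECTIVE

/* Hypothetical token codes; FOO and FOR match the SYNOPSIS TokenFunc. */
enum f_token { F_UNKNOWN = 0, FOO, FOR };

/* Assumed wrapper around the 'f' branch of the generated switch. */
static enum f_token tokenize_f(const char *tokstr)
{
  switch( tokstr[0] )
  {
    case 'f':
      switch( tokstr[1] )
      {
        case 'o':
          switch( tokstr[2] )
          {
#if defined DIRECTIVE
            case 'o':
              if( tokstr[3] == '\0' )
              {                                     /* foo        */
                return FOO;
              }

              goto unknown;
#endif /* defined DIRECTIVE */

            case 'r':
              if( tokstr[3] == '\0' )
              {                                     /* for        */
                return FOR;
              }

              goto unknown;

            default:
              goto unknown;
          }

        default:
          goto unknown;
      }

    default:
      goto unknown;
  }

unknown:
  return F_UNKNOWN;
}
```

With DIRECTIVE defined, both "foo" and "for" are recognized. Without it, the preprocessor drops the foo case and "foo" reaches the default branch like any other unknown string.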

GENERATING THE CODE

The generate method will return a string with the tokenizer switch statement. If no tokens were added, it will return an empty string.

AUTHOR

Marcus Holland-Moritz <mhx@cpan.org>

BUGS

I hope none, since the code is pretty short. Perhaps lack of functionality ;-)

COPYRIGHT

Copyright (c) 2003, Marcus Holland-Moritz. All rights reserved. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.