=head1 NAME
C::Tokenize - reduce a C file to a series of tokens
=head1 SYNOPSIS
# Remove all C preprocessor instructions from a C program:
use C::Tokenize '$cpp_re';
$c =~ s/$cpp_re//g;
# Print all the comments in a C program:
use C::Tokenize '$comment_re';
while ($c =~ /($comment_re)/) {
print "$1\n";
}
=head1 DESCRIPTION
This module provides a tokenizer which breaks C source code into its
smallest meaningful components, and the regular expressions which
match each of these components. For example, the module supplies a
regular expression L</$comment_re> which matches a C comment line.
=head1 REGULAR EXPRESSIONS
The following regular expressions can be imported from this module
using, for example,
use C::Tokenize '$cpp_re'
to import C<$cpp_re>.
None of the following regular expressions does any capturing. If you
want to capture, add your own parentheses around the regular
expression.
=over
=item $trad_comment_re
Match C</* */> comments.
=item $cxx_comment_re
Match C<//> comments.
=item $comment_re
Match both C</* */> and C<//> comments.
=item $cpp_re
Match a C preprocessor instruction.
=item $char_const_re
Match a character constant, such as C<'a'> or C<'\-'>.
=item $operator_re
Match an operator such as C<+> or C<-->.
=item $number_re
Match a number, either integer, floating point, or hexadecimal. Does
not do octal yet.
=item $word_re
Match a word, such as a function or variable name or a keyword of the
language.
=item $grammar_re
Match other syntactic characters such as C<{> or C<[>.
=item $single_string_re
Match a single C string constant such as C<"this">.
=item $string_re
Match a full-blown C string constant, including compound strings
C<"like" "this">.
=item $reserved_re
Match a C reserved word like C<auto> or C<goto>.
=item $include_local
Match an include statement which uses double quotes, like C<#include "some.c">.
=back
=head1 VARIABLES
=head2 @fields
@Fields contains a list of all the fields which are extracted by
L</tokenize>.
=head1 FUNCTIONS
=head2 decomment
my $out = decomment ('/* comment */');
# $out = " comment ";
Remove the traditional C comment marks C</*> and C<*/> from the
beginning and end of a string, leaving only the comment contents. The
string has to begin and end with comment marks.
=head2 tokenize
my $tokens = tokenize ($file);
Convert C<$file> into a series of tokens. The return value is an array
reference which contains hash references. Each hash reference
corresponds to one token in the C file. Each token contains the
following keys:
=over
=item leading
Any whitespace which comes before the token (called "leading
whitespace").
=item type
The type of the token, which may be
=over
=item comment
A comment, like
/* This */
or
// this.
=item cpp
A C preprocessor instruction like
#define THIS 1
or
#include "That.h".
=item char_const
A character constant, like C<'\0'> or C<'a'>.
=item grammar
A piece of C "grammar", like C<{> or C<]> or C<< -> >>.
=item number
A number such as C<42>,
=item word
A word, which may be a variable name or a function.
=item string
A string, like C<"this">, or even C<"like" "this">.
=item reserved
A C reserved word, like C<auto> or C<goto>.
=back
All of the fields which may be captured are available in the variable
L</@fields> which can be exported from the module:
use C::Tokenize '@fields';
=item $name
The value of the type. For example, if C<< $token->{name} >> equals
'comment', then the value of the type is in , C<< $token->{comment} >>.
if ($token->{name} eq 'string') {
my $c_string = $token->{string};
}
=item line
The line number of the C file where the token occured. For a
multi-line comment or preprocessor instruction, the line number refers
to the final line.
=back
=head1 EXPORTS
use C::Tokenize ':all';
exports all the regular expressions from the module.
=head1 SEE ALSO
=over
=item
The regular expressions contained in this module are shown at this web
page: L<http://www.lemoda.net/c/c-regex/index.html>.
=back
=head1 BUGS
=over
=item Octal not parsed
It does not parse octal expressions.
=item No trigraphs
No handling of trigraphs.
=item Requires Perl 5.10
This module uses named captures in regular expressions, so it requires
Perl 5.10 or more.
=item No line directives
The line numbers provided by L</tokenize> do not respect C line
directives.
=item Insufficient tests
The module has been used somewhat, but the included tests do not
exercise many of the features of C.
=back
=head1 AUTHOR
Ben Bullock, <bkb@cpan.org>
=head2 Request
If you'd like to see this module continued, let me know that you're
using it. For example, send an email, write a bug report, star the
project's github repository, add a patch, add a C<++> on Metacpan.org,
or write a rating at CPAN ratings. It really does make a
difference. Thanks.
=head1 COPYRIGHT & LICENCE
This package and associated files are copyright (C)
-2015
Ben Bullock.
You can use, copy, modify and redistribute this package and associated
files under the Perl Artistic Licence or the GNU General Public
Licence.
=head1 TERMINOLOGY
This defines the terminology used in this document.
=over
=item Convenience function
In this document, a "convenience function" indicates a function which
solves some of the problems, some of the time, for some of the people,
but which may not be good enough for all envisaged uses. A convenience
function is an 80/20 solution, something which solves (about) 80% of
the problems with 20% of the effort. Something which does the obvious
things, but may not do all the things you might want, a time-saver for
the most basic usage cases.
=item BUGS
In this document, the section BUGS describes possible deficiencies,
problems, and workarounds with the module. It's not a guide to bug
reporting, or even a list of actual bugs. The name "BUGS" is the
traditional name for this sort of section in a Unix manual page.
=back