=head1 NAME

C::Tokenize - reduce a C file to a series of tokens

=head1 SYNOPSIS

    # Remove all C preprocessor instructions from a C program:
    use C::Tokenize '$cpp_re';
    $c =~ s/$cpp_re//g;

    # Print all the comments in a C program:
    use C::Tokenize '$comment_re';
    while ($c =~ /($comment_re)/g) {
        print "$1\n";
    }

=head1 DESCRIPTION

This module provides a tokenizer which breaks C source code into its
smallest meaningful components, and the regular expressions which
match each of these components. For example, the module supplies a
regular expression L</$comment_re> which matches a C comment line.

=head1 REGULAR EXPRESSIONS

The following regular expressions can be imported from this module
using, for example,

    use C::Tokenize '$cpp_re';

to import C<$cpp_re>.

None of the following regular expressions does any capturing. If you
want to capture, add your own parentheses around the regular
expression.
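
As a minimal sketch of adding your own parentheses, the following
collects every comment found in a string of C code. The sample input
and the variable names are illustrative assumptions, not part of the
module.

    use strict;
    use warnings;
    use C::Tokenize '$comment_re';

    # Hypothetical sample input.
    my $c = "/* first */\nint x; // second\n";

    my @comments;
    # The outer parentheses capture the whole match in $1. The /g flag
    # makes each iteration resume after the previous match; without it
    # the loop would match the first comment forever.
    while ($c =~ /($comment_re)/g) {
        push @comments, $1;
    }
    print scalar (@comments), " comments found\n";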

=over

=item $trad_comment_re

Match C</* */> comments.

=item $cxx_comment_re

Match C<//> comments.

=item $comment_re

Match both C</* */> and C<//> comments.

=item $cpp_re

Match a C preprocessor instruction.

=item $char_const_re

Match a character constant, such as C<'a'> or C<'\-'>.

=item $operator_re

Match an operator such as C<+> or C<-->.

=item $number_re

Match a number, either integer, floating point, or hexadecimal. Octal
numbers are not yet supported.

=item $word_re

Match a word, such as a function or variable name or a keyword of the
language.

=item $grammar_re

Match other syntactic characters such as C<{> or C<[>.

=item $single_string_re

Match a single C string constant such as C<"this">.

=item $string_re

Match a full-blown C string constant, including compound strings
C<"like" "this">.

=item $reserved_re

Match a C reserved word like C<auto> or C<goto>.

=item $include_local

Match an include statement which uses double quotes, like C<#include "some.c">.

=back

=head1 VARIABLES

=head2 @fields

The array C<@fields> contains the names of all the fields which may be
extracted by L</tokenize>.
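
As a rough sketch of one use for C<@fields>, the following counts the
tokens of each type in a short piece of C. It assumes that
L</tokenize> accepts the C source as a string and that each token's
C<type> key holds one of the names in C<@fields>. The sample input is
made up, and C<tokenize> is called by its fully qualified name so that
the sketch does not rely on the function being exportable.

    use strict;
    use warnings;
    use C::Tokenize '@fields';

    # Hypothetical sample input.
    my $c = "int x = 42; /* answer */\n";

    # Tally the tokens by type.
    my %count;
    for my $token (@{C::Tokenize::tokenize ($c)}) {
        $count{$token->{type}}++;
    }

    # Report every field, including those which did not occur.
    for my $field (@fields) {
        printf "%-10s %d\n", $field, $count{$field} // 0;
    }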

=head1 FUNCTIONS

=head2 decomment

    my $out = decomment ('/* comment */');
    # $out = " comment ";

Remove the traditional C comment marks C</*> and C<*/> from the
beginning and end of a string, leaving only the comment contents. The
string has to begin and end with comment marks.
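
A small sketch combining C<decomment> with L</$trad_comment_re> to
pull the text out of each traditional comment in a piece of C. The
sample input is hypothetical, and C<decomment> is called by its fully
qualified name so that the sketch does not rely on the function being
exportable.

    use strict;
    use warnings;
    use C::Tokenize '$trad_comment_re';

    # Hypothetical sample input.
    my $c = "int x; /* the count */\nint y; /* the total */\n";

    my @contents;
    # Each match begins with /* and ends with */, as decomment requires.
    while ($c =~ /($trad_comment_re)/g) {
        push @contents, C::Tokenize::decomment ($1);
    }
    print "[$_]\n" for @contents;

As in the example above, the spaces inside the comment marks are kept,
so trim them yourself if they are not wanted.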

=head2 tokenize

    my $tokens = tokenize ($file);

Convert C<$file> into a series of tokens. The return value is an array
reference which contains hash references. Each hash reference
corresponds to one token in the C file. Each token contains the
following keys:

=over

=item leading

Any whitespace which comes before the token (called "leading
whitespace").

=item type

The type of the token, which may be one of the following:

=over

=item comment

A comment, like

    /* This */

or

    // this.

=item cpp

A C preprocessor instruction like

    #define THIS 1

or

    #include "That.h"

=item char_const

A character constant, like C<'\0'> or C<'a'>.

=item grammar

A piece of C "grammar", like C<{> or C<]> or C<< -> >>.

=item number

A number such as C<42>.

=item word

A word, which may be a variable name or a function name.

=item string

A string, like C<"this">, or even C<"like" "this">.

=item reserved

A C reserved word, like C<auto> or C<goto>.

=back

All of the fields which may be captured are available in the variable
L</@fields> which can be exported from the module:

    use C::Tokenize '@fields';

=item $name

The value of the token, stored under a key named after its type. For
example, if C<< $token->{type} >> equals 'comment', then the text of
the comment is in C<< $token->{comment} >>.

    if ($token->{type} eq 'string') {
        my $c_string = $token->{string};
    }

=item line

The line number of the C file where the token occurred. For a
multi-line comment or preprocessor instruction, the line number refers
to the final line.

=back
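
The keys above can be put together as in the following sketch, which
prints one line per token. It assumes that C<tokenize> accepts the C
source as a string and that the token's text is stored under the key
named by C<type>. The sample input and the output layout are
illustrative only, and C<tokenize> is called by its fully qualified
name so that the sketch does not rely on the function being
exportable.

    use strict;
    use warnings;
    use C::Tokenize;

    # Hypothetical sample input.
    my $c = "/* Add one. */\nint add_one (int x)\n{\n    return x + 1;\n}\n";

    my $tokens = C::Tokenize::tokenize ($c);
    for my $token (@$tokens) {
        my $type = $token->{type};
        # The text of the token is stored under the name of its type.
        my $value = $token->{$type};
        # Keep multi-line tokens on one line of output.
        $value =~ s/\n/\\n/g;
        printf "%3d %-10s %s\n", $token->{line}, $type, $value;
    }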

=head1 EXPORTS

    use C::Tokenize ':all';

exports all the regular expressions from the module.
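
As a short sketch, importing C<:all> makes several of the regular
expressions available at once, here to strip both comments and
preprocessor instructions from a string of C. The sample input is
hypothetical.

    use strict;
    use warnings;
    use C::Tokenize ':all';

    # Hypothetical sample input.
    my $c = "#define N 10\nint a[N]; /* buffer */\nchar *s = \"text\";\n";

    # Work on a copy so that the original text is kept.
    my $stripped = $c;
    $stripped =~ s/$comment_re//g;
    $stripped =~ s/$cpp_re//g;
    print $stripped;

A plain substitution like this does not know about string constants,
so comment-like text inside a C string is removed as well. Use
L</tokenize> when that distinction matters.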

=head1 SEE ALSO

=over

=item 

The regular expressions contained in this module are shown at this web
page: L<http://www.lemoda.net/c/c-regex/index.html>.

=back

=head1 BUGS

=over

=item Octal not parsed

It does not parse octal expressions.

=item No trigraphs

No handling of trigraphs.

=item Requires Perl 5.10

This module uses named captures in regular expressions, so it requires
Perl 5.10 or later.

=item No line directives

The line numbers provided by L</tokenize> do not respect C C<#line>
directives.

=item Insufficient tests

The module has been used somewhat, but the included tests do not
exercise many of the features of C.

=back


=head1 AUTHOR

Ben Bullock, <bkb@cpan.org>

=head2 Request

If you'd like to see this module continued, let me know that you're
using it. For example, send an email, write a bug report, star the
project's GitHub repository, add a patch, add a C<++> on metacpan.org,
or write a rating at CPAN ratings. It really does make a
difference. Thanks.

=head1 COPYRIGHT & LICENCE

This package and associated files are copyright (C) 
-2015
Ben Bullock.

You can use, copy, modify and redistribute this package and associated
files under the Perl Artistic Licence or the GNU General Public
Licence.



=head1 TERMINOLOGY

This defines the terminology used in this document.

=over

=item Convenience function

In this document, a "convenience function" indicates a function which
solves some of the problems, some of the time, for some of the people,
but which may not be good enough for all envisaged uses. A convenience
function is an 80/20 solution, something which solves (about) 80% of
the problems with 20% of the effort. It does the obvious things, but
may not do everything you might want. It is a time-saver for the most
basic use cases.

=item BUGS

In this document, the section BUGS describes possible deficiencies,
problems, and workarounds with the module. It's not a guide to bug
reporting, or even a list of actual bugs. The name "BUGS" is the
traditional name for this sort of section in a Unix manual page.

=back