The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.
#! /usr/bin/perl -w
#
# debugging - strategies for dealing with assembled patterns that
# cannot be compiled. Usual message is
# "Unmatched ( in regex; marked by <-- HERE in m/a( <-- HERE"
#
# Copyright (C) David Landgren 2005

use strict;
use Getopt::Std;
use Regexp::Assemble;

use vars '$VERSION';
$VERSION = '0.1';

getopts( 'v', \my %opt );

print "$VERSION\n" and exit if $opt{v};

my $r = Regexp::Assemble->new;

while( <> ) {
    chomp;

    # take a copy of the assembly and add the next pattern
    my $clone = $r->clone;
    $clone->add($_);

    # if we compile the regexp, does it blow up?
    eval { my $regexp = $clone->re };

    # it blew up
    if( $@ ) {
        report();
    }
    else {
        # ok, now add it for real to the assembly
        $r->add( $_ );
    }
}

sub report {
    my $lsqb = $_ =~ tr/[/[/;
    my $rsqb = $_ =~ tr/]/]/;
    my $lpar = $_ =~ tr/(/(/;
    my $rpar = $_ =~ tr/)/)/;

    warn <<WARNING;
$ARGV($.): [$_]
counts: [=$lsqb ]=$rsqb (=$lpar )=$rpar
The above line causes the assembly to fail. Look for unbalanced parens or
brackets, or counts greater than 1. If the line looks ok, it could be
because of an interaction with a previously encountered line (try reversing
the order of the lines to find the other line).

WARNING
}

=head1 NAME

debugging - a strategy for dealing with uncompilable regexps

=head1 SYNOPSIS

B<debugging> file-of-patterns

=head1 OPTIONS

=over 5

=item B<-v>

prints out the version of script.

=back

=head1 DESCRIPTION

By default, C<Regexp::Assemble> uses a naive strategy for chopping
up input. It can get upset if you try to feed it a pattern with
nested parentheses. For instance, C<a(b(c))> and C<a(b(d))> will
cause it to fail. If you really want to do that, you'll have to
come up with a regexp that can chop the above into C<a>, C<(b(c))>
and C<a>, C<(b(d))>.

Of course, that will only buy you C<a(?:(b(c))|(b(d)))> at the
present time. Teaching it to produce C<a(b([cd]))> would be
considerably more complex. Patches welcome.

So, when adding hundreds of patterns, it's not always obvious
figuring out a way of isolating the offending pattern, especially
when they are coming out of a file or some other process you don't
have control over.

The trick, then, consists of adding a pattern one at a time, and
then standing back and watching whether producing a regexp from the
list of patterns seen so far will cause things to blow up. So you
take a copy of the assembly, and add the pattern to the copy. If
that works, then add it to the real assembly. Repeat until done.

You don't want to do this all the time, because it is of course
much less efficient.

=head1 SEE ALSO

L<Regexp::Assemble>

=head1 AUTHOR

Copyright (C) 2005 David Landgren. All rights reserved.

=head1 LICENSE

This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.