package HTTP::Proxy::BodyFilter::htmlparser;
$HTTP::Proxy::BodyFilter::htmlparser::VERSION = '0.304';
use strict;
use Carp;
use HTTP::Proxy::BodyFilter;
use vars qw( @ISA );
@ISA = qw( HTTP::Proxy::BodyFilter );
sub init {
croak "First parameter must be a HTML::Parser object"
unless eval { $_[1]->isa('HTML::Parser') }; # eval guards against undef or unblessed arguments
my $self = shift;
$self->{_parser} = shift;
my %args = (@_);
$self->{rw} = delete $args{rw};
}
sub filter {
my ( $self, $dataref, $message, $protocol, $buffer ) = @_;
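# reset the output and expose the current message and protocol to the handlers
@{ $self->{_parser} }{qw( output message protocol )} =
( "", $message, $protocol );
$self->{_parser}->parse($$dataref);
$self->{_parser}->eof if not defined $buffer; # last chunk
# a read-write filter replaces the chunk with what the handlers produced
$$dataref = $self->{_parser}{output} if $self->{rw};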
}
sub will_modify { $_[0]->{rw} }
1;
__END__
=head1 NAME
HTTP::Proxy::BodyFilter::htmlparser - Filter using HTML::Parser
=head1 SYNOPSIS
use HTTP::Proxy::BodyFilter::htmlparser;
# $parser is an HTML::Parser object
$proxy->push_filter(
mime => 'text/html',
response => HTTP::Proxy::BodyFilter::htmlparser->new( $parser )
);
=head1 DESCRIPTION
L<HTTP::Proxy::BodyFilter::htmlparser> lets you create a
filter based on the L<HTML::Parser> object of your choice.
This filter takes an L<HTML::Parser> object as the first argument to its constructor.
The filter is either read-only or read-write. A read-only filter cannot
change the data on the fly: the handlers see the body, but the proxy
forwards it unmodified.
With a read-write filter, you B<must> recreate the whole body data. This
is mainly because L<HTML::Parser> has its own buffering system, so there
is no easy way to correlate the data that triggered an L<HTML::Parser>
event with its original position in the chunk sent by the
origin server. See below for details.
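
The buffering issue can be seen with L<HTML::Parser> alone. The following
is a minimal sketch, not part of this module: the chunk boundaries and the
C<@starts> array are made up for illustration. A start tag split across two
chunks triggers its event only while the second chunk is parsed, so the
handler cannot tell where the tag began in the data passed to C<filter()>.

    use strict;
    use warnings;
    use HTML::Parser;

    my @starts;    # tag names reported so far (illustrative)
    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ sub { push @starts, $_[0] }, 'tagname' ],
    );

    # the first chunk ends in the middle of a tag: nothing is reported yet
    $parser->parse('<p cla');
    my $seen_after_first = @starts;

    # the rest of the tag arrives with the next chunk: the event fires now
    $parser->parse('ss="x">Hello');
    my $seen_after_second = @starts;

    $parser->eof;
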
Note that a simple filter that modifies only the HTML text (not the tags) can
be created more easily with L<HTTP::Proxy::BodyFilter::htmltext>.
=head2 Creating an HTML::Parser that rewrites pages
A read-write filter is declared by passing C<rw =E<gt> 1> to the constructor:
HTTP::Proxy::BodyFilter::htmlparser->new( $parser, rw => 1 );
To be able to modify the body of a message, a filter created with
L<HTTP::Proxy::BodyFilter::htmlparser> must rewrite it completely. It
does so through a special attribute of the L<HTML::Parser> object named
C<output>. Each handler that contributes to the new body must request
C<self> in its argspec (that is to say, receive the parser itself as an
argument) and append to the parser's C<output> key.
The following attributes are added to the L<HTML::Parser> object by this filter:
=over 4
=item output
A string that will hold the data sent back by the proxy.
This string will be used as a replacement for the body data only
if the filter is read-write, that is to say, if it was initialised with
C<rw =E<gt> 1>.
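Data should always be B<appended> to C<$parser-E<gt>{output}>: the filter
resets it to an empty string before parsing each chunk, and a single chunk
may trigger several handlers.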
=item message
A reference to the L<HTTP::Message> that triggered the filter.
=item protocol
A reference to the L<LWP::Protocol> object.
=back
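
As a sketch of how these attributes are used, the parser below drops
C<E<lt>blinkE<gt>> tags and copies everything else to C<output>. The choice
of tag and the handler names are arbitrary. The reset of C<output> and the
calls to C<parse()> and C<eof()> imitate what the filter does for a body
made of a single chunk, so that the snippet runs without a proxy.

    use strict;
    use warnings;
    use HTML::Parser;

    # copy the original source text of an event to the output
    sub copy_text {
        my ( $parser, $text ) = @_;
        $parser->{output} .= $text if defined $text;
    }

    # copy start and end tags, except <blink> and </blink>
    sub copy_tag {
        my ( $parser, $tagname, $text ) = @_;
        $parser->{output} .= $text unless $tagname eq 'blink';
    }

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ \&copy_tag,  'self,tagname,text' ],
        end_h       => [ \&copy_tag,  'self,tagname,text' ],
        default_h   => [ \&copy_text, 'self,text' ],
    );

    # what the filter does for a body made of a single chunk
    $parser->{output} = '';
    $parser->parse('<p>Hello <blink>world</blink>!</p>');
    $parser->eof;
    my $rewritten = $parser->{output};

In a proxy, the same parser would be passed to the constructor of this
filter together with C<rw =E<gt> 1>.
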
=head1 METHODS
This filter defines three methods, called automatically:
=over 4
=item filter()
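The C<filter()> method handles all the interactions with the L<HTML::Parser>
object: it feeds each chunk of the body to the parser, signals the end of
the document on the last chunk and, for a read-write filter, replaces the
chunk with the parser's C<output>.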
=item init()
Initialise the filter with the L<HTML::Parser> object and the optional
C<rw> flag passed to the constructor.
=item will_modify()
This method returns a boolean value that tells the proxy whether the
filter will modify the data passing through. It is the value of
the C<rw> parameter passed to the constructor.
=back
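
The proxy normally calls these methods itself. To try a parser outside a
proxy, they can also be driven by hand, as in this sketch. It assumes that
the last chunk is signalled by an undefined C<$buffer> argument, as the code
of C<filter()> expects. The chunk contents are arbitrary, and no message or
protocol object is supplied.

    use strict;
    use warnings;
    use HTML::Parser;
    use HTTP::Proxy::BodyFilter::htmlparser;

    # identity rewrite: every event appends its source text to the output
    my $parser = HTML::Parser->new(
        api_version => 3,
        default_h   => [ sub { $_[0]{output} .= $_[1] }, 'self,text' ],
    );

    # new() hands its arguments over to init()
    my $filter = HTTP::Proxy::BodyFilter::htmlparser->new( $parser, rw => 1 );

    my @chunks = ( '<p>Hel', 'lo</p', '>' );
    my $body   = '';
    while (@chunks) {
        my $data   = shift @chunks;
        my $buffer = '';

        # pass undef instead of a buffer reference for the last chunk
        $filter->filter( \$data, undef, undef, @chunks ? \$buffer : undef );

        # $data now holds what the handlers produced for this chunk
        $body .= $data;
    }

Whatever the parser keeps buffered at the end of one chunk comes out with a
later one, so only the concatenation of all the filtered chunks is meaningful.
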
=head1 SEE ALSO
L<HTTP::Proxy>, L<HTTP::Proxy::BodyFilter>,
L<HTTP::Proxy::BodyFilter::htmltext>.
=head1 AUTHOR
Philippe "BooK" Bruhat, E<lt>book@cpan.orgE<gt>.
=head1 COPYRIGHT
Copyright 2003-2015, Philippe Bruhat.
=head1 LICENSE
This module is free software; you can redistribute it or modify it under
the same terms as Perl itself.
=cut