URL::Transform - perform URL transformations in various document types
    use FindBin qw($Bin);
    use URL::Transform;

    my $output;
    my $urlt = URL::Transform->new(
        'document_type'      => 'text/html;charset=utf-8',
        'content_encoding'   => 'gzip',
        'output_function'    => sub { $output .= "@_" },
        'transform_function' => sub { return (join '|', @_) },
    );
    $urlt->parse_file($Bin.'/data/URL-Transform-01.html');

    print "and this is the output: ", $output;
URL::Transform is a generic module for transforming the URLs found in a document. It accepts a callback function that can change each URL.
There are different modules to handle different document types, elements or attributes:
text/html, text/vnd.wap.wml, application/xhtml+xml, application/vnd.wap.xhtml+xml
    URL::Transform::using::HTML::Parser, URL::Transform::using::XML::SAX (incomplete; was used only for benchmarking)
text/css
    URL::Transform::using::CSS::RegExp
text/html/meta-content
    URL::Transform::using::HTML::Meta
application/x-javascript
    URL::Transform::using::Remove
By passing the parser option to the URL::Transform->new() constructor you can choose which library is used to parse the document and to call the output and transform functions. Note that elements of a different type embedded in a document (inside text/html, for example) are transformed by the "default_for($document_type)" modules.
transform_function is called with the following arguments:

    transform_function->(
        'tag_name'       => 'img',
        'attribute_name' => 'src',
        'url'            => 'http://search.cpan.org/s/img/cpan_banner.png',
    );

and must return the (un)modified URL as its return value.
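As an illustration, here is a minimal sketch of a callback that could be passed as transform_function. It assumes the named arguments shown above (tag_name, attribute_name, url). The proxy address and the helper names escape_url and proxy_images are hypothetical and are not part of URL::Transform.

```perl
use strict;
use warnings;

# Hypothetical proxy endpoint, used only for this illustration.
my $PROXY = 'http://proxy.example/fetch?url=';

# Percent-encode everything except RFC 3986 unreserved characters.
# Byte-oriented: assumes the URL is ASCII or already encoded bytes.
sub escape_url {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Callback in the shape described above: named arguments in, URL out.
sub proxy_images {
    my (%args) = @_;

    # Leave every URL except <img src="..."> untouched.
    return $args{'url'}
        unless ($args{'tag_name'} // '') eq 'img'
           and ($args{'attribute_name'} // '') eq 'src';

    return $PROXY . escape_url($args{'url'});
}

print proxy_images(
    'tag_name'       => 'img',
    'attribute_name' => 'src',
    'url'            => 'http://search.cpan.org/s/img/cpan_banner.png',
), "\n";
```

Such a callback would be handed to the constructor as 'transform_function' => \&proxy_images. Returning the url argument unchanged is the way to skip links that should not be rewritten.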
output_function is called with each (already modified) chunk of the document, ready for output.
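The following sketch shows one way to write an output callback that collects the chunks in memory instead of printing them. make_collector is a hypothetical helper, not part of URL::Transform, and the chunks fed to it here only simulate what a parser might pass.

```perl
use strict;
use warnings;

# Build an output callback that appends every chunk to a private buffer.
# Returns the callback and a second closure that reads the buffer.
sub make_collector {
    my $buffer  = '';
    my $collect = sub { $buffer .= join '', @_; return; };
    my $result  = sub { return $buffer };
    return ($collect, $result);
}

my ($output_function, $get_output) = make_collector();

# Simulate the chunks a parser might hand to the callback.
$output_function->('<a href="');
$output_function->('http://example.com/', '">link</a>');

print $get_output->(), "\n";
```

Joining with an empty string keeps the chunks byte-for-byte. Interpolating "@_" instead would insert a space between the arguments whenever the callback receives more than one.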
    content_encoding
    document_type
    parser
    transform_function
    output_function

parser
    For HTML/XML this can be HTML::Parser or XML::SAX.
document_type
    text/html is the default.
transform_function
    Function that is called to make the transformation. It receives the URL text as the url argument.
output_function
    Reference to a function that receives the resulting output. The default is to use print.
content_encoding
    Can be set to gzip or deflate. The default is undef, meaning no content encoding.
Object constructor.
Requires the transform_function argument, which must be a CODE ref.
The remaining arguments are optional. Here is the list with their defaults:

    document_type    => 'text/html;charset=utf-8',
    output_function  => sub { print @_ },
    parser           => 'HTML::Parser',
    content_encoding => undef,
Returns the default parser for the supplied $document_type.
Can also be used as a setter by passing an additional argument, the parser name.
If called as an object method, it sets the default parser for that object. If called as a module function, it sets the default parser for the whole module.
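The object-level versus module-level behaviour can be sketched as follows. This is not the module's actual implementation. My::Defaults is a hypothetical class that only mimics the getter/setter behaviour described above.

```perl
use strict;
use warnings;

package My::Defaults;    # hypothetical class, for illustration only

# Module-wide defaults, keyed by document type.
my %module_default = ('text/html' => 'HTML::Parser');

sub new { return bless { 'default' => {} }, shift }

sub default_for {
    # Detect the calling style: object method, class method or plain function.
    my $self = (ref($_[0]) eq __PACKAGE__) ? shift(@_) : undef;
    shift(@_) if !$self && @_ && $_[0] eq __PACKAGE__;
    my ($document_type, $parser) = @_;

    # A second argument turns the call into a setter.
    if (defined $parser) {
        if ($self) { $self->{'default'}{$document_type} = $parser }
        else       { $module_default{$document_type}    = $parser }
    }

    # Object-level settings win over module-wide ones.
    return $self->{'default'}{$document_type} // $module_default{$document_type}
        if $self;
    return $module_default{$document_type};
}

package main;

my $object = My::Defaults->new;
$object->default_for('text/html', 'XML::SAX');         # affects this object only
print $object->default_for('text/html'), "\n";         # XML::SAX
print My::Defaults::default_for('text/html'), "\n";    # HTML::Parser
```

Keeping the per-object setting in the object and falling back to the module-wide hash means a change made through one object never affects other objects.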
Submits a document as a string for parsing.
This function must be implemented by the helper parsing classes.
Submits a chunk of a document for parsing.
This function should be implemented by the helper parsing classes.
Returns true or false depending on whether the parser can parse in chunks.
Submits a file for parsing.
# To simplify things, reformat the %HTML::Tagset::linkElements
# hash so that it is always a hash of hashes.
# Construct a hash of tag names that may have links.
# Construct a hash of all possible JavaScript attribute names
Returns the decoded string, suitable for parsing. The decoding is chosen according to $self->content_encoding.
Decoding is run automatically for every chunk, string or file.
Returns the encoded string. The encoding is chosen according to $self->content_encoding.
NOTE: if you want your content encoded back to $self->content_encoding, you have to call this method in your own code. The arguments to output_function() are always plain text.
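Because the output callback only ever sees plain text, re-encoding is a separate step after all chunks have arrived. The sketch below illustrates that step for gzip using the core IO::Compress::Gzip and IO::Uncompress::Gunzip modules directly, which is an assumption made for this example. It stands in for the encoding method described above and does not show the module's own code.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Plain-text chunks, collected the way an output_function would see them.
my $plain = '';
my $output_function = sub { $plain .= join '', @_ };
$output_function->('<html><body>', '<a href="/moved">x</a>', '</body></html>');

# Re-encode the collected text once all chunks have arrived.
my $gzipped;
gzip(\$plain => \$gzipped) or die "gzip failed: $GzipError";

# Round-trip check: decoding gives back the original text.
my $decoded;
gunzip(\$gzipped => \$decoded) or die "gunzip failed: $GunzipError";
print $decoded eq $plain ? "round trip ok\n" : "mismatch\n";
```

Compressing once at the end, rather than per chunk, yields a single gzip stream for the whole document.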
Returns a hash reference of the supported content encodings.
    Benchmark: timing 10000 iterations of HTML::Parser , XML::LibXML::SAX, XML::SAX::PurePerl...
    HTML::Parser      :   3 wallclock secs ( 2.41 usr + 0.04 sys = 2.45 CPU) @ 4081.63/s (n=10000)
    XML::LibXML::SAX  :  29 wallclock secs (27.22 usr + 0.11 sys = 27.33 CPU) @ 365.90/s (n=10000)
    XML::SAX::PurePerl: 192 wallclock secs (180.62 usr + 0.50 sys = 181.12 CPU) @ 55.21/s (n=10000)
There are URLs in the pics meta tag: <meta http-equiv="pics-label" content=" ... See http://www.w3.org/PICS/.
HTML::Parser, URL::Transform::using::HTML::Parser
Jozef Kutej <jkutej at cpan.org>
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.