HTML::LinkExtractor - Extract links from an HTML document
HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.
Example ( please run the examples ):
    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$input);

    print Dumper($LX->links);
    __END__
    # the above example will yield
    $VAR1 = [
              {
                '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
                'tag' => 'a'
              }
            ];
HTML::LinkExtractor will also correctly extract nested link-type tags.
    ## the demo
    perl LinkExtractor.pm
    perl LinkExtractor.pm file.html othefile.html

    ## or if the module is installed, but you don't know where

    perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
    perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

    ## or

    use HTML::LinkExtractor;
    use LWP::Simple qw( get );

    my $base = 'http://search.cpan.org';
    my $html = get($base.'/recent');
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$html);

    print qq{<base href="$base">\n};

    for my $Link( @{ $LX->links } ) {
        ## new modules are linked by /author/NAME/Dist
        if( $$Link{href}=~ m{^\/author\/\w+} ) {
            print $$Link{_TEXT}."\n";
        }
    }

    undef $LX;
    __END__

    ## or

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor(
        sub {
            print Data::Dumper::Dumper(@_);
        },
        'http://perlFox.org/',
    );

    $LX->parse(\$input);
    $LX->strip(1);
    $LX->parse(\$input);
    __END__
$LX->new([\&callback, [$baseUrl, [1]]])
Accepts 3 arguments, all of which are optional. If for example you want to pass a $baseUrl, but don't want to have a callback invoked, just put undef in place of a subref.
This is the only class method.
A callback (a sub reference, as in sub{} or \&sub), which is called each time a new LINK is encountered (for @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered).
The callback receives an object reference($LX) and a link hashref.
A base URL (passed to URI->new, so it's up to you to make sure it's valid), which is used to convert all relative URIs to absolute ones.
$ALinkP{href} = URI->new_abs( $ALink{href}, $base );
A "boolean" (just stick with 1). See the example in "DESCRIPTION". Normally, you'd get back _TEXT that looks like
'_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
If you turn this option on, you'll get the following instead
'_TEXT' => ' I am a LINK!!! ',
The private utility function _stripHTML does this by using HTML::TokeParser's get_trimmed_text method.
You can turn this feature on and off by using $LX->strip(undef || 0 || 1).
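The three constructor arguments described above can be combined as in the following minimal sketch. It passes undef for the callback so that links accumulate in $LX->links, supplies a base URL, and turns stripping on. The base URL http://example.com/docs/ and the relative href are made up for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Hypothetical input containing one relative link.
my $input = q{See <a href="guide/intro.html"> the intro </a> first.};

# No callback (undef), an assumed base URL, and stripping turned on.
my $LX = HTML::LinkExtractor->new( undef, 'http://example.com/docs/', 1 );
$LX->parse( \$input );

for my $link ( @{ $LX->links } ) {
    # href should be a URI object made absolute against the base URL;
    # _TEXT should have its markup removed because strip is on.
    print "$link->{href} => $link->{_TEXT}\n";
}
```

Passing undef in the first slot keeps the extractor in its collecting mode, which is usually the simpler choice when you only need the list at the end.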
$LX->parse( $filename || *FILEHANDLE || \$FileContent )
Each time you call parse, you should pass it a $filename, a *FILEHANDLE, or a \$FileContent.
Each time you call parse a new HTML::TokeParser::Simple object is created and stored in $this->{_tp}.
You shouldn't need to mess with the TokeParser object.
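As a rough sketch of the three input forms, the script below writes a small throwaway file (page.html is a hypothetical name) and then parses the same document by file name, by filehandle, and by string reference. It uses a lexical filehandle in place of a bareword *FILEHANDLE, on the assumption that the underlying TokeParser accepts either, and it creates a fresh extractor for each call because the text above does not say whether links accumulate across calls to parse.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $html = q{<a href="http://perl.com/">Perl</a>};

# Write a throwaway file so the file name and filehandle forms have input.
my $filename = 'page.html';    # hypothetical file name
open my $out, '>', $filename or die "Cannot write $filename: $!";
print {$out} $html;
close $out;

# 1. A file name.
my $from_file = HTML::LinkExtractor->new();
$from_file->parse($filename);

# 2. An open filehandle.
open my $in, '<', $filename or die "Cannot read $filename: $!";
my $from_handle = HTML::LinkExtractor->new();
$from_handle->parse($in);
close $in;

# 3. A reference to a string holding the document.
my $from_string = HTML::LinkExtractor->new();
$from_string->parse( \$html );

for my $extractor ( $from_file, $from_handle, $from_string ) {
    print scalar @{ $extractor->links }, " link(s) found\n";
}

unlink $filename;
```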
$LX->links()
Only after you call parse will this method return anything. This method returns a reference to an ArrayOfHashes, which basically looks like (Data::Dumper output)
$VAR1 = [ { type => 'img', src => 'image.png' }, ];
Please note that if you provide a callback, this array will be empty.
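Because links() is not filled in callback mode, a callback that wants to keep the links has to store them itself. The sketch below does that with a plain array; @seen and the logo.png URL are illustrative names, not part of the module.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $input = q{<a href="http://perl.com/">Perl</a> and <img src="http://perl.com/logo.png">};

my @seen;    # our own storage, since links() is not filled in callback mode

my $LX = HTML::LinkExtractor->new(
    sub {
        my ( $extractor, $link ) = @_;    # object reference and link hashref
        push @seen, $link;
    }
);
$LX->parse( \$input );

for my $link (@seen) {
    # Print every key so that no particular attribute name is assumed.
    print join( ', ', map { "$_=$link->{$_}" } sort keys %$link ), "\n";
}
```

Collecting in the callback also lets you filter or transform each link as it arrives, instead of walking the full list afterwards.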
$LX->strip( [ 0 || 1 ])
If you pass in undef (or nothing), returns the state of the option. Passing in a true or false value sets the option.
If you wanna know what the option does see $LX->new([\&callback, [$baseUrl, [1]]])
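A small sketch of using strip as both getter and setter, based only on the behaviour described above: calling it with no argument reports the current state, and calling it with a true or false value changes it.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

my $LX = HTML::LinkExtractor->new();

# No argument: only query the current state.
print "strip is ", ( $LX->strip ? "on" : "off" ), "\n";

$LX->strip(1);    # turn stripping on
print "strip is ", ( $LX->strip ? "on" : "off" ), "\n";

$LX->strip(0);    # and off again
print "strip is ", ( $LX->strip ? "on" : "off" ), "\n";
```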
Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be a link-type tag.
Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URI's (the links!!)
Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like <a href="#"> TEST </a>
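One way to take that look without opening the source is to print the three package variables after loading the module. This is only a sketch: it assumes the variables are ordinary package globals reachable by their fully qualified names, and it prints just the hash keys because the shape of the values in %TAGS is not described here.

```perl
use strict;
use warnings;
use HTML::LinkExtractor;

# Tag names the module treats as link-type tags.
print "link-type tags: ",
    join( ' ', sort keys %HTML::LinkExtractor::TAGS ), "\n";

# Attributes that may hold a URI.
print "URL attributes: ",
    join( ' ', sort @HTML::LinkExtractor::VALID_URL_ATTRIBUTES ), "\n";

# Tags for which the '_TEXT' attribute is collected.
print "tags with _TEXT: ",
    join( ' ', sort @HTML::LinkExtractor::TAGS_IN_NEED ), "\n";
```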
I took a look at %HTML::Tagset::linkElements and the following URLs:
    http://www.blooberry.com/indexdot/html/tagindex/all.htm

    http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
    http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
    http://www.blooberry.com/indexdot/html/tagpages/a/area.htm
    http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
    http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm
    http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
    http://www.blooberry.com/indexdot/html/tagpages/d/div.htm
    http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
    http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm
    http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
    http://www.blooberry.com/indexdot/html/tagpages/l/link.htm
    http://www.blooberry.com/indexdot/html/tagpages/o/object.htm
    http://www.blooberry.com/indexdot/html/tagpages/q/q.htm
    http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
    http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

And the special cases

    <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
    http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm

and

    <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
    http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm
podmaster (see CPAN) aka crazyinsomniac@yahoo.com
HTML::LinkExtor, HTML::TokeParser::Simple, HTML::Tagset.