NAME

HTML::SimpleLinkExtor - Extract links from HTML

SYNOPSIS

        use HTML::SimpleLinkExtor;

        my $extor = HTML::SimpleLinkExtor->new();
        $extor->parse_file($filename);
        #--or--
        $extor->parse($html);

        $extor->parse_file($other_file); # get more links

        $extor->clear_links; # reset the link list
        
        # extract all of the links
        my @all_links   = $extor->links;

        # extract the img links
        my @img_srcs    = $extor->img;

        # extract the frame links
        my @frame_srcs  = $extor->frame;

        # extract the hrefs
        my @area_hrefs  = $extor->area;
        my @a_hrefs     = $extor->a;
        my @base_hrefs  = $extor->base;
        my @hrefs       = $extor->href;

        # extract the body background link
        my @body_bg     = $extor->body;
        my @background  = $extor->background;

DESCRIPTION

This is a simple HTML link extractor designed for the person who does not want to deal with the intricacies of HTML::Parser or the de-referencing needed to get links out of HTML::LinkExtor.

You can extract all the links or some of the links (based on the HTML tag name or attribute name). If a <BASE HREF> tag is found, all of the relative URLs will be resolved according to that reference.

This module is simply a subclass of HTML::LinkExtor, so it can only parse what that module can handle. Invalid HTML or XHTML may cause problems.

If you parse multiple files, the link list grows and contains the aggregate list of links for all of the files parsed. If you want to reset the link list between files, use the clear_links method.
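The accumulation described above can be sketched roughly as follows. The HTML strings and variable names are invented for illustration, and the comments restate the documented behaviour rather than output checked against every release of the module.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical sample documents, used only for illustration.
my $first  = '<html><body><a href="one.html">One</a></body></html>';
my $second = '<html><body><a href="two.html">Two</a></body></html>';
my $third  = '<html><body><a href="three.html">Three</a></body></html>';

my $extor = HTML::SimpleLinkExtor->new;

# Parse two documents with the same object; per the description
# above, the link list should hold the links from both.
$extor->parse($first);
$extor->parse($second);
my @aggregate = $extor->links;

# Reset the list, then parse a third document on its own.
$extor->clear_links;
$extor->parse($third);
my @fresh = $extor->links;

print "aggregate: @aggregate\n";
print "fresh:     @fresh\n";
```

Reading the list once before calling clear_links, as this sketch does, keeps the two batches clearly separated.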

$extor = HTML::SimpleLinkExtor->new()

Create the link extractor object.

$extor = HTML::SimpleLinkExtor->new($base)

Create the link extractor object and resolve the relative URLs according to the supplied base URL. The supplied base URL overrides any other base URL found in the HTML.

$extor = HTML::SimpleLinkExtor->new('')

Create the link extractor object and do not resolve relative links.
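A minimal sketch comparing the three constructor forms on the same input may help. The document, the example.com and example.org URLs, and the a_links helper are all assumptions made for this illustration; they are not part of the module.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical document with its own BASE tag and one relative link.
my $html = <<'HTML';
<html>
<head><base href="http://www.example.com/docs/"></head>
<body><a href="page.html">A page</a></body>
</html>
HTML

# Helper for this sketch only: parse $html with a fresh extractor
# built from the given constructor arguments, and return the A-tag
# links as plain strings.
sub a_links {
    my @args  = @_;
    my $extor = HTML::SimpleLinkExtor->new(@args);
    $extor->parse($html);
    return map { "$_" } $extor->a;
}

# No argument: the BASE tag found in the HTML is used.
my @from_base_tag = a_links();

# Explicit base: the supplied URL takes precedence over the BASE tag.
my @from_argument = a_links('http://mirror.example.org/');

# Empty string: relative links are left as written.
my @unresolved = a_links('');

print "BASE tag:  @from_base_tag\n";
print "argument:  @from_argument\n";
print "no base:   @unresolved\n";
```

The helper stringifies each result because resolved links may come back as URI objects rather than plain strings.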

$extor->parse_file( $filename )

Parse the file for links.

$extor->parse( $data )

Parse the HTML in $data.

$extor->clear_links

Clear the link list. This way, you can use the same parser for another file.

$extor->links

Return a list of the links.

$extor->img

Return a list of the links from all the SRC attributes of the IMG tags.

$extor->frame

Return a list of the links from all the SRC attributes of the FRAME tags.

$extor->iframe

Return a list of the links from all the SRC attributes of the IFRAME tags.

$extor->frames

Return the combined list of links from frame and iframe.

$extor->src

Return a list of the links from all the SRC attributes of any tag.

$extor->a

Return a list of the links from all the HREF attributes of the A tags.

$extor->area

Return a list of the links from all the HREF attributes of the AREA tags.

$extor->base

Return a list of the links from all the HREF attributes of the BASE tags. There should only be one.

$extor->href

Return a list of the links from all the HREF attributes of any tag.

$extor->body, $extor->background

Return the link from the BODY tag's BACKGROUND attribute.

$extor->script

Return the link from the SCRIPT tag's SRC attribute.
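The difference between the tag-based and attribute-based methods above can be illustrated with a short sketch. The page content and file names are hypothetical, and no base URL is involved, so the links come back as written.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical page mixing several link-bearing tags.
my $html = <<'HTML';
<html>
<head><script src="app.js"></script></head>
<body background="tile.png">
<a href="about.html">About</a>
<img src="logo.png">
</body>
</html>
HTML

my $extor = HTML::SimpleLinkExtor->new;
$extor->parse($html);

# Tag-based methods select links by the tag they appear in.
my @a_links      = $extor->a;
my @img_links    = $extor->img;
my @script_links = $extor->script;

# Attribute-based methods select links by attribute name,
# regardless of the tag that carries the attribute.
my @src_links  = $extor->src;         # SCRIPT and IMG here
my @href_links = $extor->href;        # only the A tag here
my @bg_links   = $extor->background;  # the BODY background

print "a:          @a_links\n";
print "img:        @img_links\n";
print "script:     @script_links\n";
print "src:        @src_links\n";
print "href:       @href_links\n";
print "background: @bg_links\n";
```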

TO DO

This module doesn't handle all of the HTML tags that might have links. If someone wants those, I'll add them, or you can edit %AUTO_METHODS in the source.

CREDITS

Will Crain identified a problem with IMG links that had a USEMAP attribute.

SOURCE AVAILABILITY

This source is part of a SourceForge project which always has the latest sources in CVS, as well as all of the previous releases.

        http://sourceforge.net/projects/brian-d-foy/

If, for some reason, I disappear from the world, one of the other members of the project can shepherd this module appropriately.

AUTHORS

brian d foy, <bdfoy@cpan.org>

COPYRIGHT

Copyright (c) 2004-2005 brian d foy. All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.