NAME

HTML::LinkExtractor - Extract links from an HTML document

DESCRIPTION

HTML::LinkExtractor is used for extracting links from HTML. It is very similar to HTML::LinkExtor, except that besides getting the URL, you also get the link-text.

Example ( please run the examples ):

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$input);

    print Dumper($LX->links);
    __END__
    # the above example will yield
    $VAR1 = [
              {
                '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
                'href' => bless(do{\(my $o = 'http://perl.com/')}, 'URI::http'),
                'tag' => 'a'
              }
            ];

HTML::LinkExtractor will also correctly extract nested link-type tags.

SYNOPSIS

    ## the demo
    perl LinkExtractor.pm
    perl LinkExtractor.pm file.html othefile.html

    ## or if the module is installed, but you don't know where

    perl -MHTML::LinkExtractor -e" system $^X, $INC{q{HTML/LinkExtractor.pm}} "
    perl -MHTML::LinkExtractor -e' system $^X, $INC{q{HTML/LinkExtractor.pm}} '

    ## or

    use HTML::LinkExtractor;
    use LWP::Simple qw( get );

    my $base = 'http://search.cpan.org';
    my $html = get($base.'/recent');
    my $LX = new HTML::LinkExtractor();

    $LX->parse(\$html);

    print qq{<base href="$base">\n};

    for my $Link( @{ $LX->links } ) {
    ## new modules are linked  by /author/NAME/Dist
        if( $$Link{href}=~ m{^\/author\/\w+} ) {
            print $$Link{_TEXT}."\n";
        }
    }

    undef $LX;
    __END__

    ## or

    use HTML::LinkExtractor;
    use Data::Dumper;

    my $input = q{If <a href="http://perl.com/"> I am a LINK!!! </a>};
    my $LX = new HTML::LinkExtractor(
        sub {
            print Data::Dumper::Dumper(@_);
        },
        'http://perlFox.org/',
    );

    $LX->parse(\$input);
    $LX->strip(1);
    $LX->parse(\$input);
    __END__

METHODS

`$LX->new([\&callback, [$baseUrl, [1]]])`

Accepts 3 arguments, all of which are optional. If for example you want to pass a $baseUrl, but don't want to have a callback invoked, just put undef in place of a subref.

This is the only class method.

a callback ( a sub reference, as in sub{}, or \&sub) which is to be called each time a new LINK is encountered ( for @HTML::LinkExtractor::TAGS_IN_NEED this means after the closing tag is encountered )

The callback receives an object reference($LX) and a link hashref.
and a base URL ( URI->new, so its up to you to make sure it's valid which is used to convert all relative URI's to absolute ones.
```
    $ALinkP{href} = URI->new_abs( $ALink{href}, $base );
```
A "boolean" (just stick with 1). See the example in "DESCRIPTION". Normally, you'd get back _TEXT that looks like
```
    '_TEXT' => '<a href="http://perl.com/"> I am a LINK!!! </a>',
```
If you turn this option on, you'll get the following instead
```
    '_TEXT' => ' I am a LINK!!! ',
```
The private utility function _stripHTML does this by using HTML::TokeParsers method get_trimmed_text.

You can turn this feature on an off by using $LX->strip(undef || 0 || 1)

`$LX->parse( $filename || *FILEHANDLE || \$FileContent )`

Each time you call parse, you should pass it a $filename a *FILEHANDLE or a \$FileContent

Each time you call parse a new HTML::TokeParser::Simple object is created and stored in $this->{_tp}.

You shouldn't need to mess with the TokeParser object.

`$LX->links()`

Only after you call parse will this method return anything. This method returns a reference to an ArrayOfHashes, which basically looks like (Data::Dumper output)

    $VAR1 = [ { type => 'img', src => 'image.png' }, ];

Please note that if yo provide a callback this array will be empty.

`$LX->strip( [ 0 || 1 ])`

If you pass in undef (or nothing), returns the state of the option. Passing in a true or false value sets the option.

If you wanna know what the option does see $LX->new([\&callback, [$baseUrl, [1]]])

WHAT'S A LINK-type tag

Take a look at %HTML::LinkExtractor::TAGS to see what I consider to be link-type-tag.

Take a look at @HTML::LinkExtractor::VALID_URL_ATTRIBUTES to see all the possible tag attributes which can contain URI's (the links!!)

Take a look at @HTML::LinkExtractor::TAGS_IN_NEED to see the tags for which the '_TEXT' attribute is provided, like <a href="#"> TEST </a>

How can that be?!?!

I took at look at %HTML::Tagset::linkElements and the following URL's

    http://www.blooberry.com/indexdot/html/tagindex/all.htm

    http://www.blooberry.com/indexdot/html/tagpages/a/a-hyperlink.htm
    http://www.blooberry.com/indexdot/html/tagpages/a/applet.htm
    http://www.blooberry.com/indexdot/html/tagpages/a/area.htm

    http://www.blooberry.com/indexdot/html/tagpages/b/base.htm
    http://www.blooberry.com/indexdot/html/tagpages/b/bgsound.htm

    http://www.blooberry.com/indexdot/html/tagpages/d/del.htm
    http://www.blooberry.com/indexdot/html/tagpages/d/div.htm

    http://www.blooberry.com/indexdot/html/tagpages/e/embed.htm
    http://www.blooberry.com/indexdot/html/tagpages/f/frame.htm

    http://www.blooberry.com/indexdot/html/tagpages/i/ins.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/image.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/iframe.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/ilayer.htm
    http://www.blooberry.com/indexdot/html/tagpages/i/inputimage.htm

    http://www.blooberry.com/indexdot/html/tagpages/l/layer.htm
    http://www.blooberry.com/indexdot/html/tagpages/l/link.htm

    http://www.blooberry.com/indexdot/html/tagpages/o/object.htm

    http://www.blooberry.com/indexdot/html/tagpages/q/q.htm

    http://www.blooberry.com/indexdot/html/tagpages/s/script.htm
    http://www.blooberry.com/indexdot/html/tagpages/s/sound.htm

    And the special cases 

    <!DOCTYPE HTML SYSTEM "http://www.w3.org/DTD/HTML4-strict.dtd">
    http://www.blooberry.com/indexdot/html/tagpages/d/doctype.htm

    and

    <meta HTTP-EQUIV="Refresh" CONTENT="5; URL=http://www.foo.com/foo.html">
    http://www.blooberry.com/indexdot/html/tagpages/m/meta.htm

AUTHOR

podmaster (see CPAN) aka crazyinsomniac@yahoo.com

	Global
`s`	Focus search bar
`?`	Bring up this help dialog

	GitHub
`g` `p`	Go to pull requests
`g` `i`	go to github issues (only if github is preferred repository)

	POD
`g` `a`	Go to author
`g` `c`	Go to changes
`g` `i`	Go to issues
`g` `d`	Go to dist
`g` `r`	Go to repository/SCM
`g` `s`	Go to source
`g` `b`	Go to file browse

	Search terms
module: (e.g. module:Plugin)
distribution: (e.g. distribution:Dancer auth)
author: (e.g. author:SONGMU Redis)
version: (e.g. version:1.00)

NAME

DESCRIPTION

SYNOPSIS

METHODS

$LX->new([\&callback, [$baseUrl, [1]]])

$LX->parse( $filename || *FILEHANDLE || \$FileContent )

$LX->links()

$LX->strip( [ 0 || 1 ])