NAME

HTML::Miner - Mines (hopefully) useful information from a URL or an HTML snippet.

VERSION

Version 1.02

SYNOPSIS

HTML::Miner mines (hopefully) useful information from a URL or an HTML snippet. The following information can be extracted:

  • Find all links and for each link extract:

    URL Title
    URL href
    URL Anchor Text
    URL Domain
    URL Protocol
    URL URI
    URL Absolute location
  • Find all images and for each image extract:

    IMG Source URL
    IMG Absolute Source URL
    IMG Source Domain
  • Extract meta elements such as:

    Page Title
    Page Description
    Page Keywords
    Page RSS Feeds
  • Find the final destination URL of a potentially redirecting URL.

  • Find all JS and CSS files used within the HTML and, if required, resolve them to absolute URLs.

Example ( Object Oriented Usage )

    use HTML::Miner;

    my $html = "some html";
    # or $html = do{local $/;<DATA>}; with __DATA__ provided

    my $html_miner = HTML::Miner->new ( 

      CURRENT_URL                   => 'www.perl.org'   , 
      CURRENT_URL_HTML              => $html 

    );


    my $meta_data =  $html_miner->get_meta_elements()   ;
    my $links     = $html_miner->get_links()            ;
    my $images    = $html_miner->get_images()           ;

    my ( $clear_url, $protocol, $domain, $uri ) = $html_miner->break_url();  

    my $css_and_js =  $html_miner->get_page_css_and_js() ;

    my $out = HTML::Miner::get_redirect_destination( "redirectingurl_here.html" ) ;

    $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );

Example ( Direct access of Methods )

    use HTML::Miner;

    my $html = "some html";
    # or $html = do{local $/;<DATA>}; with __DATA__ provided

    my $url = "http://www.perl.org";

    my $meta_data  = HTML::Miner::get_meta_elements( $url, $html ) ;
    my $links      = HTML::Miner::get_links( $url, $html )         ;
    my $images     = HTML::Miner::get_images( $url, $html )        ;

    my ( $clear_url, $protocol, $domain, $uri ) = HTML::Miner::break_url( $url );  

    my $css_and_js = HTML::Miner::get_page_css_and_js( 
           URL                       =>    $url   , 
           HTML                      =>    $html  ,   # Optional, fetched from URL if not provided
           CONVERT_URLS_TO_ABS       =>    1      ,   # Optional, 0 or 1, default is 1
    );

    my $destination = HTML::Miner::get_redirect_destination( "redirectingurl_here.html" ) ;

    my $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );

Test Data

    __DATA__

      <html>
      <head>
          <title>SiteTitle</title>
          <meta name="description" content="desc of site" />
          <meta name="keywords"    content="kw1, kw2, kw3" />
          <link rel="alternate" type="application/atom+xml" title="Title" href="http://www.my_domain_to_mine.com/feed/atom/" />
          <link rel="alternate" type="application/rss+xml" title="Title" href="http://www.othersite.com/feed/" />
          <link rel="alternate" type="application/rdf+xml" title="Title" href="my_domain_to_mine.com/feed/" /> 
          <link rel="alternate" type="text/xml" title="Title" href="http://www.other.org/feed/rss/" />
          <script type="text/javascript" src="http://static.myjsdomain.com/frameworks/barlesque.js"></script>
          <script type="text/javascript" src="http://js.revsci.net/gateway/gw.js?csid=J08781"></script>
          <script type="text/javascript" src="/about/other.js"></script>
          <link rel="stylesheet" type="text/css" href="http://static.mycssdomain.com/frameworks/style/main.css"  />
      </head>
      <body>
      
      <a href="http://linkone.com">Link1</a>
      <a href="link2.html" TITLE="title2" >Link2</a>
      <a href="/link3">Link3</a>
      
      
      <img src="http://my_domain_to_mine.com/logo_plain.jpg" >
      <img alt="image2" src="http://my_domain_to_mine.com/image2.jpg" />
      <img src="http://my_other.com/image3.jpg" alt="link3">
      <img src="image3.jpg" alt="link3">
      
      
      </body>
      </html>

Example Output:

    my $meta_data =  $html_miner->get_meta_elements() ;

    # $meta_data->{ TITLE }             =>   "SiteTitle"
    # $meta_data->{ DESC }              =>   "desc of site"
    # $meta_data->{ KEYWORDS }->[0]     =>   "kw1"
    # $meta_data->{ RSS }->[0]->{TYPE}  =>   "application/atom+xml"



    my $links = $html_miner->get_links();

    # $links->[0]->{ DOMAIN }         =>   "linkone.com"
    # $links->[0]->{ ANCHOR }         =>   "Link1"
    # $links->[2]->{ ABS_URL   }      =>   "http://my_domain_to_mine.com/link3"
    # $links->[1]->{ DOMAIN_IS_BASE } =>   1
    # $links->[1]->{ TITLE }          =>   "title2"



    my $images = $html_miner->get_images();

    # $images->[0]->{ IMG_LOC }     =>  "http://my_domain_to_mine.com/logo_plain.jpg"
    # $images->[2]->{ ALT }         =>  "link3"
    # $images->[0]->{ IMG_DOMAIN }  =>  "my_domain_to_mine.com"
    # $images->[3]->{ ABS_LOC }     =>  "http://my_domain_to_mine.com/image3.jpg"



    my $css_and_js =  $html_miner->get_page_css_and_js(
         CONVERT_URLS_TO_ABS       =>    0
    );

    # $css_and_js will contain:
    #    {
    #      CSS => [
    #         "http://static.mycssdomain.com/frameworks/style/main.css",
    #         "/rel_cssfile.css",
    #        ],
    #      JS  => [
    #          "http://static.myjsdomain.com/frameworks/barlesque.js",
    #          "http://js.revsci.net/gateway/gw.js?csid=J08781",
    #          "/about/rel_jsfile.js",
    #        ],
    #    }


    my $css_and_js =  $html_miner->get_page_css_and_js(
         CONVERT_URLS_TO_ABS       =>    1
    );

    # $css_and_js will contain:
    #    {
    #      CSS => [
    #         "http://static.mycssdomain.com/frameworks/style/main.css",
    #         "http://www.perl.org/rel_cssfile.css",
    #        ],
    #      JS  => [
    #          "http://static.myjsdomain.com/frameworks/barlesque.js",
    #          "http://js.revsci.net/gateway/gw.js?csid=J08781",
    #          "http://www.perl.org/about/rel_jsfile.js",
    #        ],
    #    }



    my ( $clear_url, $protocol, $domain, $uri ) = $html_miner->break_url();  

    # $clear_url   =>  "http://my_domain_to_mine.com/my_page_to_mine.pl"
    # $protocol    =>  "http"
    # $domain      =>  "my_domain_to_mine.com"
    # $uri         =>  "/my_page_to_mine.pl"


    my $destination = HTML::Miner::get_redirect_destination( "redirectingurl_here.html" );
    # $destination    => 'redirected_to'



    my $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );
    # $out    => "http://www.perl.com/about/"

    $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/index.html", "index2.html" );
    # $out    => "http://www.perl.com/help/faq/index2.html"

    $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../index.html" );
    # $out    => "http://www.perl.com/index.html"

    $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "/about/" );
    # $out    => "http://www.perl.com/about/"

    $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "http://othersite.com" );
    # $out    => "http://othersite.com/"

EXPORT

This module does not export anything through @EXPORT; however, all externally available functions can be imported on request through @EXPORT_OK.

SUBROUTINES/METHODS

The following functions are all available directly and through the HTML::Miner Object.

new

The constructor validates the input data and retrieves the HTML of the URL if the HTML is not provided.

The constructor takes the following parameters:

  my $foo = HTML::Miner->new ( 
      CURRENT_URL                   => 'www.site_i_am_crawling.com/page_i_am_crawling.html'   , # REQUIRED - 'new' will croak 
                                                                                                  #           if this is not provided. 
      CURRENT_URL_HTML              => 'long string here'                                     , # Optional -  Will be extracted 
                                                                                                  #      from CURRENT_URL if not provided. 
      USER_AGENT                    => 'Perl_HTML_Miner/$VERSION'                             , # Optional - default: 
                                                                                                  #      'Perl_HTML_Miner/$VERSION'
      TIMEOUT                       => 5                                                      , # Optional - default: 5 ( Seconds )

      DEBUG                         => 0                                                      , # Optional - default: 0

  );

get_links

This function extracts all URLs from a web page.

Syntax:

   When called on an HTML::Miner Object :
 
          $return_element = $html_miner->get_links();

   When called directly                 :

          $return_element = get_links( $url, $optionally_html_of_url );

   The direct call is intended to be a simplified version of the OO call 
       and so does not allow customization of the user agent, timeout and so on.

Output:

This function ( regardless of how it's called ) returns a reference to an Array of Hashes whose structure is as follows:

    $->Array( 
       Hash->{ 
           "URL"             => "extracted url"                        ,
           "ABS_EXISTS"      => "0_if_abs_url_extraction_failed"       , 
           "ABS_URL"         => "absolute_location_of_extracted_url"   ,
           "TITLE"           => "title_of_this_url"                    , 
           "ANCHOR"          => "anchor_text_of_this_url"              ,
           "DOMAIN"          => "domain_of_this_url"                   ,
           "DOMAIN_IS_BASE"  => "1_if_this_domain_same_as_base_domain" ,
           "PROTOCOL"        => "protocol_of_this_domain"              ,
           "URI"             => "URI_of_this_url"                      ,
       }, 
         ... 
    )

So, to access the title of the second URL found you would use (yes, the order is maintained):

     $return_element->[1]->{ TITLE }

NOTES:

    If ABS_EXISTS is 0 then DOMAIN, DOMAIN_IS_BASE, PROTOCOL and URI will be undefined

    To extract URLs from an HTML snippet when one does not care about the URL of that page, simply pass some garbage as the URL 
         and ignore everything except URL, TITLE and ANCHOR

    "ANCHOR" might contain HTML such as <span>, use HTML::Strip if required. 

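As a rough illustration of working with this structure, the sketch below filters a hand-written sample for external links. The sample data is hypothetical: it is only shaped like the hashes documented above and is not real get_links output. ABS_EXISTS is tested first because DOMAIN_IS_BASE is undefined when absolute URL extraction failed.

```perl
use strict;
use warnings;

# Hypothetical sample data shaped like the documented return value;
# in real use this would be the result of get_links().
my $links = [
    { URL => 'http://linkone.com', ABS_EXISTS => 1, ABS_URL => 'http://linkone.com/',
      ANCHOR => 'Link1', DOMAIN => 'linkone.com', DOMAIN_IS_BASE => 0 },
    { URL => 'link2.html', ABS_EXISTS => 1, ABS_URL => 'http://www.perl.org/link2.html',
      ANCHOR => 'Link2', DOMAIN => 'www.perl.org', DOMAIN_IS_BASE => 1 },
    # Assumed example of an entry whose absolute URL could not be worked out.
    { URL => 'javascript:void(0)', ABS_EXISTS => 0, ANCHOR => 'Broken' },
];

# Check ABS_EXISTS before DOMAIN_IS_BASE: the latter is undefined when
# absolute URL extraction failed.
my @external = grep { $_->{ABS_EXISTS} && !$_->{DOMAIN_IS_BASE} } @$links;

print "$_->{ANCHOR} => $_->{ABS_URL}\n" for @external;
```
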
get_page_css_and_js

This function extracts all CSS style sheets and JS script files used on a web page.

Syntax:

   When called on an HTML::Miner Object :
 
          $return_element = $html_miner->get_page_css_and_js(
               CONVERT_URLS_TO_ABS       =>    1 ,                         # Optional, 0 or 1, default is 1
          );

   When called directly                 :

          $return_element = get_page_css_and_js( 
               URL                       =>    $url                     , 
               HTML                      =>    $optionally_html_of_url  ,  # Optional, HTML is fetched if not provided
               CONVERT_URLS_TO_ABS       =>    1                        ,  # Optional, 0 or 1, default is 1
          );

   The direct call is intended to be a simplified version of the OO call 
       and so does not allow customization of the user agent, timeout and so on.

Output:

This function ( regardless of how it's called ) returns a reference to a Hash ( keyed by CSS and JS ) of Arrays containing the URLs:

    $->HASH->{ 
          "CSS"   => Array( "extracted url1", "extracted url2", .. )
          "JS"    => Array( "extracted url1", "extracted url2", .. )
      }

So, to access the URL of the second CSS style sheet found you would use (again, the order is maintained):

     $return_element->{ CSS }[1];

Or, dereferencing the array first:

     my @css_data = @{ $return_element->{ CSS } };
     my $second_css_url_found = $css_data[1];

NOTES:

To extract CSS and JS links from an HTML snippet when one does not care about the URL of that page, simply set CONVERT_URLS_TO_ABS to 0; the URLs are then returned exactly as they appear in the HTML.
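
A minimal sketch of using the unconverted output: with CONVERT_URLS_TO_ABS set to 0, page-relative assets can be picked out by checking for a missing scheme. The data below is hand-copied from the SYNOPSIS example output rather than produced by a live call, and the scheme test is an assumption made for this illustration.

```perl
use strict;
use warnings;

# Sample data copied from the SYNOPSIS example output (not a live call).
my $css_and_js = {
    CSS => [
        'http://static.mycssdomain.com/frameworks/style/main.css',
        '/rel_cssfile.css',
    ],
    JS => [
        'http://static.myjsdomain.com/frameworks/barlesque.js',
        'http://js.revsci.net/gateway/gw.js?csid=J08781',
        '/about/rel_jsfile.js',
    ],
};

# Anything without a "scheme://" prefix is treated as relative to the page.
my @relative = grep { !m{^[a-z][a-z0-9+.-]*://}i }
               map  { @{ $css_and_js->{$_} } } qw( CSS JS );

print "$_\n" for @relative;
```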

get_absolute_url

This function takes two arguments: the base URL of the page within whose HTML a second (possibly relative) URL was found, and that second URL. It returns the absolute location of the second URL.

Example:

    my $out = HTML::Miner::get_absolute_url( "www.perl.com/help/faq/", "../../about/" );

    Will return:

          http://www.perl.com/about/

NOTE:

    This function cannot be called on the HTML::Miner Object. 
    The function get_links does this for all URLs found on a webpage. 
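
To make the idea of resolving a relative URL against a base concrete, here is a simplified stand-in written with core Perl only. It is not HTML::Miner's implementation and does not cover all of RFC 3986 (query strings, fragments and some dot-segment edge cases are ignored). The helper name resolve_relative and the choice to assume http for scheme-less bases are assumptions made for this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Miner.
sub resolve_relative {
    my ( $base, $rel ) = @_;

    # A URL that already has a scheme is returned unchanged.
    return $rel if $rel =~ m{^[a-zA-Z][a-zA-Z0-9+.-]*://};

    # Assumption: a base without a scheme is treated as http.
    $base = "http://$base" unless $base =~ m{^[a-zA-Z][a-zA-Z0-9+.-]*://};

    my ( $origin, $path ) =
        $base =~ m{^([a-zA-Z][a-zA-Z0-9+.-]*://[^/]+)(/.*)?$}
        or return;
    $path = '/' unless defined $path;

    my $combined;
    if ( $rel =~ m{^/} ) {
        $combined = $rel;                      # root-relative
    }
    else {
        ( my $dir = $path ) =~ s{[^/]*$}{};    # drop the file name, keep the directory
        $combined = $dir . $rel;
    }

    # Collapse "." and ".." segments without climbing above the root.
    my @out;
    for my $seg ( split m{/}, $combined, -1 ) {
        if    ( $seg eq '..' ) { pop @out if @out > 1 }
        elsif ( $seg eq '.' )  { next }
        else                   { push @out, $seg }
    }
    return $origin . join( '/', @out );
}

print resolve_relative( 'www.perl.com/help/faq/', '../../about/' ), "\n";
```

For production use, a dedicated URL library is likely a better choice than a hand-rolled resolver like this one.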

break_url

This function, given a URL, returns the input URL in its 'standard' form along with its Protocol, Domain and URI.

Syntax:

It is called on the HTML::Miner Object as follows:

    my ( $clear_url, $protocol, $domain, $uri ) = $html_miner->break_url();

    NOTE: This will return the details of the 'CURRENT_URL'

It is called directly as follows:

    my ( $clear_url, $protocol, $domain, $uri ) = HTML::Miner::break_url( 'www.perl.org/help/faq/' );

Output:

    Input
   
         www.perl.org/help/faq

    Output
      
         clean_url --> http://www.perl.org/help/faq/
         protocol  --> http
         domain    --> www.perl.org
         uri       --> help/faq/
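
The kind of split described here can be approximated with a single regular expression. The sketch below is an independent illustration, not the module's code: the helper name split_url is hypothetical, a missing scheme is assumed to mean http, and the URI part is returned with its leading slash, which may differ from what break_url itself returns.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Miner.
sub split_url {
    my ($url) = @_;

    # Assumption: a URL without a scheme is treated as http.
    $url = "http://$url" unless $url =~ m{^[a-z][a-z0-9+.-]*://}i;

    my ( $protocol, $domain, $uri ) =
        $url =~ m{^([a-z][a-z0-9+.-]*)://([^/?#]+)(.*)$}i
        or return;

    # Scheme and host are case-insensitive, so normalise them.
    $protocol = lc $protocol;
    $domain   = lc $domain;
    $uri      = '/' if $uri eq '';

    return ( "$protocol://$domain$uri", $protocol, $domain, $uri );
}

my ( $clean_url, $protocol, $domain, $uri ) = split_url('www.perl.org/help/faq/');
print "$clean_url\n$protocol\n$domain\n$uri\n";
```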

get_redirect_destination

This function takes, as argument, a URL that potentially redirects to another URL ( possibly through a chain of several redirects ) and returns the FINAL destination URL.

This function REQUIRES access to the web.

Example:

    my $destination_url = HTML::Miner::get_redirect_destination( 
       'http://rss.cnn.com/~r/rss/edition_world/~3/403863461/index.html' , 
       'optional_user_agent',
       'optional_timeout'
    );

    $destination_url will contain:

       "http://edition.cnn.com/2008/WORLD/americas/09/26/russia.chavez/index.html?eref=edition_world"

NOTES:

   This function CANNOT be called on the HTML::Miner Object.

WARNING:

   This function is NOT thread safe, use get_redirect_destination_thread_safe ( described below ) if this function is 
     being used within a thread and there is a chance that any of the interim redirect URLs are HTTPS.
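
For readers who want to see what following a redirect chain involves, the sketch below uses the core module HTTP::Tiny, which follows redirects itself and records the last URL queried in the url key of its response hash. It is an independent illustration and makes no claim about how HTML::Miner does this internally; the helper name final_url, the default agent string and the redirect limit of 15 are assumptions. HTTPS URLs additionally require IO::Socket::SSL to be installed.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper, not part of HTML::Miner.
sub final_url {
    my ( $url, $agent, $timeout ) = @_;

    my $http = HTTP::Tiny->new(
        agent        => $agent   // 'example-agent/0.1',   # assumed default
        timeout      => $timeout // 5,
        max_redirect => 15,                                # assumed limit
    );

    my $res = $http->get($url);

    # On success, {url} holds the last URL in the redirect chain.
    return $res->{success} ? $res->{url} : undef;
}

# Only touches the network when a URL is given on the command line.
print( ( final_url( $ARGV[0] ) // 'unresolved' ), "\n" ) if @ARGV;
```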

get_redirect_destination_thread_safe

This function takes, as argument, a URL that potentially redirects to another URL ( possibly through a chain of several redirects ) and returns the FINAL destination URL. Unlike get_redirect_destination, it is thread safe.

This function REQUIRES access to the web.

Example:

    my $destination_url = HTML::Miner::get_redirect_destination_thread_safe( 
       'on.fb.me/qoBoK' , 
       'optional_user_agent',
       'optional_timeout'
    );

    $destination_url will contain:

       "https://www.facebook.com"

NOTES:

   This function CANNOT be called on the HTML::Miner Object.
   This function hits the web once for each redirect that it tracks, so to find the destination of a URL that redirects 15 times it will
        access the web 15 times. Do NOT use this function instead of get_redirect_destination unless you have to. 

get_images

This function extracts all images from a web page.

Syntax:

   When called on an HTML::Miner Object :
 
          $return_element = $html_miner->get_images();

   When called directly                 :

          $return_element = get_images( $url, $optionally_html_of_url );

   The direct call is intended to be a simplified version of the OO call 
       and so does not allow customization of the user agent, timeout and so on.

Output:

This function ( regardless of how it's called ) returns a reference to an Array of Hashes whose structure is as follows:

    $->Array( 
       Hash->{ 
           "IMG_LOC"         => "extracted_image"                        ,
           "ALT"             => "alt_text_of_this_image"                 ,
           "ABS_EXISTS"      => "0_if_abs_url_extraction_failed"         , 
           "ABS_LOC"         => "absolute_location_of_extracted_image"   ,
           "IMG_DOMAIN"      => "domain_of_this_image"                   ,
           "DOMAIN_IS_BASE"  => "1_if_this_domain_same_as_base_domain"   ,
       }, 
         ... 
    )

So, to access the alt text of the second image found you would use (yes, the order is maintained):

     $return_element->[1]->{ ALT }

NOTE:

    If ABS_EXISTS is 0 then IMG_DOMAIN and DOMAIN_IS_BASE will be undefined

    To extract images from an HTML snippet when one does not care about the URL of that page, simply pass some garbage as 
           the URL and ignore the absolute locations and domains ( use only IMG_LOC and ALT ).

get_meta_elements

This function retrieves the following meta elements for a given URL (or HTML snippet)

    Page Title
    Meta Description
    Meta Keywords
    Page RSS Feeds

Syntax:

It is called through the HTML::Miner Object as follows:

    $return_hash = $html_miner->get_meta_elements( );

It is called directly as follows:

    $return_hash = HTML::Miner::get_meta_elements( 
                                    URL   => "url_of_page"  ,
                                    HTML  => "html_of_page"
                                );

    Note: The above function requires either the URL or the HTML. If the 
          HTML is not provided then the URL is used to retrieve the HTML.
          If neither is provided this function will croak.

          Again this function does not allow for customization of User Agent
          and timeout when called directly. 

Output:

In either case the returned hash is of the following structure:

    $return_hash = { 
               TITLE     =>   'title_of_page'         ,
               DESC      =>   'description_of_page'   ,
               KEYWORDS  =>   
                    'reference to Array of keywords'  ,
               RSS       => 
                    'reference to Array of Hashes of RSS links' ( as below )
     }


    $return_hash->{ RSS } = [
             {
               TYPE      => 'eg: application/atom+xml',
               TITLE     => 'Title of this RSS Feed'  ,
               URL       => 'URL of this RSS Feed'
             },
                 ...
    ]
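
A short sketch of consuming the returned hash, assuming the structure documented above. The sample data is hand-written from the SYNOPSIS test page rather than fetched, and the %feeds_by_type grouping is only an illustration of walking the RSS array of hashes.

```perl
use strict;
use warnings;

# Hypothetical sample shaped like the documented return hash;
# in real use this would come from get_meta_elements().
my $meta = {
    TITLE    => 'SiteTitle',
    DESC     => 'desc of site',
    KEYWORDS => [ 'kw1', 'kw2', 'kw3' ],
    RSS      => [
        { TYPE  => 'application/atom+xml',
          TITLE => 'Title',
          URL   => 'http://www.my_domain_to_mine.com/feed/atom/' },
        { TYPE  => 'application/rss+xml',
          TITLE => 'Title',
          URL   => 'http://www.othersite.com/feed/' },
    ],
};

print "Title:    $meta->{TITLE}\n";
print "Keywords: ", join( ', ', @{ $meta->{KEYWORDS} } ), "\n";

# Group feed URLs by their MIME type.
my %feeds_by_type;
push @{ $feeds_by_type{ $_->{TYPE} } }, $_->{URL} for @{ $meta->{RSS} };

print "$_: @{ $feeds_by_type{$_} }\n" for sort keys %feeds_by_type;
```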

INTERNAL SUBROUTINES/METHODS

These functions are used by the module. They are not meant to be called directly using the HTML::Miner object, although there is nothing stopping you from doing that.

_get_url_html

This is an internal function and is not to be used externally.

_convert_to_valid_url

This is an internal function and is not to be used externally.

AUTHOR

Harish T Madabushi, <harish.tmh at gmail.com>

BUGS

Please report any bugs or feature requests to bug-html-miner at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=HTML-Miner. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc HTML::Miner

ACKNOWLEDGEMENTS

Thanks to user ultranerds from http://perlmonks.org/?node_id=721567 for suggesting and helping with JS and CSS extraction.

LICENSE AND COPYRIGHT

Copyright (C) 2009 Harish Madabushi, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the terms of the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0

Any use, modification, and distribution of the Standard or Modified Versions is governed by this Artistic License. By using, modifying or distributing the Package, you accept this license. Do not use, modify, or distribute the Package, if you do not accept this license.

If your Modified Version has been derived from a Modified Version made by someone other than you, you are nevertheless required to ensure that your Modified Version complies with the requirements of this license.

This license does not grant you the right to use any trademark, service mark, tradename, or logo of the Copyright Holder.

This license includes the non-exclusive, worldwide, free-of-charge patent license to make, have made, use, offer to sell, sell, import and otherwise transfer the Package with respect to any patent claims licensable by the Copyright Holder that are necessarily infringed by the Package. If you institute patent litigation (including a cross-claim or counterclaim) against any party alleging that the Package constitutes direct or contributory patent infringement, then this Artistic License to you shall terminate on the date that such litigation is filed.

Disclaimer of Warranty: THE PACKAGE IS PROVIDED BY THE COPYRIGHT HOLDER AND CONTRIBUTORS "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES. THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, OR NON-INFRINGEMENT ARE DISCLAIMED TO THE EXTENT PERMITTED BY YOUR LOCAL LAW. UNLESS REQUIRED BY LAW, NO COPYRIGHT HOLDER OR CONTRIBUTOR WILL BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING IN ANY WAY OUT OF THE USE OF THE PACKAGE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.