View on
MetaCPAN is shutting down
For details read Perl NOC. After June 25th this page will redirect to
Glenn Wood > Bundle-WWW-Scraper-Job-0.01 > WWW::Scraper::JustTechJobs



Annotate this POD

View/Report Bugs
Module Version: 1.06   Source  


WWW::Scraper::JustTechJobs - Scrapes Just*


    require WWW::Search;
    $search = new WWW::Scraper('JustTechJobs');


This class is an JustTechJobs specialization of WWW::Search. It handles making and interpreting Just*Jobs searches http://www.Just* (where * is 'Perl', 'Java', etc).


search_debug, search_parse_debug, search_ref Specified at WWW::Search.


WWW::Scraper::JustTechJobs is written and maintained by Glenn Wood,


Copyright (c) 2001 Glenn Wood All rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

XML Scaffolding ^

Look at the idea from the perspective of the XML "scaffold" I'm suggesting for parsing the response HTML.

(This is XML, but looks superficially like HTML)

<HTML> <BODY> <TABLE NAME="name" or NUMBER="number"> <TR TYPE="header"/> <TR TYPE = "detail*"> <TD BIND="title" /> <TD BIND="description" /> <TD BIND="location" /> <TD BIND="url" PARSE="anchor" /> </TR> </TABLE> </BODY> </HTML>

This scaffold describes the relevant skeleton of an HTML document; there's HTML and BODY elements, of course. Then the <TABLE> entry tells our parser to skip to the TABLE in the HTML named "name", or skip "number" TABLE entries (default=0, to pick up first TABLE element.) Then the TABLE is described. The first <TR> is described as a "header" row. The parser throws that one away. The second <TR> is a "detail" row (the "*" means multiple detail rows, of course). The parser picks up each <TD> element, extracts it's content, and places that in the hash entry corresponding to its BIND= attribute. Thus, the first TD goes into $result->_elem('title') (I needed to learn to use LWP::MemberMixin. Thanks, another lesson learned!) The second TD goes into $result->_elem('description'), etc. (Of course, some of these are _elem_array, but these details will be resolved later). The PARSE= in the url TD suggests a way for our parser to do special handling of a data element. The generic scaffold parser would take this XML and convert it to a hash/array to be processed at run time; we wouldn't actually use XML at run time. A backend author would use that hash/array in his native_setup_search() code, calling the "scaffolder" scanner with that hash as a parameter.

As I said, this works great if the response is TABLE structured, but I haven't seen any responses that aren't that way already.

This converts to an array tree that looks like this:

    my $scaffold = [ 'HTML', 
                     [ [ 'BODY', 
                       [ [ 'TABLE', 'name' ,                  # or 'name' = undef; multiple <TABLE number=n> mean n 'TABLE's here ,
                         [ [ 'NEXT', 1, 'NEXT &gt;' ] ,       # meaning how to find the NEXT button.
                           [ 'TR', 1 ] ,                      # meaning "header".
                           [ 'TR', 2 ,                        # meaning "detail*"
                             [ [ 'TD', 1, 'title' ] ,         # meaning clear text binding to _elem('title').
                               [ 'TD', 1, 'description' ] ,
                               [ 'TD', 1, 'location' ] ,
                               [ 'TD', 2, 'url' ]             # meaning anchor parsed text binding to _elem('title').
                         ] ]
                       ] ]
                     ] ]
syntax highlighting: