
WWW::CheckSite::Validator - A spider that assesses 'kwalitee' for a site

use WWW::CheckSite::Validator;
my $wcv = WWW::CheckSite::Validator->new(
    uri => 'http://www.test-smoke.org'
);
while ( my $info = $wcv->get_page ) {
    # handle the info
}

This is a subclass of WWW::CheckSite::Spider.
WWW::CheckSite::Validator starts its work after the spider has fetched the page. It will check these things:
All links on the page (<a href>, <area href>, <frame src>) are checked for availability.
All images on the page (<img src>, <input type=image>) are checked for availability.
All stylesheets on the page (<link rel=stylesheet type=text/css>) are checked for availability.
The contents of the page are sent to http://validator.w3.org for validation.

Extends WWW::CheckSite::Spider->new to check for Image::Info, so that a basic check can be done on the images.
This method overrides the WWW::CheckSite::Spider::process_page() method to check the availability of links, images and stylesheets. When requested, it also sends the page to W3.ORG for validation.
On top of the standard information it returns more:
The check_links() method gets information about the links on this page. If a link has no return status yet, it will HEAD the uri and update the cached status for this link, so the same uri is not HEADed more than once.
NOTE: This method does not respect the exclusion rules, and it respects the robot-rules only when strictrules is enabled!
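The HEAD-once behaviour described above can be sketched as follows. This is a minimal illustration, not the module's actual code: check_link_once() and %status_cache are hypothetical names, and the default request assumes LWP::UserAgent is installed.

```perl
use strict;
use warnings;

# Hypothetical cache: uri => HTTP status code from the first HEAD request.
my %status_cache;

# check_link_once() is an illustrative helper, not part of WWW::CheckSite.
# $head is a code ref that takes a uri and returns a status code; by
# default it issues a real HEAD request through LWP::UserAgent.
sub check_link_once {
    my ( $uri, $head ) = @_;
    $head ||= sub {
        require LWP::UserAgent;
        return LWP::UserAgent->new( timeout => 10 )->head( $_[0] )->code;
    };

    # Only contact the server when no status is cached for this uri.
    $status_cache{$uri} = $head->($uri) unless exists $status_cache{$uri};
    return $status_cache{$uri};
}
```

Passing the request as a code ref keeps the caching logic separate from the network call, so the cache can be exercised without touching a real server.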
The structure for links:
a/area tag
The check_images() method gets information about the images on the page. The list comes from the images() method of the mechanize object. It will only HEAD the uri.
The structure for images:
img/input tag
The check_styles() method checks the validity of stylesheets used in the page. We check for <link rel="stylesheet" type="text/css"> tags.
The structure for stylesheets:
The validate() method sends the uri or the page contents to W3.ORG for validation.
The fallback do-not-validate method.
Sends only the uri to W3.ORG and gets the validation result.
Create a temporary file (with File::Temp) from $agent->content, call the validator with that temporary file and save the result (as a boolean) in $stats->{validate}.
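The temporary-file step can be sketched roughly like this. The helper validate_via_tempfile() and the stand-in validator are assumptions made for illustration; only File::Temp and the $stats->{validate} key come from the description above.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# validate_via_tempfile() is an illustrative helper. $validator is any
# code ref that takes a filename and returns true when the file validates.
sub validate_via_tempfile {
    my ( $content, $validator ) = @_;

    # UNLINK => 1 asks File::Temp to remove the file when the program exits.
    my ( $fh, $filename ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
    print {$fh} $content;
    close $fh or die "Cannot close $filename: $!";

    # Reduce whatever the validator returns to a plain boolean.
    return $validator->($filename) ? 1 : 0;
}

my $stats = {};
$stats->{validate} = validate_via_tempfile(
    '<html><head><title>t</title></head><body></body></html>',
    sub { -s $_[0] },    # stand-in validator: true if the file is not empty
);
```

In real use the code ref would upload the file to the validator and inspect the response; the stand-in only shows where that call plugs in.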
Use the xmllint(1) program to validate the (X)HTML.
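As a rough sketch of calling xmllint(1) from Perl: the helper name xmllint_ok() is hypothetical, and it assumes xmllint is on the PATH and that an exit status of 0 means the document validated. The --valid option needs the document's DTD to be resolvable, so results may depend on the local setup.

```perl
use strict;
use warnings;

# xmllint_ok() is an illustrative helper, not part of WWW::CheckSite.
sub xmllint_ok {
    my ($filename) = @_;

    # --noout suppresses the serialised document,
    # --valid validates it against its DTD.
    my $rc = system( 'xmllint', '--noout', '--valid', $filename );

    # Any non-zero result (including "xmllint not found") counts as not valid.
    return $rc == 0 ? 1 : 0;
}
```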
Dispatch the validation to the right method.
The fallback do-not-validate-stylesheet method.
Sends only the uri to JIGSAW.W3.ORG and gets the validation result.
Create a temporary file (with File::Temp) from $ua->content, call the validator with that temporary file and return the result.
This is more of a basic consistency check that uses Image::Info::image_info().
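A consistency check of this kind might look like the sketch below. The helper image_looks_ok() is a hypothetical name; the sketch assumes the raw image data is already in memory and relies on image_info() reporting problems through the 'error' key of the hash it returns.

```perl
use strict;
use warnings;

# Only use Image::Info when it can be loaded, as the constructor does.
my $have_image_info = eval { require Image::Info; 1 };

# image_looks_ok() is an illustrative helper. It returns undef when
# Image::Info is not installed, otherwise 1 (looks fine) or 0 (error).
sub image_looks_ok {
    my ($data) = @_;
    return unless $have_image_info;

    # image_info() accepts a reference to the raw image data.
    my $info = Image::Info::image_info( \$data );
    return $info->{error} ? 0 : 1;
}
```

A true result only means that Image::Info could parse the data, not that the image renders correctly in a browser.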
Check if the content-type is "validatable".
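One way such a content-type test could be written is shown below. The helper is_validatable() is hypothetical, and the set of accepted types (text/html and application/xhtml+xml) is an assumption; the module's own list may differ.

```perl
use strict;
use warnings;

# is_validatable() is an illustrative helper, not the module's method.
sub is_validatable {
    my ($content_type) = @_;
    return 0 unless defined $content_type;

    # Drop parameters such as "; charset=utf-8" and normalise case.
    ( my $type = lc $content_type ) =~ s/\s*;.*\z//s;
    $type =~ s/\A\s+|\s+\z//g;

    # Assumed list of types worth sending to an (X)HTML validator.
    return $type =~ m{\A(?:text/html|application/xhtml\+xml)\z} ? 1 : 0;
}
```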
Why?

WWW::CheckSite::Spider, WWW::CheckSite

Abe Timmerman, <abeltje@cpan.org>

Please report any bugs or feature requests to bug-WWW-CheckSite@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

Copyright MMV Abe Timmerman, All Rights Reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.