Michael De La Rue > WWW-Link > WWW::Link

Module Version: 0.036

NAME

WWW::Link - maintain information about the state of links

SYNOPSIS

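       use WWW::Link;
       # construct a link object from its url
       my $link = WWW::Link->new("http://www.bounce.com/");
       # record that a test of the link failed
       $link->failed_test;
       # is_okay is true unless the link is considered damaged
       $link->is_okay or warn "link not validated";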

DESCRIPTION

WWW::Link is a Perl class which accepts and maintains information about links, for example the urls which are referenced from a WWW page.

Programs such as link checkers act on the link class to give it information, and other programs convert that information into something which can be used by humans.

METHODS

new

The constructor for links expects a url as a string.

status

The status is effectively a bit field. Some of the options are mutually exclusive.

  guessed similar ; followed link on page ;

   couldnt_test - for some reason the link tester has been unable to
                   check the status of the link

  robot exclusion ; server overload ; known network break

initialise

Set up each of the variables in a best-guess starting state.

status

There are a number of methods for checking the status of a link. These will keep their meaning, although the details of how they do their tests are likely to vary.

is_okay

The link is not considered to have been damaged. N.B. this could just mean that we haven't checked it yet. Use validated_okay to verify that.

is_not_tested

The link has not been examined and the system doesn't know if it is good or bad.

is_abandoned

We've been testing the link and finding it broken for so long we aren't interested in it any more.

is_broken

After repeated attempts (as defined by the user) to validate it, no answer was received and the link is considered broken.

is_damaged

The link failed a test recently, but we think that it needs more time before we can consider it broken.

is_redirected

The link was examined and an explicit redirect was found.

validated_okay

The link has been examined and was definitely okay.

add_status

Bitwise-or the given value into the status flags.

remove_status

Remove the given value from the status flags.
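Treating the status as a bit field, setting and clearing flags can be sketched as follows. This is a minimal illustration only: the flag names and values (FLAG_BROKEN and so on) and the helper subs are hypothetical and are not taken from the WWW::Link source.

```perl
use strict;
use warnings;

# Hypothetical flag values; the real constants inside WWW::Link may differ.
use constant {
    FLAG_BROKEN     => 1,
    FLAG_REDIRECTED => 2,
    FLAG_DISALLOWED => 4,
};

# Set the given bits (bitwise or).
sub add_flag    { my ($status, $flag) = @_; return $status | $flag }

# Clear the given bits (bitwise and with the complement).
sub remove_flag { my ($status, $flag) = @_; return $status & ~$flag }

# True if every bit in $flag is set in $status.
sub has_flag    { my ($status, $flag) = @_; return ($status & $flag) == $flag }

my $status = 0;
$status = add_flag($status, FLAG_BROKEN);
$status = add_flag($status, FLAG_REDIRECTED);
$status = remove_flag($status, FLAG_BROKEN);
print has_flag($status, FLAG_REDIRECTED) ? "redirected\n" : "not redirected\n";
```

Because each flag occupies its own bit, clearing one flag leaves the others untouched.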

status_change_time

Return the last time that the status field of the link was changed.

breakcount

Returns two times the number of times the link has been tested and found broken. This could in future turn into a fraction or something similar; the basic idea is that at around 10 you should start to think that the link is broken beyond recognition.

With an argument, sets the link's broken count. You shouldn't normally do this, so by default it also complains unless you've set the package variable I_know_what_im_up_to.

time_want_test

This tells you the time at which the link thinks it should next be tested. There are four regimes: normal testing, damaged link, broken link and abandoned link.

The time which controls the next time we want to be tested is the last time we were tested. This function doesn't worry about what the real time is now and will happily return times in the past.

normal testing

In the normal situation we have a time constant for each link and we do testing on the link at that time +- one day.

damaged link

The link has just been detected as damaged. We retest it repeatedly spread across a small number of days and then declare it broken.

broken link

The link has been declared broken. Now we test it occasionally just to verify if it has been repaired in the meantime.

abandoned link

We've detected and declared it broken, but no one has come along to look at it. It's still possible that outside influences repair the link in the meantime, so we keep checking it occasionally.

Please note, a link doesn't know anything about the present time, or when it is scheduled to be checked. The time it wants to be checked could be some time in the past.
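The regimes above can be sketched as a lookup from regime to retest interval. The interval values and the sub names below are assumptions made for illustration; the text does not state the intervals WWW::Link actually uses, and the sketch leaves out the one-day spread mentioned for normal testing.

```perl
use strict;
use warnings;

my $DAY = 24 * 60 * 60;

# Hypothetical retest intervals in seconds for each regime.
my %INTERVAL = (
    normal    => 10 * $DAY,
    damaged   => 1 * $DAY,
    broken    => 30 * $DAY,
    abandoned => 90 * $DAY,
);

# Return the absolute time at which a link in $regime, last tested at
# $last_test, would next want testing. Never consults the current time,
# so the result may lie in the past.
sub next_test_time {
    my ($regime, $last_test) = @_;
    my $interval = $INTERVAL{$regime};
    die "unknown regime: $regime" unless defined $interval;
    return $last_test + $interval;
}

# A scheduler, not the link, compares the result with the present time.
sub is_due {
    my ($regime, $last_test, $now) = @_;
    return next_test_time($regime, $last_test) <= $now;
}
```

Keeping the clock out of next_test_time mirrors the note above: the link only reports when it wants testing, and the caller decides whether that moment has arrived.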

$l->last_refresh([integer-time])

The last refresh is the last time the link was reported as in use by some user's resource. It must be updated every time the index to an infostructure is rebuilt, or else the link will eventually stop being checked (see LINK AGING).

add_redirect

This method adds information about a redirect from a given link.

Redirects can be a chain.

add_suggestion

Add a suggestion to the beginning of the list of suggested replacement links. If the same suggestion appears later in the list, delete it. Returns 1 if the suggested link is new.
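The move-to-front behaviour described here can be sketched on a plain array of url strings. The sub name add_suggestion_sketch is hypothetical and the code is not taken from the module; it only illustrates the described semantics.

```perl
use strict;
use warnings;

# Put $url at the front of @$list, dropping any later duplicate.
# Returns 1 if $url was not already in the list, 0 otherwise.
sub add_suggestion_sketch {
    my ($list, $url) = @_;
    my $before = @$list;                      # element count before filtering
    @$list = grep { $_ ne $url } @$list;      # remove an existing copy
    my $is_new = (@$list == $before) ? 1 : 0; # nothing removed means new
    unshift @$list, $url;
    return $is_new;
}
```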

redirects

Redirects stores or returns a reference to an array of redirects.

redirect_urls

redirect_urls returns redirections on a link in the form of urls (text strings, not objects). In a list context it returns the full chain of urls. In a scalar context it returns only the last url of the chain.
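The context-dependent return can be illustrated with Perl's wantarray. This is a sketch of the described behaviour on a plain array reference, not the module's code; the sub name and the example.com urls are made up.

```perl
use strict;
use warnings;

# $chain is an array reference of url strings, first hop first.
sub redirect_urls_sketch {
    my ($chain) = @_;
    # List context: the whole chain. Scalar context: only the last url.
    return wantarray ? @$chain : $chain->[-1];
}

my $chain = [ "http://example.com/a", "http://example.com/b" ];
my @all  = redirect_urls_sketch($chain);   # list context
my $last = redirect_urls_sketch($chain);   # scalar context
print scalar(@all), " hops, ending at $last\n";
```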

fix_suggestion

Fix suggestion is an array of suggestions for documents which might replace a broken link. These can be derived from all sorts of places and some are probably not correct. The aim is that they are in order from best guess to worst. You pass a reference to the new array.

all_suggestion

Returns a list consisting of all of the redirect and fix suggestions that have been made for that link.

url

Returns the url associated with this link.

failed_test

failed_test says that you have tested a link and think it's broken. Sometimes the link won't care (it has been tested recently and is waiting to give the resource time to come back in case it is just temporarily mislaid); mostly it will increase its broken value by two.

This also creates two reliability values. The long and short. These indicate how reliable the link has been over recent tests. The long value takes into account approximately the last 30 tests and the short takes into account approximately the last 7 tests with more weighting for more recent tests. A value of 1 means totally reliably working for all time and a value of -1 means totally broken for all time and anything in between is a lower value of certainty.

Probably a value less than about 0.5 is one to consider a problem, depending on how important the Link is to you.
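One common way to get values like these is an exponentially weighted average, sketched below. This is an assumption for illustration: the text does not say which formula WWW::Link uses, and the sub name, the starting value of 0 and the window sizes used as divisors are all hypothetical.

```perl
use strict;
use warnings;

# Exponentially weighted average: each new result (+1 pass, -1 fail)
# replaces 1/$window of the old value, so recent tests weigh more.
sub update_reliability {
    my ($old, $result, $window) = @_;
    return $old + ($result - $old) / $window;
}

my ($long, $short) = (0, 0);          # 0 is an assumed "no evidence" start
for my $result (1, 1, 1, -1, -1) {    # three passes, then two failures
    $long  = update_reliability($long,  $result, 30);
    $short = update_reliability($short, $result, 7);
}
printf "long %.3f short %.3f\n", $long, $short;
```

With this scheme the short value reacts faster to the two recent failures than the long value does, which is the kind of difference the two figures are meant to expose.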

redirections and failed tests

There are the following possibilities. A redirected link which ends in success is considered redirected. A redirected link which ends in failure should be considered broken. Finally, a failed link which was previously redirected should be considered broken, but the redirection should be remembered as a possible solution for the problem.

passed_test

This tells a link that it has been tested and found to be okay. It's an internal method generally and may change name.

N.B. this resets all other status flags. If you want to have a link which is okay but is redirected you must call found_redirected afterwards.

found_redirected

Tells the link that there is at least one layer of permanent redirections from its URL to the final object referred to. The urls in the source documents should be updated.

not_redirected

Tells the link that there are no redirections from its URL.

disallowed

Testing the link was attempted but it was disallowed, e.g. due to the robots exclusion protocol. The user should examine what's going on and either ignore it or get in touch with the site for permission to do link checking.

N.B. disallowed should only be called when we know that testing has been disallowed. Failure to access the resource at the end of a link should normally be seen as an error.

unsupported

Testing the link was attempted but it turns out that we don't know how... We just mark this as unsupported and the user can then think about sending in a patch to add the needed features to LinkController.

store_response ( <response>, <time_now>, <tester>, ..<tester data>)

This function is for storing the history of testing of the link so that we can look through it and find out what has been going on.

The <response> argument should be an HTTP response object representing the status of the tester and possibly synthesised by the tester. The time_now is the time the response is considered to have been processed.

Tester should be an identifier of the tester used to test the link. Normally this should be the class of the tester.

The tester data can be anything that the tester wants to store with the response.

N.B. mere storage of a response does not have any effect on a link.

recover_response (<integer>)

This function returns a previous response which has been applied to the link. In a scalar context it returns only the response. In an array context it will return the arguments which were given to store_response. The integer argument is the age of the response (its position in the history).

N.B. an age of 0 returns the most recently stored response.
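The storing and recovering of responses can be sketched with a newest-first array. The sub names, the module-level @history array, the tester name Hypothetical::Tester and the use of plain strings in place of HTTP response objects are all assumptions made to keep the example self-contained.

```perl
use strict;
use warnings;

my @history;   # newest entry first, so index 0 is age 0

# Record one test result; extra arguments are the tester's own data.
sub store_response_sketch {
    my ($response, $time, $tester, @tester_data) = @_;
    unshift @history, [ $response, $time, $tester, @tester_data ];
    return;
}

# Age 0 is the most recent entry. List context returns everything that
# was stored; scalar context returns only the response.
sub recover_response_sketch {
    my ($age) = @_;
    my $entry = $history[$age] or return;   # nothing stored at that age
    return wantarray ? @$entry : $entry->[0];
}

store_response_sketch("200 OK",        1000, "Hypothetical::Tester");
store_response_sketch("404 Not Found", 2000, "Hypothetical::Tester", "retry");
my $latest = recover_response_sketch(0);   # scalar context
my @older  = recover_response_sketch(1);   # list context
print "latest: $latest; older tested at $older[1]\n";
```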

STORING TEST COOKIE

The test cookie is any data which the tester wants to store to have available next time it tests this link. Testers should normally be very careful how they handle this value and expect that another tester could use the value differently. The normal way to cope with this is to be able to work without the cookie and, when storing the cookie, use an object which can then be identified easily.

If the cookie can support a time_want_test method, then this can be used to override the time the link should be tested. It will be called with a reference to the link.
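The override described above can be sketched with a capability check. The class My::Tester::Cookie, its seen field and the sub effective_test_time are hypothetical names used for illustration; only the time_want_test method name and the fact that it receives the link come from the text.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# A hypothetical cookie class that asks for a retest one hour after
# the time recorded in the cookie.
package My::Tester::Cookie {
    sub new            { my ($class, %args) = @_; return bless { %args }, $class }
    sub time_want_test { my ($self, $link) = @_; return $self->{seen} + 3600 }
}

# Use the cookie's answer when it offers one, else the default time.
sub effective_test_time {
    my ($link, $cookie, $default_time) = @_;
    if (blessed($cookie) && $cookie->can('time_want_test')) {
        return $cookie->time_want_test($link);
    }
    return $default_time;
}
```

Checking blessed and can before calling keeps the code safe when another tester has stored a plain scalar or an unrelated object as the cookie.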

DECLARING LINKS BROKEN

A link isn't signalled as broken until after it has been checked several times and found not working. The reason for this is quite simple. There are many WWW servers in the world which aren't reliably accessible. If a set of pages is checked at any given time, a fair number of links could seem to be broken, even when they will soon be repaired. In fact, in a well maintained set of pages (as I hope this package will let you have), these temporarily failing links will greatly outnumber the links that are actually broken.

LINK AGING

Links can age in two ways. Firstly, we can recognise them as broken and get bored of checking them. However, in this case, they stay around in the database, and are just checked very rarely (we never give up hope; there may be some reason why WE can't see a link and the user can't be bothered to solve it yet but does later).

The second method we use is keeping a refresh time in each link. This represents the last time some user told us that this link was in their infostructure. If the time since that refresh gets larger than a certain value (e.g. a month, but this must be site determined depending on the maintenance patterns of users), the link should no longer be checked.

If it gets larger than another value (which should be considerably larger than the first, say 6 months or a year) then the link can be retired from the database. Even if someone did turn out to be interested, the information would be so out of date as to be useless.
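The two refresh thresholds can be sketched as a small decision function. The threshold values follow the examples in the text (a month, a year) but are placeholders, and the sub and variable names are hypothetical.

```perl
use strict;
use warnings;

my $DAY = 24 * 60 * 60;

# Example thresholds only; the text says these must be chosen per site.
my $STOP_CHECKING_AFTER = 30 * $DAY;    # about a month
my $RETIRE_AFTER        = 365 * $DAY;   # about a year

# Decide what to do with a link given when a user last reported it in use.
sub aging_action {
    my ($last_refresh, $now) = @_;
    my $age = $now - $last_refresh;
    return "retire"        if $age > $RETIRE_AFTER;
    return "stop checking" if $age > $STOP_CHECKING_AFTER;
    return "check";
}
```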

SEE ALSO

WWW::Link::Reporter WWW::Link::Selector
