SD_SQL
Reimplementation of sd.pl, SD.pm and SDQ.pm using MySQL. Contains both recyc and guard.
Basic idea is to have a table (urldb) that contains most URLs ever inserted into the system, together with a lock (the guard function) and a boolean harvest flag. Also in this table is the host part together with its lock. URLs are selected from this table based on urllock, netloclock and harvest, and inserted into a queue (table que). URLs from this queue are then given out to harvesters.

The queue is implemented as follows. The admin table can be used to generate sequence numbers like this:

    mysql> update admin set queid=LAST_INSERT_ID(queid+1);

and used to extract the next URL from the queue:

    mysql> select host,url from que where queid=LAST_INSERT_ID();

When the queue is empty it is filled from table urldb. Several different algorithms can be used to fill it (round-robin, most URLs, longest time since harvest, ...). Since the harvest flag and guard lock are not updated until the actual harvest is done, it is OK to delete the queue and regenerate it at any time.
Questions, ideas, TODOs, etc.

Split table urldb into 2 tables, one for URLs and one for hosts? Less efficient when filling que; more efficient when updating netloclock.

Data structure, TABLE hosts:

    create table hosts(
        host varchar(50) not null default '',
        netloclock int not null,
        retries int not null default 0,
        ant int not null default 0,
        primary key (host),
        key (ant),
        key (netloclock)
    );

Handle too many retries?
Algorithm that takes a URL from the host that was accessed longest ago:

    ($hostid,$url)=SELECT hosts.host,url,id FROM hosts,urls WHERE
        hosts.hostlock < UNIX_TIMESTAMP() AND
        hosts.host=urls.host AND
        urls.urllock < UNIX_TIMESTAMP() AND
        urls.harvest=1 ORDER BY hostlock LIMIT 1;

Algorithm that takes a URL from the host with most URLs:

    ($hostid,$url)=SELECT hosts.host,url,id FROM hosts,urls WHERE
        hosts.hostlock < UNIX_TIMESTAMP() AND
        hosts.host=urls.host AND
        urls.urllock < UNIX_TIMESTAMP() AND
        urls.harvest=1 ORDER BY hosts.ant DESC LIMIT 1;

Algorithm that takes a URL from any available host:

    ($hostid,$url)=SELECT hosts.host,url,id FROM hosts,urls WHERE
        hosts.hostlock < UNIX_TIMESTAMP() AND
        hosts.host=urls.host AND
        urls.urllock < UNIX_TIMESTAMP() AND
        urls.harvest=1 LIMIT 1;
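The three selection strategies differ only in their ORDER BY clause, which the following sketch makes explicit. It is an illustration under stated assumptions rather than the module's implementation: it runs on Python's sqlite3 instead of MySQL, passes the current time as a parameter instead of calling UNIX_TIMESTAMP(), and follows the queries above in naming the host lock column hostlock (the table definition calls it netloclock). The urls table layout, the reading of ant as a per-host URL count, and the names make_schema, ORDERINGS and pick_url are hypothetical.

```python
import sqlite3

# One ORDER BY clause per strategy; the WHERE clause is shared by all three.
ORDERINGS = {
    "longest_since_access": "ORDER BY hosts.hostlock ASC",
    "most_urls": "ORDER BY hosts.ant DESC",
    "any": "",
}

def make_schema(conn):
    # Assumed split of urldb into a hosts table and a urls table.
    conn.executescript("""
        CREATE TABLE hosts (host TEXT PRIMARY KEY,
                            hostlock INTEGER NOT NULL DEFAULT 0,
                            ant INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE urls (id INTEGER PRIMARY KEY, host TEXT NOT NULL,
                           url TEXT NOT NULL,
                           urllock INTEGER NOT NULL DEFAULT 0,
                           harvest INTEGER NOT NULL DEFAULT 0);
    """)

def pick_url(conn, strategy, now):
    """Return (host, url, id) for one harvestable URL, or None if none is available."""
    sql = (
        "SELECT hosts.host, urls.url, urls.id "
        "FROM hosts JOIN urls ON hosts.host = urls.host "
        "WHERE hosts.hostlock < ? AND urls.urllock < ? AND urls.harvest = 1 "
        + ORDERINGS[strategy] + " LIMIT 1"
    )
    return conn.execute(sql, (now, now)).fetchone()
```

A caller would pick the strategy by name, for example pick_url(conn, "most_urls", int(time.time())) after importing time. Hosts whose lock lies in the future are skipped by every strategy.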
Anders Ardö <anders.ardo@it.lth.se>
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/