Search::Circa::Parser - provides functions for Circa to parse HTML pages
    use Search::Circa::Indexer;
    my $index = new Search::Circa::Indexer;
    $index->connect(...);
    $index->Parser->look_at({ url => url, idr => account });
This module uses HTML::Parser facilities. It is called by Search::Circa::Indexer to index each document. The main method is look_at.
Creates a new Circa::Parser object with the properties of the indexer instance.
Indexes a URL. The job done is:
Keys for refHashParameters:
URL to read
Id of the URL in the links table
Id of the account the URL belongs to
(optional) If this parameter is set, Circa does no work on this page if the page is older than the given date.
(optional) Local URL used to reach the file
(optional) If $categorieAuto is set to true, Circa will create and set the category of the URL from the directory structure found in it. Ex: http://www.alianwebserver.com/societe/stvalentin/index.html will create the category Societe / StValentin and assign it to this URL. If $categorieAuto is set to false, $categorie will be used.
(optional) Depth of the current link.
(optional) See $categorieAuto.
Returns (-1,0) if the URL is not valid; otherwise returns the number of words and the number of links found.
Sets the user agent for the Circa robot. If local is set to 0 or $self->{ConfigMoteur}->{'temporate'}==0, LWP::UserAgent will be used. Otherwise LWP::RobotUA is used.
Splits the data into words and stores them in the global %$RM with a score. The hash structure is ('mots'=>facteur), that is, word => weight.
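The word/score hash described above can be sketched as follows. This is only an illustration, not the module's code: the helper name add_words, the tokenizing rule, the minimum word length and the idea that scores accumulate across calls are all assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper: add each word of $text to %$scores with weight $factor.
sub add_words {
    my ($scores, $text, $factor) = @_;
    # Split on anything that is not a letter or a digit (assumed rule).
    for my $word (split /[^\p{L}\p{N}]+/, lc $text) {
        next if length($word) < 2;    # skip empty fields and single letters (assumption)
        $scores->{$word} += $factor;  # accumulate the weight per word
    }
    return $scores;
}

my %RM;
add_words(\%RM, 'Saint Valentin', 10);               # e.g. words from a title
add_words(\%RM, 'La Saint Valentin en societe', 1);  # e.g. words from the body
printf "%s => %d\n", $_, $RM{$_} for sort keys %RM;
```

A larger factor for titles than for body text is one plausible way to rank words; the weights Circa actually applies are not stated here.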
Method called for each HTML tag found in HTML pages.
Method called for the content of each tag in HTML pages.
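The two callbacks above correspond to the start-tag and text events of HTML::Parser. A minimal sketch using the handler (version 3) interface of HTML::Parser follows. The function scan_html and what it collects are hypothetical, and the module itself may use a different HTML::Parser interface.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical collector: gathers link targets and visible text.
sub scan_html {
    my ($html) = @_;
    my (@links, @texts);
    my $parser = HTML::Parser->new(
        api_version => 3,
        # Called for each start tag; receives the tag name and attribute hash.
        start_h => [ sub {
            my ($tag, $attr) = @_;
            push @links, $attr->{href} if $tag eq 'a' && defined $attr->{href};
        }, 'tagname, attr' ],
        # Called for each run of text; "dtext" is the entity-decoded text.
        text_h => [ sub {
            my ($text) = @_;
            push @texts, $text if $text =~ /\S/;
        }, 'dtext' ],
    );
    $parser->parse($html);
    $parser->eof;
    return (\@links, \@texts);
}

my ($links, $texts) = scan_html('<p>Hello <a href="/societe/">Societe</a></p>');
print "links: @$links\n";
print "texts: @$texts\n";
```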
Checks whether the URL $links will be added to Circa. The URL must begin with $self->host_indexed, and its extension must not be one of doc, zip, ps, gif, jpg, gz, pdf, eps, png, deb, xls, ppt, class, GIF, css, js, wav, mid.
If $links is accepted, returns the URL. Otherwise returns 0.
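The acceptance rule above can be sketched as a standalone function. The name accept_link and its argument order are hypothetical. Case-sensitive matching is an assumption drawn from the list containing both gif and GIF, and stripping the query string before checking the extension is also an assumption.

```perl
use strict;
use warnings;

# Extensions listed in the text above (matched case-sensitively here).
my %REJECTED = map { $_ => 1 } qw(
    doc zip ps gif jpg gz pdf eps png deb xls ppt class GIF css js wav mid
);

# Hypothetical helper: returns the URL if accepted, 0 otherwise.
sub accept_link {
    my ($host_indexed, $link) = @_;
    # The URL must begin with the indexed host prefix.
    return 0 unless index($link, $host_indexed) == 0;
    # Drop any query string or fragment before looking at the extension.
    (my $path = $link) =~ s/[?#].*//s;
    if ($path =~ /\.([^.\/]+)\z/) {
        return 0 if $REJECTED{$1};
    }
    return $link;
}

my $host = 'http://www.alianwebserver.com';
for my $url ("$host/societe/stvalentin/index.html",
             "$host/images/logo.gif",
             'http://example.org/index.html') {
    my $res = accept_link($host, $url);
    print "$url => ", ($res ? 'accepted' : 'rejected'), "\n";
}
```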
$Revision: 1.27 $
Alain BARBET alian@alianwebserver.com