NAME
XWI.pm - class for internal representation of a document record
SYNOPSIS
use Combine::XWI;
my $xwi = Combine::XWI->new;
#single value record variables
$xwi->server($server);
my $server = $xwi->server();
#original content
$xwi->content(\$html);
my $text = ${$xwi->content()};
#multiple value record variables
$xwi->meta_add($name1,$value1);
$xwi->meta_add($name2,$value2);
$xwi->meta_rewind;
my ($name,$content);
while (1) {
($name,$content) = $xwi->meta_get;
last unless $name;
}
DESCRIPTION
Provides methods for storing and retrieving structured records representing crawled documents.
METHODS
new()
XXX($val)
Saves $val under the name XXX using AUTOLOAD. The value can be retrieved later, e.g.
$xwi->MyVar('My value');
$t = $xwi->MyVar;
sets $t to 'My value'.
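The AUTOLOAD accessor pattern described above can be sketched in a few lines. This is a minimal illustration, not the Combine::XWI source: the package name Sketch::Record is hypothetical, and the real module may store its values differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, NOT the Combine::XWI implementation.
# It only illustrates how AUTOLOAD can provide accessors for any name.
package Sketch::Record;

our $AUTOLOAD;

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

sub AUTOLOAD {
    my ($self, @args) = @_;
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;                      # strip the package prefix
    return if $name eq 'DESTROY';           # do not treat destruction as a field
    $self->{$name} = $args[0] if @args;     # setter form: $obj->Name($val)
    return $self->{$name};                  # getter form: $obj->Name
}

package main;

my $rec = Sketch::Record->new;
$rec->MyVar('My value');
my $t = $rec->MyVar;
print "$t\n";    # prints "My value"
```

Because any method name is accepted, a misspelled name silently creates a new field instead of raising an error.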
*_reset()
Forget all values.
*_rewind()
Resets the read position so that the next *_get starts with the first value.
*_add()
Stores values in the data structure.
*_get()
Retrieves the next set of values from the data structure.
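The add/rewind/get cycle amounts to a list with a read cursor. The sketch below shows that idea with a hypothetical class Sketch::MultiField and hypothetical item_* method names; it is not the Combine::XWI implementation. It assumes that the get method returns an empty list once all values have been read, which is consistent with the "last unless $name" test in the SYNOPSIS but is not stated in the source.

```perl
use strict;
use warnings;

# Hypothetical stand-in class, NOT the Combine::XWI implementation.
package Sketch::MultiField;

sub new {
    my ($class) = @_;
    return bless { rows => [], pos => 0 }, $class;
}

# Append one set of values (one "row") to the list.
sub item_add {
    my ($self, @values) = @_;
    push @{ $self->{rows} }, [@values];
    return;
}

# Move the read cursor back to the first row.
sub item_rewind {
    my ($self) = @_;
    $self->{pos} = 0;
    return;
}

# Forget all rows.
sub item_reset {
    my ($self) = @_;
    $self->{rows} = [];
    $self->{pos}  = 0;
    return;
}

# Return the row under the cursor and advance; empty list at the end
# (assumed end-of-list signal).
sub item_get {
    my ($self) = @_;
    my $row = $self->{rows}[ $self->{pos} ];
    return unless defined $row;
    $self->{pos}++;
    return @$row;
}

package main;

my $field = Sketch::MultiField->new;
$field->item_add('author',   'A. Example');   # made-up sample values
$field->item_add('keywords', 'crawler');

$field->item_rewind;
my @seen;
while (my ($name, $value) = $field->item_get) {
    push @seen, "$name=$value";
}
print join(', ', @seen), "\n";
```

Rewinding before the loop matters: without it, a second pass over the same values would start at the end of the list and read nothing.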
meta_reset() / meta_rewind() / meta_add() / meta_get()
Stores the content of Meta-tags
Takes/Returns 2 parameters: Name, Content
$xwi->meta_add($name1,$value1);
$xwi->meta_add($name2,$value2);
$xwi->meta_rewind;
my ($name,$content);
while (1) {
($name,$content) = $xwi->meta_get;
last unless $name;
}
xmeta_reset() / xmeta_rewind() / xmeta_add() / xmeta_get()
Extended information from Meta-tags. Not used.
url_remove() / url_reset() / url_rewind() / url_add() / url_get()
Stores all URLs for this record (i.e. when the same page is reachable under multiple URLs)
Takes/Returns 1 parameter: URL
heading_reset() / heading_rewind() / heading_add() / heading_get()
Stores headings from HTML documents
Takes/Returns 1 parameter: Heading text
link_reset() / link_rewind() / link_add() / link_get()
Stores links from documents
Takes/Returns 5 parameters: URL, netlocid, urlid, Anchor text, Link type
robot_reset() / robot_rewind() / robot_add() / robot_get()
Stores calculated information, like genre, language, etc
Takes/Returns 2 parameters: Name, Value. Both are strings, with maximum lengths Name: 15, Value: 20
topic_reset() / topic_rewind() / topic_add() / topic_get()
Stores result of topic classification.
Takes/Returns 5 parameters: Class, Absolute score, Normalized score, Terms, Algorithm id
Class, Terms, and Algorithm id are strings; maximum lengths are Class: 50 and Algorithm id: 25
Absolute score and Normalized score are integers
Normalized score and Terms are optional and may be replaced with 0 and '' respectively
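The constraints on the five topic parameters can be checked before calling topic_add(). The helper below is a hypothetical sketch and not part of Combine::XWI: the function name normalize_topic_args and the sample values are made up, and whether the module itself rejects or truncates over-long strings is not stated in the source, so dying on a violation is an assumption of this sketch.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Combine::XWI. It applies the documented
# defaults and limits to the five topic parameters.
sub normalize_topic_args {
    my ($class, $abs_score, $norm_score, $terms, $alg_id) = @_;

    # Optional parameters: documented replacements are 0 and ''.
    $norm_score = 0  unless defined $norm_score;
    $terms      = '' unless defined $terms;

    # Documented maximum string lengths.
    die "Class longer than 50 characters\n"        if length($class)  > 50;
    die "Algorithm id longer than 25 characters\n" if length($alg_id) > 25;

    # Both scores are documented as integers.
    for my $score ($abs_score, $norm_score) {
        die "Score '$score' is not an integer\n"
            unless $score =~ /\A-?\d+\z/;
    }

    return ($class, $abs_score, $norm_score, $terms, $alg_id);
}

# Made-up sample values; Normalized score and Terms are left out.
my @args = normalize_topic_args('example.class', 42, undef, undef, 'example-alg');
print join('|', @args), "\n";    # prints "example.class|42|0||example-alg"
```

The returned list is in the documented parameter order, so it could be passed on as $xwi->topic_add(@args).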
SEE ALSO
Combine focused crawler main site http://combine.it.lth.se/
AUTHOR
Yong Cao <tsao@munin.ub2.lu.se> v0.05 1997-03-13
Anders Ardö, <anders.ardo@it.lth.se>
COPYRIGHT AND LICENSE
Copyright (C) 2005,2006 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/