
NAME

XWI.pm - class for internal representation of a document record

SYNOPSIS

 use Combine::XWI;
 my $xwi = Combine::XWI->new;

 #single value record variables
 $xwi->server($server);

 my $server = $xwi->server();

 #original content
 $xwi->content(\$html);

 my $text = ${$xwi->content()};

 #multiple value record variables
 $xwi->meta_add($name1,$value1);
 $xwi->meta_add($name2,$value2);

 $xwi->meta_rewind;
 my ($name,$content);
 while (1) {
  ($name,$content) = $xwi->meta_get;
  last unless $name;
 } 

DESCRIPTION

Provides methods for storing and retrieving structured records representing crawled documents.

METHODS

new()

Creates and returns a new Combine::XWI record object.

XXX($val)

Saves $val under the name XXX (any method name) using AUTOLOAD. The value can be retrieved later, e.g.

    $xwi->MyVar('My value');
    $t = $xwi->MyVar;

will set $t to 'My value'
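
The accessor behaviour described above can be illustrated with a minimal, self-contained sketch. The package Sketch::Record is a hypothetical stand-in written for this example; it is not the Combine::XWI source, and the real module may handle details (such as DESTROY or name validation) differently.

```perl
use strict;
use warnings;

package Sketch::Record;    # hypothetical stand-in, not Combine::XWI itself

our $AUTOLOAD;

sub new {
    my ($class) = @_;
    return bless {}, $class;
}

# Any unknown method name becomes a field: called with an argument it
# stores the value, called without one it returns the stored value.
sub AUTOLOAD {
    my ($self, @args) = @_;
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;                # strip the package prefix
    return if $name eq 'DESTROY';     # do not treat object cleanup as a field
    $self->{$name} = $args[0] if @args;
    return $self->{$name};
}

package main;

my $rec = Sketch::Record->new;
$rec->MyVar('My value');
my $t = $rec->MyVar;
print "$t\n";
```

One consequence of this design is that a misspelled accessor name does not raise an error; it simply reads or writes a different field.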

*_reset()

Forgets all stored values.

*_rewind()

Resets the read position so that the next *_get() call returns the first value.

*_add()

Stores values in the data structure.

*_get()

Retrieves the next value from the data structure.
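
The add/rewind/get/reset pattern amounts to a list with a read cursor. The following is a minimal sketch of that idea, assuming nothing about the real implementation: the package Sketch::List and its item_* method names are hypothetical, and returning an empty list at the end is a choice made for this example.

```perl
use strict;
use warnings;

package Sketch::List;    # hypothetical stand-in, not part of Combine

sub new {
    my ($class) = @_;
    return bless { items => [], pos => 0 }, $class;
}

# Append one record; a record may have several fields (e.g. name, content).
sub item_add {
    my ($self, @fields) = @_;
    push @{ $self->{items} }, [@fields];
    return;
}

# Move the read position back to the first record.
sub item_rewind {
    my ($self) = @_;
    $self->{pos} = 0;
    return;
}

# Forget all records.
sub item_reset {
    my ($self) = @_;
    $self->{items} = [];
    $self->{pos}   = 0;
    return;
}

# Return the fields of the next record, or an empty list when exhausted.
sub item_get {
    my ($self) = @_;
    return if $self->{pos} >= @{ $self->{items} };
    my $record = $self->{items}[ $self->{pos}++ ];
    return @$record;
}

package main;

my $list = Sketch::List->new;
$list->item_add('author',   'A. Writer');
$list->item_add('keywords', 'perl, crawler');

$list->item_rewind;
my @seen;
while ( my ($name, $content) = $list->item_get ) {
    push @seen, "$name=$content";
}
print join('; ', @seen), "\n";
```

Because the cursor lives in the object, two loops over the same list cannot be interleaved; calling the rewind method before each loop keeps the iteration predictable.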

meta_reset() / meta_rewind() / meta_add() / meta_get()

Stores the content of Meta-tags

Takes/Returns 2 parameters: Name, Content

 $xwi->meta_add($name1,$value1);
 $xwi->meta_add($name2,$value2);

 $xwi->meta_rewind;
 my ($name,$content);
 while (1) {
  ($name,$content) = $xwi->meta_get;
  last unless $name;
 } 

xmeta_reset() / xmeta_rewind() / xmeta_add() / xmeta_get()

Extended information from Meta-tags. Not used.

url_remove() / url_reset() / url_rewind() / url_add() / url_get()

Stores all URLs for this record (i.e. when the same page is known under multiple URLs)

Takes/Returns 1 parameter: URL

heading_reset() / heading_rewind() / heading_add() / heading_get()

Stores headings from HTML documents

Takes/Returns 1 parameter: Heading text

link_reset() / link_rewind() / link_add() / link_get()

Stores links from documents

Takes/Returns 5 parameters: URL, netlocid, urlid, Anchor text, Link type

robot_reset() / robot_rewind() / robot_add() / robot_get()

Stores calculated information, like genre, language, etc

Takes/Returns 2 parameters: Name, Value. Both are strings, with maximum lengths Name: 15, Value: 20
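
The text does not say what happens to strings longer than these limits, so a caller may want to check lengths before calling robot_add(). A minimal sketch of such a check follows; robot_pair_ok is a hypothetical helper written for this example and is not part of Combine::XWI. Only the limits 15 and 20 come from the description above.

```perl
use strict;
use warnings;

# Hypothetical caller-side check, not part of Combine::XWI.
# Returns 1 if both strings fit the documented limits, 0 otherwise.
sub robot_pair_ok {
    my ($name, $value) = @_;
    return 0 unless defined $name && defined $value;
    return ( length($name) <= 15 && length($value) <= 20 ) ? 1 : 0;
}

my $short_ok = robot_pair_ok('language', 'en');
my $long_ok  = robot_pair_ok('language', 'x' x 21);

print "short pair accepted: $short_ok\n";
print "long pair accepted: $long_ok\n";

# With a real record object one might then write:
#   $xwi->robot_add($name, $value) if robot_pair_ok($name, $value);
```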

topic_reset() / topic_rewind() / topic_add() / topic_get()

Stores result of topic classification.

Takes/Returns 5 parameters: Class, Absolute score, Normalized score, Terms, Algorithm id

Class, Terms, and Algorithm id are strings; maximum lengths are Class: 50 and Algorithm id: 25

Absolute score and Normalized score are integers

Normalized score and Terms are optional and may be replaced with 0 and '' respectively
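
Since topic_add() takes five positional parameters, two of which are optional, building the argument list in one place can help avoid mistakes. The sketch below shows one way to do that; topic_args and its hash keys are hypothetical names invented for this example, and the class and algorithm values are placeholders. Only the parameter order and the 0 and '' defaults come from the description above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Combine::XWI: returns the five values in
# the order Class, Absolute score, Normalized score, Terms, Algorithm id,
# substituting 0 and '' for the two optional fields when they are missing.
sub topic_args {
    my (%t) = @_;
    return (
        $t{class},
        $t{abs_score},
        defined $t{norm_score} ? $t{norm_score} : 0,
        defined $t{terms}      ? $t{terms}      : '',
        $t{algorithm},
    );
}

my @args = topic_args(
    class     => 'example.class',    # placeholder value
    abs_score => 42,
    algorithm => 'example-alg',      # placeholder value
);
print join('|', @args), "\n";

# With a real record object one might then write:
#   $xwi->topic_add(@args);
```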

SEE ALSO

Combine focused crawler main site http://combine.it.lth.se/

AUTHOR

Yong Cao <tsao@munin.ub2.lu.se> v0.05 1997-03-13

Anders Ardö, <anders.ardo@it.lth.se>

COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/
