<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html >
<head><title>Crawler internal operation</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<meta name="generator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)">
<meta name="originator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)">
<!-- html,2 -->
<meta name="src" content="DocMain.tex">
<meta name="date" content="2009-06-16 09:20:00">
<link rel="stylesheet" type="text/css" href="DocMain.css">
</head><body
>
<!--l. 1--><div class="crosslinks"><p class="noindent">[<a
href="DocMainse3.html" >prev</a>] [<a
href="DocMainse3.html#tailDocMainse3.html" >prev-tail</a>] [<a
href="#tailDocMainse4.html">tail</a>] [<a
href="DocMainpa1.html# " >up</a>] </p></div>
<h3 class="sectionHead"><span class="titlemark">4 </span> <a
id="x19-250004"></a>Crawler internal operation</h3>
<!--l. 3--><p class="noindent" >The system is designed for continuous operation. The harvester processes a URL in several
steps as detailed in Figure <a
href="#x19-250032">2<!--tex4ht:ref: combinearch --></a>. As a start-up initialization the frontier has to be seeded
with some relevant URLs. All URLs are normalized before they are entered in the
database. Data can be exported in various formats including the ALVIS XML document
format<span class="footnote-mark"><a
href="DocMain20.html#fn13x0"><sup class="textsuperscript">13</sup></a></span><a
id="x19-25001f13"></a> and
Dublin Core<span class="footnote-mark"><a
href="DocMain21.html#fn14x0"><sup class="textsuperscript">14</sup></a></span><a
id="x19-25002f14"></a>
records.
<!--l. 12--><p class="indent" > <hr class="figure"><div class="figure"
><table class="figure"><tr class="figure"><td class="figure"
>
<a
id="x19-250032"></a>
<div class="center"
>
<!--l. 14--><p class="noindent" >
<!--l. 15--><p class="noindent" ><img
src="DocMain1x.png" alt="PIC" class="graphics" width="192.84303pt" height="392.65201pt" ><!--tex4ht:graphics
name="DocMain1x.png" src="CrawlerArchitecture.xfig.eps"
--></div>
<br /> <table class="caption"
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure 2: </td><td
class="content">Architecture for the Combine focused crawler.</td></tr></table><!--tex4ht:label?: x19-250032 -->
<!--l. 20--><p class="indent" > </td></tr></table></div><hr class="endfigure">
<!--l. 22--><p class="indent" > The steps taken during crawling (numbers refer to Figure <a
href="#x19-250032">2<!--tex4ht:ref: combinearch --></a>):
<ol class="enumerate1" >
<li
class="enumerate" id="x19-25005x1">The next URL is fetched from the scheduler.
</li>
<li
class="enumerate" id="x19-25007x2">Combine obeys the Robots Exclusion Protocol<span class="footnote-mark"><a
href="DocMain22.html#fn15x0"><sup class="textsuperscript">15</sup></a></span><a
id="x19-25008f15"></a>.
Rules are cached locally.
</li>
<li
class="enumerate" id="x19-25010x3">The page is retrieved using a GET, GET-IF-MODIFIED, or HEAD HTTP request.
</li>
<li
class="enumerate" id="x19-25012x4">The HTML code is cleaned and normalized.
</li>
<li
class="enumerate" id="x19-25014x5">The character-set is detected and normalized to UTF-8.
</li>
<li
class="enumerate" id="x19-25016x6">
<ol class="enumerate2" >
<li
class="enumerate" id="x19-25018x1">The page (in any of the formats PDF, PostScript, MSWord, MSExcel,
MSPowerPoint, RTF and TeX/LaTeX) is converted to HTML or plain text by
an external program.
</li>
<li
class="enumerate" id="x19-25020x2">Internal parsers handles HTML, plain text and images. This step extracts
structured information like metadata (title, keywords, description ...), HTML
links, and text without markup.</li></ol>
</li>
<li
class="enumerate" id="x19-25022x7">The document is sent to the topic filter (see section <a
href="#x19-300004.5">4.5<!--tex4ht:ref: autoclass --></a>). If the Web-page is relevant with
respect to the focus topic, processing continues with:
<ol class="enumerate2" >
<li
class="enumerate" id="x19-25024x1">Heuristics like score propagation.
</li>
<li
class="enumerate" id="x19-25026x2">Further analysis, like genre and language identification.
</li>
<li
class="enumerate" id="x19-25028x3">Updating the record database.
</li>
<li
class="enumerate" id="x19-25030x4">Updating the frontier database with HTML links and URLs extracted from
plain text.
</li></ol>
</li></ol>
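The crawling steps above can be condensed into a simplified processing loop. The following Python sketch is illustrative only; the callable arguments stand in for Combine's internal components and are not its actual API:

```python
def process_url(url, frontier, records, fetch, robots_allowed, parse, is_relevant):
    """One pass of the harvester pipeline (steps 1-7 above), sketched.

    All callables are hypothetical stand-ins for Combine's internals.
    """
    # Step 2: obey the Robots Exclusion Protocol (rules assumed cached).
    if not robots_allowed(url):
        return None
    # Step 3: retrieve the page (GET / GET-IF-MODIFIED / HEAD).
    page = fetch(url)
    # Steps 4-6: clean-up, charset normalization, format conversion and
    # parsing are folded into parse(), which returns a structured record.
    record = parse(page)
    # Step 7: only pages passing the topic filter update the databases.
    if is_relevant(record):
        records[url] = record                      # record database
        frontier.extend(record.get("links", []))   # frontier database
        return record
    return None
```

In the real crawler the record and frontier databases are MySQL tables and URL/host locking applies; the sketch ignores those details.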
<!--l. 96--><p class="indent" > Depending on several factors, such as configuration, hardware, network, and workload, the crawler
normally processes between 50 and 200 URLs per minute.
<h4 class="subsectionHead"><span class="titlemark">4.1 </span> <a
id="x19-260004.1"></a>URL selection criteria</h4>
<!--l. 103--><p class="noindent" >In order to successfully select and crawl one URL the following conditions (in this order) have to
be met:
<ol class="enumerate1" >
<li
class="enumerate" id="x19-26002x1">The URL has to be selected by the scheduling algorithm (section <a
href="#x19-290004.4">4.4<!--tex4ht:ref: sched --></a>).<br
class="newline" />
<!--l. 109--><p class="noindent" ><span
class="ecti-1095">Relevant configuration</span>
<span
class="ecti-1095">variables: </span>WaitIntervalHost (section <a
href="DocMainse9.html#x42-1100009.1.39">9.1.39<!--tex4ht:ref: WaitIntervalHost --></a>), WaitIntervalHarvesterLockRobotRules
(section <a
href="DocMainse9.html#x42-1070009.1.36">9.1.36<!--tex4ht:ref: WaitIntervalHarvesterLockRobotRules --></a>), WaitIntervalHarvesterLockSuccess (section <a
href="DocMainse9.html#x42-1080009.1.37">9.1.37<!--tex4ht:ref: WaitIntervalHarvesterLockSuccess --></a>)
</li>
<li
class="enumerate" id="x19-26004x2">The URL has to pass the allow test.
<!--l. 116--><p class="noindent" ><span
class="ecti-1095">Relevant configuration variables: </span>allow (section <a
href="DocMainse9.html#x42-1170009.2.1">9.2.1<!--tex4ht:ref: allow --></a>)
</li>
<li
class="enumerate" id="x19-26006x3">The URL is not be excluded by the exclude test (see section <a
href="#x19-280004.3">4.3<!--tex4ht:ref: urlfilt --></a>).
<!--l. 121--><p class="noindent" ><span
class="ecti-1095">Relevant configuration variables: </span>exclude (section <a
href="DocMainse9.html#x42-1200009.2.4">9.2.4<!--tex4ht:ref: exclude --></a>)
</li>
<li
class="enumerate" id="x19-26008x4">The Robot Exclusion Protocol has to allow crawling of the URL.
</li>
<li
class="enumerate" id="x19-26010x5">Optionally the document at the URL location has to pass the topic filter (section
<a
href="#x19-300004.5">4.5<!--tex4ht:ref: autoclass --></a>).
<!--l. 129--><p class="noindent" ><span
class="ecti-1095">Relevant configuration variables: </span>classifyPlugIn (section <a
href="DocMainse9.html#x42-750009.1.4">9.1.4<!--tex4ht:ref: classifyPlugIn --></a>), doCheckRecord
(section <a
href="DocMainse9.html#x42-780009.1.7">9.1.7<!--tex4ht:ref: doCheckRecord --></a>).
</li></ol>
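The ordered conditions can be expressed as a short predicate. This is a hypothetical Python sketch; the actual tests are driven by the configuration variables named above, and the topic filter (condition 5) is applied later, to the fetched document:

```python
import re

def may_crawl(url, scheduled, allow_patterns, exclude_patterns, robots_allowed):
    """Apply selection conditions 1-4 in order; condition 5 (the topic
    filter) runs on the document after it has been fetched."""
    if not scheduled(url):                                   # 1. scheduler
        return False
    if not any(re.search(p, url) for p in allow_patterns):   # 2. allow test
        return False
    if any(re.search(p, url) for p in exclude_patterns):     # 3. exclude test
        return False
    if not robots_allowed(url):                              # 4. robots.txt
        return False
    return True
```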
<!--l. 135--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.2 </span> <a
id="x19-270004.2"></a>Document parsing and information extraction</h4>
<!--l. 137--><p class="noindent" >Each document is parsed and analyzed by the crawler in order to store structured document
records in the internal MySQL database. The structure of the record includes the
fields:
<ul class="itemize1">
<li class="itemize">Title
</li>
<li class="itemize">Headings
</li>
<li class="itemize">Metadata
</li>
<li class="itemize">Plain text
</li>
<li class="itemize">Original document
</li>
<li class="itemize">Links – HTML and plain text URLs
</li>
<li class="itemize">Link anchor text
</li>
<li class="itemize">Mime-Type
</li>
<li class="itemize">Dates – modification, expire, and last checked by crawler
</li>
<li class="itemize">Web-server identification</li></ul>
<!--l. 153--><p class="indent" > Optionally, some extra analysis can be done; see section <a
href="#x19-380004.8">4.8<!--tex4ht:ref: analysis --></a>.
<!--l. 155--><p class="indent" > The system selects a document parser based on the Mime-Type together with available
parsers and converter programs.
<ol class="enumerate1" >
<li
class="enumerate" id="x19-27002x1">For some mime-types an external program is called in order to convert the document
to a format handled internally (HTML or plain text).
<!--l. 161--><p class="noindent" ><span
class="ecti-1095">Relevant configuration variables: </span>converters (section <a
href="DocMainse9.html#x42-1190009.2.3">9.2.3<!--tex4ht:ref: converters --></a>)
</li>
<li
class="enumerate" id="x19-27004x2">Internal parsers handle HTML, plain text, TeX, and Image.
<!--l. 165--><p class="noindent" ><span
class="ecti-1095">Relevant configuration variables: </span>converters (section <a
href="DocMainse9.html#x42-1190009.2.3">9.2.3<!--tex4ht:ref: converters --></a>)
</li></ol>
<!--l. 169--><p class="indent" > Supporting a new document format is as easy as providing a program that can convert a
document in this format to HTML or plain text. Configuration of the mapping between
document format (Mime-Type) and converter program is done in the complex configuration
variable ’converters’ (section <a
href="DocMainse9.html#x42-1190009.2.3">9.2.3<!--tex4ht:ref: converters --></a>).
<!--l. 173--><p class="indent" > Out of the box, Combine handles the following document formats: plain text, HTML, PDF,
PostScript, MSWord, MSPowerPoint, MSExcel, RTF, TeX, and images.
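The Mime-Type to parser/converter mapping can be pictured as a dispatch table. The converter command names below are illustrative examples only, not necessarily Combine's configured defaults (those live in the 'converters' variable, section 9.2.3):

```python
# Hypothetical dispatch table: Mime-Type -> external converter command.
CONVERTERS = {
    "application/pdf": "pdftohtml",
    "application/msword": "antiword",
    "application/postscript": "ps2ascii",
}
# Formats handled by the built-in parsers, no external program needed.
INTERNAL = {"text/html", "text/plain"}

def select_parser(mime_type):
    """Return ('internal', None) or ('external', command) for a Mime-Type,
    or ('unsupported', None) when no parser or converter is configured."""
    if mime_type in INTERNAL:
        return ("internal", None)
    if mime_type in CONVERTERS:
        return ("external", CONVERTERS[mime_type])
    return ("unsupported", None)
```

Supporting a new format amounts to adding one entry to the table, which mirrors how the 'converters' configuration variable works.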
<!--l. 190--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.3 </span> <a
id="x19-280004.3"></a>URL filtering</h4>
<!--l. 192--><p class="noindent" >Before a URL is accepted for scheduling (either by manual loading or recycling) it is normalized
and validated. This process comprises a number of steps:
<ul class="itemize1">
<li class="itemize">Normalization
<ul class="itemize2">
<li class="itemize">General practice: host-name lowercasing, port-number substitution, canonical
URL
</li>
<li class="itemize">Removing fragments (ie ’#’ and everything after that)
</li>
<li class="itemize">Cleaning CGI repetitions of parameters
</li>
<li class="itemize">Collapsing dots (’./’, ’../’) in the path
</li>
<li class="itemize">Removing CGI parameters that are session ids, as identified by patterns in the
configuration variable sessionids (section <a
href="DocMainse9.html#x42-1220009.2.6">9.2.6<!--tex4ht:ref: sessionids --></a>)
</li>
<li class="itemize">Normalizing Web-server names by resolving aliases. Identified by patterns in
the configuration variable serveralias (section <a
href="DocMainse9.html#x42-1210009.2.5">9.2.5<!--tex4ht:ref: serveralias --></a>). These patterns can be
generated by using the program <span
class="ectt-1095">combineUtil </span>to analyze a crawled corpus.</li></ul>
</li>
<li class="itemize">Validation: A URL has to pass all three validation steps outlined below.
<ul class="itemize2">
<li class="itemize">URL length has to be less than configuration variable maxUrlLength (section
<a
href="DocMainse9.html#x42-860009.1.15">9.1.15<!--tex4ht:ref: maxUrlLength --></a>)
</li>
<li class="itemize">Allow test: one of the Perl regular expressions in the configuration variable allow
(section <a
href="DocMainse9.html#x42-1170009.2.1">9.2.1<!--tex4ht:ref: allow --></a>) must match the URL
</li>
<li class="itemize">Exclude test: none of the Perl regular expressions in the configuration variable
exclude (section <a
href="DocMainse9.html#x42-1200009.2.4">9.2.4<!--tex4ht:ref: exclude --></a>) must match the URL
</li></ul>
<!--l. 231--><p class="noindent" >Both allow and exclude can contain two types of regular expressions identified by either
’<span
class="ectt-1095">HOST:</span>’ or ’<span
class="ectt-1095">URL</span>’ in front of the regular expression. The ’<span
class="ectt-1095">HOST:</span>’ regular expressions are
matched only against the Web-server part of the URL while the ’<span
class="ectt-1095">URL</span>’ regular expressions
are matched against the entire URL.</li></ul>
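A few of the normalization and validation steps can be sketched in Python. This is an illustrative subset: ’..’ collapsing, CGI parameter cleaning, session-id removal, and server-alias resolution are omitted:

```python
import re
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalize(url):
    """Illustrative subset of the normalization steps listed above."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()          # lowercase the host name
    if parts.port and parts.port != DEFAULT_PORTS.get(parts.scheme):
        host = "%s:%d" % (host, parts.port)        # keep only non-default ports
    path = parts.path or "/"
    while "/./" in path:                           # collapse './' in the path
        path = path.replace("/./", "/")
    # Fragments ('#...') are dropped by passing '' as the last component.
    return urlunsplit((parts.scheme, host, path, parts.query, ""))

def matches(rule, url, host):
    """'HOST:' rules match the server part only; 'URL' rules the whole URL."""
    if rule.startswith("HOST:"):
        return re.search(rule[len("HOST:"):], host) is not None
    if rule.startswith("URL"):
        rule = rule[len("URL"):]
    return re.search(rule, url) is not None

def validate(url, allow, exclude, max_len=250):
    """The three validation steps: length limit, allow test, exclude test."""
    host = urlsplit(url).netloc
    return (len(url) < max_len
            and any(matches(r, url, host) for r in allow)
            and not any(matches(r, url, host) for r in exclude))
```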
<!--l. 238--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.4 </span> <a
id="x19-290004.4"></a>Crawling strategy</h4>
<!--l. 240--><p class="noindent" >The crawler is designed to run continuously in order to keep crawled databases as up-to-date as
possible. Starting and halting crawling is done manually. The configuration variable
AutoRecycleLinks (section <a
href="DocMainse9.html#x42-730009.1.2">9.1.2<!--tex4ht:ref: AutoRecycleLinks --></a>) determines if the crawler should follow all valid new links or
just take those that already are marked for crawling.
<!--l. 247--><p class="indent" > All links from a relevant document are extracted, normalized and stored in the structured
record. Those links that pass the selection/validation criteria outlined below are marked for
crawling.
<!--l. 251--><p class="indent" >To mark a URL for crawling requires:
<ul class="itemize1">
<li class="itemize">The URL should be from a page that is relevant (i.e. pass the focus filter).
</li>
<li class="itemize">The URL scheme must be one of HTTP, HTTPS, or FTP.
</li>
<li class="itemize">The URL must not exceed the maximum length (configurable, default 250
characters).
</li>
<li class="itemize">It should pass the ’allow’ test (configurable, default all URLs passes).
</li>
<li class="itemize">It should pass the ’exclude’ test (configurable, default excludes malformed URLs,
some CGI pages, and URLs with file-extensions for binary formats).</li></ul>
<!--l. 260--><p class="indent" > At each scheduling point one URL from each available (unlocked) host is selected to generate
a ready queue, which is then processed completely before a new scheduling is done. Each selected
URL in the ready queue thus fulfills these requirements:
<ul class="itemize1">
<li class="itemize">URL must be marked for crawling (see above).
</li>
<li class="itemize">URL must be unlocked (each successful access to a URL locks it for a configurable
time WaitIntervalHarvesterLockSuccess (section <a
href="DocMainse9.html#x42-1080009.1.37">9.1.37<!--tex4ht:ref: WaitIntervalHarvesterLockSuccess --></a>)).
</li>
<li class="itemize">Host of the URL must be unlocked (each access to a host locks it for a configurable
time WaitIntervalHost (section <a
href="DocMainse9.html#x42-1100009.1.39">9.1.39<!--tex4ht:ref: WaitIntervalHost --></a>)).</li></ul>
<!--l. 271--><p class="indent" > This implements a variant of breadth-first crawling where a page is fetched only if a
certain time threshold has passed since the last access to the server of that
page.
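The ready-queue construction can be sketched as follows (hypothetical; in Combine the lock durations come from WaitIntervalHost and WaitIntervalHarvesterLockSuccess):

```python
from urllib.parse import urlsplit

def build_ready_queue(marked_urls, locked_urls, locked_hosts):
    """Pick at most one unlocked URL per unlocked host, as in section 4.4.

    marked_urls: URLs already marked for crawling.
    locked_urls / locked_hosts: URLs and hosts whose lock interval has
    not yet expired (lock bookkeeping itself is omitted from the sketch).
    """
    queue, hosts_taken = [], set()
    for url in marked_urls:
        host = urlsplit(url).netloc
        if url in locked_urls or host in locked_hosts or host in hosts_taken:
            continue
        queue.append(url)       # one URL per host per scheduling point
        hosts_taken.add(host)
    return queue
```

The returned queue is then processed completely before the next scheduling point, which is what spaces out successive accesses to the same server.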
<!--l. 275--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.5 </span> <a
id="x19-300004.5"></a>Built-in topic filter – automated subject classification using string matching</h4>
<!--l. 277--><p class="noindent" >The built-in topic filter is an approach to automated classification that uses a topic definition
with a pre-defined controlled vocabulary of topical terms to determine relevance.
Thus it does not rely on a particular set of seed pages or on a collection of pre-classified example
pages to learn from. It does, however, require that some of the seed pages are relevant and contain links
into the topical area. One simple way of creating a set of seed pages is to use terms from
the controlled vocabulary as queries for a general-purpose search engine and take the results as
seed pages.
<!--l. 287--><p class="indent" > The system for automated topic classification (overview in Figure <a
href="#x19-300013">3<!--tex4ht:ref: topicfilter --></a>), that determines topical
relevance in the topical filter, is based on matching subject terms from a controlled vocabulary in
a topic definition with the text of the document to be classified <span class="cite">[<a
href="DocMainli2.html#Xardo99:_online99">3</a>]</span>. The topic definition uses
subject classes in a hierarchical classification system (corresponding to topics) and terms
associated with each subject class. Terms can be single words, phrases, or Boolean
AND-expressions connecting terms. Boolean OR-expressions are implicitly handled
by having several different terms associated with the same subject class (see section
<a
href="#x19-310004.5.1">4.5.1<!--tex4ht:ref: termlist --></a>).
<!--l. 299--><p class="indent" > The algorithm works by string-to-string matching of terms and text in documents. Each time
a match is found the document is awarded points based on which term is matched and in which
structural part of the document (location) the match is found <span class="cite">[<a
href="DocMainli2.html#Xardo05:_ECDL">10</a>]</span>. The points are summed to
make the final relevance score of the document. If the score is above a cut-off value the
document is saved in the database together with a (list of) subject classification(s) and
term(s).
<!--l. 308--><p class="indent" > <hr class="figure"><div class="figure"
><table class="figure"><tr class="figure"><td class="figure"
>
<a
id="x19-300013"></a>
<div class="center"
>
<!--l. 309--><p class="noindent" >
<!--l. 312--><p class="noindent" ><img
src="DocMain2x.png" alt="PIC" class="graphics" width="426.79134pt" height="219.97188pt" ><!--tex4ht:graphics
name="DocMain2x.png" src="TopicFilter.xfig.eps"
--></div>
<br /> <table class="caption"
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure 3: </td><td
class="content">Overview of the automated topic classification algorithm</td></tr></table><!--tex4ht:label?: x19-300013 -->
<!--l. 316--><p class="indent" > </td></tr></table></div><hr class="endfigure">
<!--l. 318--><p class="indent" > By providing a list of known relevant sites in the configuration file <span
class="ectt-1095">sitesOK.txt </span>(located in
the job-specific configuration directory) the above test can be bypassed. The host part of the URL
is checked against the list of known relevant sites; if a match is found, the page is validated and
saved in the database regardless of the outcome of the classification
algorithm.
<h5 class="subsubsectionHead"><span class="titlemark">4.5.1 </span> <a
id="x19-310004.5.1"></a>Topic definition</h5>
<!--l. 326--><p class="noindent" >Located in <span
class="ectt-1095">/etc/combine/<jobname>/topicdefinition.txt</span>. Topic definitions use triplets
(term, relevance weight, topic-classes) as their basic entities. Weights are signed integers and
indicate the relevance of the term with respect to the topic-classes. Higher values indicate more
relevant terms. A large negative value can be used to exclude documents containing that term.
Terms should be all lowercase.
<!--l. 334--><p class="indent" > Terms can be:
<ul class="itemize1">
<li class="itemize">single words
</li>
<li class="itemize">a phrase (i.e. all words in exact order)
</li>
<li class="itemize">a Boolean AND-expression connecting terms (i.e. all terms must be present but in
any order). The Boolean AND operator is encoded as ’@and’.</li></ul>
<!--l. 341--><p class="noindent" >A Boolean OR-expression has to be entered as separate term triplets. The Boolean expression
“<span
class="ectt-1095">polymer AND (atactic OR syndiotactic)</span>” thus has to be translated into two separate
triplets, one containing the term “<span
class="ectt-1095">polymer @and atactic</span>”, and another with “<span
class="ectt-1095">polymer @and</span>
<span
class="ectt-1095">syndiotactic</span>”.
<!--l. 347--><p class="indent" > Terms can include (Perl) regular expressions like:
<ul class="itemize1">
<li class="itemize">a ’<span
class="ectt-1095">?</span>’ makes the character immediately preceding optional, i.e. the term “<span
class="ectt-1095">coins?</span>” will
match both “<span
class="ectt-1095">coin</span>” and “<span
class="ectt-1095">coins</span>”
</li>
<li class="itemize">a “<span
class="ectt-1095">[</span><img
src="DocMain3x.png" alt="ˆ " class="circ" ><span
class="cmsy-10x-x-109">\</span><span
class="ectt-1095">s]*</span>” is truncation (matches all character sequences except space ’ ’),<br
class="newline" />“<span
class="ectt-1095">glass art[</span><img
src="DocMain4x.png" alt="ˆ " class="circ" ><span
class="cmsy-10x-x-109">\</span><span
class="ectt-1095">s]*</span>” will match “<span
class="ectt-1095">glass art</span>”, “<span
class="ectt-1095">glass arts</span>”, “<span
class="ectt-1095">glass artists</span>”,
“<span
class="ectt-1095">glass articles</span>”, and so on.</li></ul>
<!--l. 358--><p class="indent" > It is important to understand that each triplet in the topic definition is considered by itself
without any context, so they must <span
class="ecbx-1095">each </span>be topic- or sub-class specific in order to be
useful. Subject-neutral terms like “use”, “test”, and “history” should not be used. If really
needed, they have to be qualified so that they become topic-specific (see examples
below).
<!--l. 366--><p class="indent" >Simple guidelines for creating the triplets and assigning weights are:
<ul class="itemize1">
<li class="itemize">Phrases or unique, topic-specific terms, should be used if possible, and assigned the
highest weights, since they normally are most discriminatory.
</li>
<li class="itemize">Boolean AND-expressions are the next best.
</li>
<li class="itemize">Single words can be too general and/or have several meanings or uses that make
them less specific; such terms should thus be assigned small weights.
</li>
<li class="itemize">Acronyms can be used as terms if they are unique.
</li>
<li class="itemize">Negative weights should be used in order to exclude concepts.</li></ul>
<!--l. 380--><p class="noindent" >
<h5 class="subsubsectionHead"><span class="titlemark">4.5.2 </span> <a
id="x19-320004.5.2"></a>Topic definition (term triplets) BNF grammar</h5>
<!--l. 381--><p class="noindent" >
<table
class="verbatim"><tr class="verbatim"><td
class="verbatim"><div class="verbatim">
TERM-LIST :== TERM-ROW ’&lt;cr&gt;’ || ’#’ &lt;char&gt;+ ’&lt;cr&gt;’ || ’&lt;cr&gt;’
 <br />TERM-ROW :== WEIGHT ’: ’ TERMS ’=’ CLASS-LIST
 <br />WEIGHT :== [’-’]&lt;integer&gt;
 <br />TERMS :== TERM [’ @and ’ TERMS]*
 <br />TERM :== WORD ’ ’ [WORD]*
 <br />WORD :== &lt;char&gt;+ || &lt;char&gt;+&lt;perl-reg-exp&gt;
 <br />CLASS-LIST :== CLASSID [’,’ CLASS-LIST]
 <br />CLASSID :== &lt;char&gt;+
</div>
</td></tr></table>
<!--l. 391--><p class="indent" > A line that starts with ’#’ is ignored, as are empty lines.
<!--l. 393--><p class="indent" > <span
class="cmmi-10x-x-109"><</span><span
class="ecbx-1095">perl-reg-exp</span><span
class="cmmi-10x-x-109">> </span>is only supported by the plain matching algorithm described in section
<a
href="#x19-340004.5.4">4.5.4<!--tex4ht:ref: std --></a>.
<!--l. 396--><p class="indent" > “CLASSID” is a topic (sub-)class specifier, often from a hierarchical classification system like Engineering
Index<span class="footnote-mark"><a
href="DocMain23.html#fn16x0"><sup class="textsuperscript">16</sup></a></span><a
id="x19-32001f16"></a>.
<h5 class="subsubsectionHead"><span class="titlemark">4.5.3 </span> <a
id="x19-330004.5.3"></a>Term triplet examples</h5>
<table
class="verbatim"><tr class="verbatim"><td
class="verbatim"><div class="verbatim">
50: optical glass=A.14.5, D.2.2
 <br />30: glass @and fiberoptics=D.2.2.8
 <br />50: glass @and technical @and history=D.2
 <br />50: ceramic materials @and glass=D.2.1.7
 <br />-10000: glass @and art=A
</div>
</td></tr></table>
<!--l. 407--><p class="nopar" >
<!--l. 409--><p class="indent" > The first line says that a document containing the term “<span
class="ectt-1095">optical glass</span>” should be awarded
50 points for each of the two classes A.14.5 and D.2.2.
<!--l. 413--><p class="indent" > “<span
class="ectt-1095">glass</span>” as a single term is probably too general, qualify it with more terms like: “<span
class="ectt-1095">glass @and</span>
<span
class="ectt-1095">fiberoptics</span>” , or “<span
class="ectt-1095">glass @and technical @and history</span>” or use a phrase like “<span
class="ectt-1095">glass fiber</span>”
or “<span
class="ectt-1095">optical glass</span>”.
<!--l. 417--><p class="indent" > In order to exclude documents about artistic use of glass the term “<span
class="ectt-1095">glass @and art</span>” can be
used with a (high) negative score.
<!--l. 420--><p class="indent" > An example from the topic definition for ’Carnivorous Plants’ using regular expressions is
given below:
<table
class="verbatim"><tr class="verbatim"><td
class="verbatim"><div class="verbatim">
#This is a comment
 <br />75: d\.?\s*californica=CP.Drosophyllum
 <br />10: pitcher[^\s]*=CP
 <br />-10: pitcher[^\s]* @and baseball=CP
</div>
</td></tr></table>
<!--l. 427--><p class="nopar" > The term “<span
class="ectt-1095">d</span><span
class="cmsy-10x-x-109">\</span><span
class="ectt-1095">.?</span><span
class="cmsy-10x-x-109">\</span><span
class="ectt-1095">s*californica</span>” will match <span
class="ectt-1095">D californica, D. californica,</span>
<span
class="ectt-1095">D.californica </span>etc.
<!--l. 431--><p class="indent" > The last two lines ensure that a document containing “<span
class="ectt-1095">pitcher</span>” gets 10 points, but if the
document also contains “<span
class="ectt-1095">baseball</span>” the points are removed.
<!--l. 434--><p class="noindent" >
<h5 class="subsubsectionHead"><span class="titlemark">4.5.4 </span> <a
id="x19-340004.5.4"></a>Algorithm 1: plain matching</h5>
<!--l. 437--><p class="noindent" >This algorithm is selected by setting the configuration parameter<br
class="newline" /><span class="obeylines-h"><span class="verb"><span
class="ectt-1095"> </span><span
class="ectt-1095"> </span><span
class="ectt-1095"> </span><span
class="ectt-1095"> classifyPlugIn</span><span
class="ectt-1095"> =</span><span
class="ectt-1095"> Combine::Check_record</span></span></span>
<!--l. 440--><p class="indent" > The algorithm produces a list of suggested topic-classes (subject classifications) and
corresponding relevance scores using the algorithm:<center class="par-math-display" >
<img
src="DocMain5x.png" alt="Relevance_score =
" class="par-math-display" ></center>
<!--l. 444--><p class="nopar" >
<center class="math-display" >
<img
src="DocMain6x.png" alt=" ( )
∑ ∑
( (hits[locationj][term i]* weight[termi ]*weight [locationj]))
all locations all terms
" class="math-display" ></center>
<!--l. 446--><p class="nopar" >
<!--l. 448--><p class="indent" >
<dl class="description"><dt class="description">
<span
class="ecbx-1095">term weight</span> </dt><dd
class="description">(<span
class="cmmi-10x-x-109">weight</span><span
class="cmr-10x-x-109">[</span>term<sub><span
class="cmmi-8">i</span></sub><span
class="cmr-10x-x-109">]</span>) is taken from the topic definition triplets.
</dd><dt class="description">
<span
class="ecbx-1095">location weight</span> </dt><dd
class="description">(<span
class="cmmi-10x-x-109">weight</span><span
class="cmr-10x-x-109">[</span>location<sub><span
class="cmmi-8">j</span></sub><span
class="cmr-10x-x-109">]</span>) are defined ad hoc for locations like title, metadata,
HTML headings, and plain text. However, the exact values of these weights do not
seem to play a large role in the precision of the algorithm <span class="cite">[<a
href="DocMainli2.html#Xardo05:_ECDL">10</a>]</span>.
</dd><dt class="description">
<span
class="ecbx-1095">hits</span> </dt><dd
class="description">(<span
class="cmmi-10x-x-109">hits</span><span
class="cmr-10x-x-109">[</span>location<sub><span
class="cmmi-8">j</span></sub><span
class="cmr-10x-x-109">][</span>term<sub><span
class="cmmi-8">i</span></sub><span
class="cmr-10x-x-109">]</span>) is the number of times term<sub><span
class="cmmi-8">i</span></sub> occur in the text of location<sub><span
class="cmmi-8">j</span></sub></dd></dl>
<!--l. 459--><p class="indent" > The summed relevance score might, for certain applications, have to be normalized with
respect to the text size of the document.
<!--l. 462--><p class="indent" > One problem with this algorithm is that a term found at the beginning of the text
contributes as much as a term found at the end of a large document. Another
problem is that the distance, and thus the coupling, between two terms in a Boolean expression
might be very large in a big document; this is not taken into account by the above
algorithm.
<!--l. 469--><p class="noindent" >
<h5 class="subsubsectionHead"><span class="titlemark">4.5.5 </span> <a
id="x19-350004.5.5"></a>Algorithm 2: position weighted matching</h5>
<!--l. 471--><p class="noindent" >This algorithm is selected by setting the configuration parameter<br
class="newline" /><span class="obeylines-h"><span class="verb"><span
class="ectt-1095"> </span><span
class="ectt-1095"> </span><span
class="ectt-1095"> </span><span
class="ectt-1095"> classifyPlugIn</span><span
class="ectt-1095"> =</span><span
class="ectt-1095"> Combine::PosCheck_record</span></span></span>
<!--l. 474--><p class="indent" > In response to the problems cited above we developed a modified version of the algorithm
that takes into account word position in the text and proximity for Boolean terms. It also
eliminates the need to assign ad hoc weights to locations. The new algorithm works as
follows.
<!--l. 480--><p class="indent" > First, all text from all locations is concatenated (in the natural order of importance: title,
metadata, text) into one chunk of text. Matching of terms is done against this chunk. The relevance
score is calculated as<center class="par-math-display" >
<img
src="DocMain7x.png" alt="Relevance_score =
" class="par-math-display" ></center>
<!--l. 485--><p class="nopar" >
<center class="math-display" >
<img
src="DocMain8x.png" alt=" ( )
∑ ∑ weight[termi]
( ------------------------------------------------------)
all terms all matches log(k * position[termi ][matchj])* proximity[termi ][matchj]
" class="math-display" ></center>
<!--l. 487--><p class="nopar" >
<!--l. 489--><p class="indent" >
<dl class="description"><dt class="description">
<span
class="ecbx-1095">term weight</span> </dt><dd
class="description">(<span
class="cmmi-10x-x-109">weight</span><span
class="cmr-10x-x-109">[</span>term<sub><span
class="cmmi-8">i</span></sub><span
class="cmr-10x-x-109">]</span>) is taken from the topic definition triplets
</dd><dt class="description">
<span
class="ecbx-1095">position</span> </dt><dd
class="description">(<span
class="cmmi-10x-x-109">position</span><span
class="cmr-10x-x-109">[</span>term<sub><span
class="cmmi-8">i</span></sub><span
class="cmr-10x-x-109">][</span>match<sub><span
class="cmmi-8">j</span></sub><span
class="cmr-10x-x-109">]</span>) is the position in the text (starting from 1) for match<sub><span
class="cmmi-8">j</span></sub>
of term<sub><span
class="cmmi-8">i</span></sub>. The constant factor <span
class="cmmi-10x-x-109">k </span>is normally <span
class="cmr-10x-x-109">0</span><span
class="cmmi-10x-x-109">.</span><span
class="cmr-10x-x-109">5</span>
</dd><dt class="description">
<span
class="ecbx-1095">proximity</span> </dt><dd
class="description">(<span
class="cmmi-10x-x-109">proximity</span><span
class="cmr-10x-x-109">[</span>term<sub><span
class="cmmi-8">i</span></sub><span
class="cmr-10x-x-109">][</span>match<sub><span
class="cmmi-8">j</span></sub><span
class="cmr-10x-x-109">]</span>) is
<div class="tabular"> <table class="tabular"
cellspacing="0" cellpadding="0"
><colgroup id="TBL-2-1g"><col
id="TBL-2-1"><col
id="TBL-2-2"></colgroup><tr
style="vertical-align:baseline;" id="TBL-2-1-"><td style="white-space:nowrap; text-align:center;" id="TBL-2-1-1"
class="td11"> 1 </td><td style="white-space:nowrap; text-align:left;" id="TBL-2-1-2"
class="td11">for non-Boolean terms</td>
</tr><tr
style="vertical-align:baseline;" id="TBL-2-2-"><td style="white-space:nowrap; text-align:center;" id="TBL-2-2-1"
class="td11"><span
class="cmr-10x-x-109">log</span><span
class="cmr-10x-x-109">(</span><span
class="cmmi-10x-x-109">distance</span>_<span
class="cmmi-10x-x-109">between</span>_<span
class="cmmi-10x-x-109">components</span><span
class="cmr-10x-x-109">)</span></td><td style="white-space:nowrap; text-align:left;" id="TBL-2-2-2"
class="td11">for Boolean terms </td>
</tr></table></div></dd></dl>
<!--l. 505--><p class="indent" > In this algorithm a term matched close to the start of the text contributes more to the
relevance score than a match towards the end of the text. For Boolean terms, the closer together
the components are, the higher their contribution to the relevance score.
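<p class="indent" > The scoring loop can be sketched as follows. This is an illustrative Python sketch, not the Combine implementation (which is written in Perl); the guard for small positions, where log(k * position) is zero or negative, is our assumption, since that case is not specified above.

```python
import math

def relevance_score(words, weights, k=0.5):
    """Illustrative sketch of the position-weighted relevance score.

    words   -- tokens from title, metadata and text, concatenated in that order
    weights -- term -> weight, taken from the topic definition triplets
    """
    score = 0.0
    for term, weight in weights.items():
        for pos, word in enumerate(words, start=1):  # positions start from 1
            if word != term:
                continue
            proximity = 1.0                 # 1 for non-Boolean terms
            denom = math.log(k * pos) * proximity
            if denom <= 0:                  # assumption: skip degenerate small positions
                continue
            score += weight / denom
    return score
```

An early match (small position) gives a small log factor and therefore a large contribution, matching the behaviour described above.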
<!--l. 510--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.6 </span> <a
id="x19-360004.6"></a>Built-in topic filter – automated subject classification using SVM</h4>
<!--l. 511--><p class="noindent" >Topic filtering using SVM (Support Vector Machine) classifiers is supported through the SVMLight
package<span class="footnote-mark"><a
href="DocMain24.html#fn17x0"><sup class="textsuperscript">17</sup></a></span><a
id="x19-36001f17"></a>.
This package has to be installed manually together with the
Algorithm::SVMLight Perl module. For installation hints see CPAN SVMLight
README<span class="footnote-mark"><a
href="DocMain25.html#fn18x0"><sup class="textsuperscript">18</sup></a></span><a
id="x19-36002f18"></a> or
’installing-algorithm-svmlight-linux-ubuntu<span class="footnote-mark"><a
href="DocMain26.html#fn19x0"><sup class="textsuperscript">19</sup></a></span><a
id="x19-36003f19"></a>’.
<!--l. 517--><p class="indent" > SVM classifiers need a trained model before they can be used.
<!--l. 519--><p class="indent" > The procedure to get started is as follows:
<ul class="itemize1">
<li class="itemize">Make sure that Algorithm::SVMLight<span class="footnote-mark"><a
href="DocMain27.html#fn20x0"><sup class="textsuperscript">20</sup></a></span><a
id="x19-36004f20"></a>
and SVMLight<span class="footnote-mark"><a
href="DocMain28.html#fn21x0"><sup class="textsuperscript">21</sup></a></span><a
id="x19-36005f21"></a>
are installed.
</li>
<li class="itemize">Collect examples of good and bad URLs that define your topic (the more the better).
</li>
<li class="itemize">Generate an SVM model with the program <span
class="ectt-1095">combineSVM</span>.
</li>
<li class="itemize">Initialize a new job with <span
class="ectt-1095">combineINIT</span>.
</li>
<li class="itemize">Copy the SVM model to the job’s configuration directory
<span
class="ectt-1095">/etc/combine/<jobname>/SVMmodel.txt</span>.
</li>
<li class="itemize">Edit the configuration file <span
class="ectt-1095">/etc/combine/<jobname>/combine.cfg </span>and add the
following:
<table
class="verbatim"><tr class="verbatim"><td
class="verbatim"><div class="verbatim">
doCheckRecord = 1
 <br />classifyPlugIn = Combine::classifySVM
 <br />SVMmodel = SVMmodel.txt
</div>
</td></tr></table>
<!--l. 531--><p class="nopar" >
</li>
<li class="itemize">Then proceed with crawling as normal.</li></ul>
<h4 class="subsectionHead"><span class="titlemark">4.7 </span> <a
id="x19-370004.7"></a>Topic filter Plug-In API</h4>
<!--l. 536--><p class="noindent" >The configuration variable classifyPlugIn (section <a
href="DocMainse9.html#x42-750009.1.4">9.1.4<!--tex4ht:ref: classifyPlugIn --></a>) is used to find the Perl module that
implements the desired topic filter. The value should be formatted as a valid Perl module
identifier (i.e. the module must be somewhere in the Perl module search path). Combine will call
a subroutine named ’<span class="obeylines-h"><span class="verb"><span
class="ectt-1095">classify</span></span></span>’ in this module, passing an XWI-object as the input parameter. An
XWI-object is a structured object holding all information from parsing a Web-page. The
subroutine must return either 0 or 1, where<br
class="newline" />  0: the record fails to meet the classification criteria, i.e. ignore this record<br
class="newline" />  1: the record is OK: store it in the database and follow its links
<!--l. 548--><p class="indent" > More details on how to write a Plug-In can be found in the example classifyPlugInTemplate.pm
(see Appendix <a
href="DocMainse11.html#x45-196000A.2">A.2<!--tex4ht:ref: classifyPlugInTemplate --></a>).
<!--l. 551--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.8 </span> <a
id="x19-380004.8"></a>Analysis</h4>
<!--l. 553--><p class="noindent" >Extra analysis, enabled by the configuration variable doAnalyse (section <a
href="DocMainse9.html#x42-770009.1.6">9.1.6<!--tex4ht:ref: doAnalyse --></a>), tries to
determine the language of the content and the country of the Web-server. Both are stored in the
internal database.
<!--l. 558--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.9 </span> <a
id="x19-390004.9"></a>Duplicate detection</h4>
<!--l. 559--><p class="noindent" >Duplicates of crawled documents are automatically detected with the aid of an MD5 checksum
calculated on the contents of each document.
<!--l. 562--><p class="indent" > The MD5 checksum is used as the master record key in the internal database, thus preventing
pollution with duplicate pages. All URLs for a page are stored in the record, and a page is not
deleted from the database until the crawler has verified that it is unavailable from all the saved
URLs.
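<p class="indent" > The idea can be sketched as follows, in illustrative Python with a plain dict standing in for the internal database (the names here are ours, not Combine's):

```python
import hashlib

db = {}  # toy stand-in for the internal database: checksum -> record

def store(url, content):
    """Store a crawled page under its MD5 master key.

    Two URLs serving byte-identical content collapse into one record;
    every URL the page was seen at is remembered in that record.
    """
    key = hashlib.md5(content).hexdigest()
    record = db.setdefault(key, {"urls": set(), "content": content})
    record["urls"].add(url)
    return key
```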
<!--l. 568--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.10 </span> <a
id="x19-400004.10"></a>URL recycling</h4>
<!--l. 569--><p class="noindent" >URLs for recycling come from three sources:
<ul class="itemize1">
<li class="itemize">Links extracted during HTML parsing.
</li>
<li class="itemize">Redirects (unless configuration variable UserAgentFollowRedirects (section <a
href="DocMainse9.html#x42-1010009.1.30">9.1.30<!--tex4ht:ref: UserAgentFollowRedirects --></a>) is
set).
</li>
<li class="itemize">URLs extracted from plain text (enabled by the configuration variable
extractLinksFromText (section <a
href="DocMainse9.html#x42-800009.1.9">9.1.9<!--tex4ht:ref: extractLinksFromText --></a>)).</li></ul>
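<p class="indent" > For the last source, a deliberately naive sketch of what extracting URLs from plain text can look like (the regular expression is our illustration only; it is not the heuristic Combine itself uses):

```python
import re

# Naive pattern: an http(s) scheme followed by non-whitespace, non-quote characters.
URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_links_from_text(text):
    """Return URL-looking substrings found in plain text."""
    return URL_RE.findall(text)
```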
<!--l. 577--><p class="indent" > Automatic recycling of URLs is enabled by the configuration variable AutoRecycleLinks
(section <a
href="DocMainse9.html#x42-730009.1.2">9.1.2<!--tex4ht:ref: AutoRecycleLinks --></a>). It can also be done manually with the command<br
class="newline" /><span class="obeylines-h"><span class="verb"><span
class="ectt-1095">combineCtrl</span><span
class="ectt-1095"> --jobname</span><span
class="ectt-1095"> XXXX</span><span
class="ectt-1095"> recyclelinks</span></span></span>
<!--l. 582--><p class="indent" > The command <span class="obeylines-h"><span class="verb"><span
class="ectt-1095">combineCtrl</span><span
class="ectt-1095"> --jobname</span><span
class="ectt-1095"> XXXX</span><span
class="ectt-1095"> reharvest</span></span></span> marks all pages in the database for
harvesting again.
<!--l. 585--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.11 </span> <a
id="x19-410004.11"></a>Database cleaning</h4>
<!--l. 587--><p class="noindent" >The tool <span
class="ectt-1095">combineUtil </span>implements functionality for cleaning the database.
<!--l. 589--><p class="indent" >
<dl class="description"><dt class="description">
<span
class="ecbx-1095">sanity/restoreSanity</span> </dt><dd
class="description">check and restore, respectively, the consistency of the internal database.
</dd><dt class="description">
<span
class="ecbx-1095">deleteNetLoc/deletePath/deleteMD5/deleteRecordid</span> </dt><dd
class="description">delete records from the
database based on the supplied parameters.
</dd><dt class="description">
<span
class="ecbx-1095">serverAlias</span> </dt><dd
class="description">detects Web-server aliases in the database. All detected alias groups are
added to the serveralias configuration (section <a
href="DocMainse9.html#x42-1210009.2.5">9.2.5<!--tex4ht:ref: serveralias --></a>). Records from aliased servers
(all except the first Web-server in each group) will be deleted.</dd></dl>
<!--l. 601--><p class="noindent" >
<h4 class="subsectionHead"><span class="titlemark">4.12 </span> <a
id="x19-420004.12"></a>Complete application – SearchEngine in a Box</h4>
<!--l. 603--><p class="noindent" >The SearchEngine-in-a-Box<span class="footnote-mark"><a
href="DocMain29.html#fn22x0"><sup class="textsuperscript">22</sup></a></span><a
id="x19-42001f22"></a>
system is based on two components: the Combine focused crawler and the Zebra text indexing and retrieval
engine<span class="footnote-mark"><a
href="DocMain30.html#fn23x0"><sup class="textsuperscript">23</sup></a></span><a
id="x19-42002f23"></a>.
This system allows you to build a vertical search engine for your favorite topic in a few easy
steps.
<!--l. 611--><p class="indent" > The SearchEngine-in-a-Box Web-site contains instructions and downloads to make this
happen. Basically it makes use of the ZebraHost (see section <a
href="DocMainse9.html#x42-1150009.1.44">9.1.44<!--tex4ht:ref: ZebraHost --></a>) configuration variable, which
enables direct communication between the crawler and the database system: records are
indexed as soon as they are crawled, and thus become immediately searchable.
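<p class="indent" > Assuming a Zebra server is already running, the coupling amounts to one line in the job's configuration file <span class="ectt-1095">/etc/combine/&lt;jobname&gt;/combine.cfg</span>; the value below is a placeholder for your own server, not a default from this manual (see section 9.1.44 for the exact format):

```
ZebraHost = <your Zebra server>
```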
<!--l. 1--><div class="crosslinks"><p class="noindent">[<a
href="DocMainse3.html" >prev</a>] [<a
href="DocMainse3.html#tailDocMainse3.html" >prev-tail</a>] [<a
href="DocMainse4.html" >front</a>] [<a
href="DocMainpa1.html# " >up</a>] </p></div>
<!--l. 1--><p class="indent" > <a
id="tailDocMainse4.html"></a>
</body></html>