<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"  
  "http://www.w3.org/TR/html4/loose.dtd">  
<html > 
<head><title>Crawler internal operation</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
<meta name="generator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<meta name="originator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<!-- html,2 --> 
<meta name="src" content="DocMain.tex"> 
<meta name="date" content="2009-06-16 09:20:00"> 
<link rel="stylesheet" type="text/css" href="DocMain.css"> 
</head><body 
>
   <!--l. 1--><div class="crosslinks"><p class="noindent">[<a 
href="DocMainse3.html" >prev</a>] [<a 
href="DocMainse3.html#tailDocMainse3.html" >prev-tail</a>] [<a 
href="#tailDocMainse4.html">tail</a>] [<a 
href="DocMainpa1.html# " >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">4   </span> <a 
 id="x19-250004"></a>Crawler internal operation</h3>
<!--l. 3--><p class="noindent" >The system is designed for continuous operation. The harvester processes a URL in several
steps as detailed in Figure <a 
href="#x19-250032">2<!--tex4ht:ref: combinearch --></a>. As a start-up initialization the frontier has to be seeded
with some relevant URLs. All URLs are normalized before they are entered in the
database. Data can be exported in various formats including the ALVIS XML document
format<span class="footnote-mark"><a 
href="DocMain20.html#fn13x0"><sup class="textsuperscript">13</sup></a></span><a 
 id="x19-25001f13"></a> and
Dublin Core<span class="footnote-mark"><a 
href="DocMain21.html#fn14x0"><sup class="textsuperscript">14</sup></a></span><a 
 id="x19-25002f14"></a>
records.
<!--l. 12--><p class="indent" >   <hr class="figure"><div class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>

<a 
 id="x19-250032"></a>

<div class="center" 
>
<!--l. 14--><p class="noindent" >
<!--l. 15--><p class="noindent" ><img 
src="DocMain1x.png" alt="PIC" class="graphics" width="192.84303pt" height="392.65201pt" ><!--tex4ht:graphics  
name="DocMain1x.png" src="CrawlerArchitecture.xfig.eps"  
--></div>
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure&#x00A0;2: </td><td  
class="content">Architecture for the Combine focused crawler.</td></tr></table><!--tex4ht:label?: x19-250032 -->

<!--l. 20--><p class="indent" >   </td></tr></table></div><hr class="endfigure">
<!--l. 22--><p class="indent" >   The steps taken during crawling (numbers refer to Figure <a 
href="#x19-250032">2<!--tex4ht:ref: combinearch --></a>):
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x19-25005x1">The next URL is fetched from the scheduler.
     </li>
     <li 
  class="enumerate" id="x19-25007x2">Combine obeys the Robots Exclusion Protocol<span class="footnote-mark"><a 
href="DocMain22.html#fn15x0"><sup class="textsuperscript">15</sup></a></span><a 
 id="x19-25008f15"></a>.
     Rules are cached locally.
     </li>
     <li 
  class="enumerate" id="x19-25010x3">The page is retrieved using a GET, GET-IF-MODIFIED, or HEAD HTTP request.
     </li>
     <li 
  class="enumerate" id="x19-25012x4">The HTML code is cleaned and normalized.
     </li>
     <li 
  class="enumerate" id="x19-25014x5">The character-set is detected and normalized to UTF-8.
     </li>
     <li 
  class="enumerate" id="x19-25016x6">
          <ol  class="enumerate2" >
          <li 
  class="enumerate" id="x19-25018x1">The  page  (in  any  of  the  formats  PDF,  PostScript,  MSWord,  MSExcel,
          MSPowerPoint, RTF and TeX/LaTeX) is converted to HTML or plain text by
          an external program.
          </li>
          <li 
  class="enumerate" id="x19-25020x2">Internal  parsers  handles  HTML,  plain  text  and  images.  This  step  extracts
          structured information like metadata (title, keywords, description ...), HTML
          links, and text without markup.</li></ol>
     </li>
     <li 
  class="enumerate" id="x19-25022x7">The document is sent to the topic filter (see section <a 
href="#x19-300004.5">4.5<!--tex4ht:ref: autoclass --></a>). If the Web-page is relevant with
     respect to the focus topic, processing continues with:
          <ol  class="enumerate2" >
          <li 
  class="enumerate" id="x19-25024x1">Heuristics like score propagation.
          </li>
          <li 
  class="enumerate" id="x19-25026x2">Further analysis, like genre and language identification.
          </li>
          <li 
  class="enumerate" id="x19-25028x3">Updating the record database.
          </li>
          <li 
  class="enumerate" id="x19-25030x4">Updating the frontier database with HTML links and URLs extracted from
          plain text.
          </li></ol>
     </li></ol>
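<p class="indent" >   As an illustration of steps 1&#8211;3, the following minimal Perl sketch fetches a single URL while honouring the Robots Exclusion Protocol. It is not Combine&#8217;s internal code; it only mirrors the flow using the standard LWP::UserAgent and WWW::RobotRules modules, and the agent name and URL are invented.
<pre class="verbatim">
#!/usr/bin/perl
# Sketch of the per-URL crawl cycle (steps 1-3); not Combine's internals.
use strict;
use warnings;
use LWP::UserAgent;
use WWW::RobotRules;

my $ua    = LWP::UserAgent->new(agent => 'combine-sketch/0.1');
my $rules = WWW::RobotRules->new('combine-sketch/0.1');

# Step 1: the next URL would come from the scheduler; here it is hard-coded.
my $url = 'http://example.org/index.html';

# Step 2: obey the Robots Exclusion Protocol (rules are cached in $rules).
my ($robots_url) = $url =~ m{^(https?://[^/]+)};
$robots_url .= '/robots.txt';
my $robots = $ua->get($robots_url);
$rules->parse($robots_url, $robots->decoded_content // '') if $robots->is_success;

if ($rules->allowed($url)) {
    # Step 3: retrieve the page (a plain GET in this sketch).
    my $response = $ua->get($url);
    print $response->status_line, "\n";
}
</pre>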
<!--l. 96--><p class="indent" >   Depending on several factors like configuration, hardware, network, workload, the crawler
normally processes between 50 and 200 URLs per minute.

   <h4 class="subsectionHead"><span class="titlemark">4.1   </span> <a 
 id="x19-260004.1"></a>URL selection criteria</h4>
<!--l. 103--><p class="noindent" >In order to successfully select and crawl one URL the following conditions (in this order) have to
be met:
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x19-26002x1">The URL has to be selected by the scheduling algorithm (section <a 
href="#x19-290004.4">4.4<!--tex4ht:ref: sched --></a>).<br 
class="newline" />
     <!--l. 109--><p class="noindent" ><span 
class="ecti-1095">Relevant                                                                             configuration</span>
     <span 
class="ecti-1095">variables: </span>WaitIntervalHost (section <a 
href="DocMainse9.html#x42-1100009.1.39">9.1.39<!--tex4ht:ref: WaitIntervalHost --></a>), WaitIntervalHarvesterLockRobotRules
     (section <a 
href="DocMainse9.html#x42-1070009.1.36">9.1.36<!--tex4ht:ref: WaitIntervalHarvesterLockRobotRules --></a>), WaitIntervalHarvesterLockSuccess (section <a 
href="DocMainse9.html#x42-1080009.1.37">9.1.37<!--tex4ht:ref: WaitIntervalHarvesterLockSuccess --></a>)
     </li>
     <li 
  class="enumerate" id="x19-26004x2">The URL has to pass the allow test.
     <!--l. 116--><p class="noindent" ><span 
class="ecti-1095">Relevant configuration variables: </span>allow (section <a 
href="DocMainse9.html#x42-1170009.2.1">9.2.1<!--tex4ht:ref: allow --></a>)
     </li>
     <li 
  class="enumerate" id="x19-26006x3">The URL is not be excluded by the exclude test (see section <a 
href="#x19-280004.3">4.3<!--tex4ht:ref: urlfilt --></a>).
     <!--l. 121--><p class="noindent" ><span 
class="ecti-1095">Relevant configuration variables: </span>exclude (section <a 
href="DocMainse9.html#x42-1200009.2.4">9.2.4<!--tex4ht:ref: exclude --></a>)
     </li>
     <li 
  class="enumerate" id="x19-26008x4">The Robot Exclusion Protocol has to allow crawling of the URL.
     </li>
     <li 
  class="enumerate" id="x19-26010x5">Optionally the document at the URL location has to pass the topic filter (section
     <a 
href="#x19-300004.5">4.5<!--tex4ht:ref: autoclass --></a>).
     <!--l. 129--><p class="noindent" ><span 
class="ecti-1095">Relevant  configuration  variables:  </span>classifyPlugIn  (section  <a 
href="DocMainse9.html#x42-750009.1.4">9.1.4<!--tex4ht:ref: classifyPlugIn --></a>),  doCheckRecord
     (section <a 
href="DocMainse9.html#x42-780009.1.7">9.1.7<!--tex4ht:ref: doCheckRecord --></a>).
     </li></ol>
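<p class="indent" >   The lock intervals above are ordinary variables in the job configuration file. A minimal sketch, with illustrative values only (not Combine&#8217;s defaults):
<pre class="verbatim">
# /etc/combine/&#x003C;jobname&#x003E;/combine.cfg (illustrative values only)
WaitIntervalHost = 60
WaitIntervalHarvesterLockSuccess = 86400
WaitIntervalHarvesterLockRobotRules = 86400
</pre>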
<!--l. 135--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.2   </span> <a 
 id="x19-270004.2"></a>Document parsing and information extraction</h4>
<!--l. 137--><p class="noindent" >Each document is parsed and analyzed by the crawler in order to store structured document
records in the internal MySQL database. The structure of the record includes the
fields:
     <ul class="itemize1">
     <li class="itemize">Title
     </li>
     <li class="itemize">Headings
     </li>
     <li class="itemize">Metadata
     </li>
     <li class="itemize">Plain text

     </li>
     <li class="itemize">Original document
     </li>
     <li class="itemize">Links &#8211; HTML and plain text URLs
     </li>
     <li class="itemize">Link anchor text
     </li>
     <li class="itemize">Mime-Type
     </li>
     <li class="itemize">Dates &#8211; modification, expire, and last checked by crawler
     </li>
     <li class="itemize">Web-server identification</li></ul>
<!--l. 153--><p class="indent" >   Optional some extra analysis can be done, see section <a 
href="#x19-380004.8">4.8<!--tex4ht:ref: analysis --></a>.
<!--l. 155--><p class="indent" >   The system selects a document parser based on the Mime-Type together with available
parsers and converter programs.
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x19-27002x1">For some mime-types an external program is called in order to convert the document
     to a format handled internally (HTML or plain text).
     <!--l. 161--><p class="noindent" ><span 
class="ecti-1095">Relevant configuration variables: </span>converters (section <a 
href="DocMainse9.html#x42-1190009.2.3">9.2.3<!--tex4ht:ref: converters --></a>)
     </li>
     <li 
  class="enumerate" id="x19-27004x2">Internal parsers handle HTML, plain text, TeX, and Image.
     <!--l. 165--><p class="noindent" ><span 
class="ecti-1095">Relevant configuration variables: </span>converters (section <a 
href="DocMainse9.html#x42-1190009.2.3">9.2.3<!--tex4ht:ref: converters --></a>)
     </li></ol>
<!--l. 169--><p class="indent" >   Supporting a new document format is as easy as providing a program that can convert a
document in this format to HTML or plain text. Configuration of the mapping between
document format (Mime-Type) and converter program is done in the complex configuration
variable &#8217;converters&#8217; (section <a 
href="DocMainse9.html#x42-1190009.2.3">9.2.3<!--tex4ht:ref: converters --></a>).
<!--l. 173--><p class="indent" >   Out of the box Combine handle the following document formats: plain text, HTML, PDF,
PostScript, MSWord, MSPowerPoint, MSExcel, RTF, TeX, and images.
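<p class="indent" >   The following Perl fragment sketches what such a Mime-Type to converter dispatch could look like. It is purely illustrative: the real mapping lives in the &#8217;converters&#8217; configuration variable (section 9.2.3), and the external command shown is an assumption, not Combine&#8217;s configured default.
<pre class="verbatim">
# Hypothetical Mime-Type -> external converter dispatch (illustration only).
use strict;
use warnings;
use IPC::Open2;

my %converter = (
    'application/pdf' => ['pdftotext', '-', '-'],   # PDF -> plain text
);

sub to_text {
    my ($mime, $document) = @_;
    my $cmd = $converter{$mime} or return $document;   # handled internally
    # Feed the document to the external program on stdin, read its stdout.
    my $pid = open2(my $out, my $in, @$cmd);
    print {$in} $document;
    close $in;
    local $/;                     # slurp mode
    my $text = &lt;$out&gt;;
    waitpid $pid, 0;
    return $text;
}
</pre>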
<!--l. 190--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.3   </span> <a 
 id="x19-280004.3"></a>URL filtering</h4>
<!--l. 192--><p class="noindent" >Before an URL is accepted for scheduling (either by manual loading or recycling) it is normalized
and validated. This process comprises a number of steps:
     <ul class="itemize1">
     <li class="itemize">Normalization
          <ul class="itemize2">
          <li class="itemize">General practice: host-name lowercasing, port-number substitution, canonical
          URL

          </li>
          <li class="itemize">Removing fragments (ie &#8217;#&#8217; and everything after that)
          </li>
          <li class="itemize">Cleaning CGI repetitions of parameters
          </li>
          <li class="itemize">Collapsing dots (&#8217;./&#8217;, &#8217;../&#8217;) in the path
          </li>
          <li class="itemize">Removing CGI parameters that are session ids, as identified by patterns in the
          configuration variable sessionids (section <a 
href="DocMainse9.html#x42-1220009.2.6">9.2.6<!--tex4ht:ref: sessionids --></a>)
          </li>
          <li class="itemize">Normalizing Web-server names by resolving aliases. Identified by patterns in
          the  configuration  variable  serveralias  (section  <a 
href="DocMainse9.html#x42-1210009.2.5">9.2.5<!--tex4ht:ref: serveralias --></a>).  These  patterns  can  be
          generated by using the program <span 
class="ectt-1095">combineUtil </span>to analyze a crawled corpus.</li></ul>
     </li>
     <li class="itemize">Validation: A URL has to pass all three validation steps outlined below.
          <ul class="itemize2">
          <li class="itemize">URL length has to be less than configuration variable maxUrlLength (section
          <a 
href="DocMainse9.html#x42-860009.1.15">9.1.15<!--tex4ht:ref: maxUrlLength --></a>)
          </li>
          <li class="itemize">Allow test: one of the Perl regular expressions in the configuration variable allow
          (section <a 
href="DocMainse9.html#x42-1170009.2.1">9.2.1<!--tex4ht:ref: allow --></a>) must match the URL
          </li>
          <li class="itemize">Exclude test: none of the Perl regular expressions in the configuration variable
          exclude (section <a 
href="DocMainse9.html#x42-1200009.2.4">9.2.4<!--tex4ht:ref: exclude --></a>) must match the URL
          </li></ul>
     <!--l. 231--><p class="noindent" >Both allow and exclude can contain two types of regular expressions identified by either
     &#8217;<span 
class="ectt-1095">HOST:</span>&#8217; or &#8217;<span 
class="ectt-1095">URL</span>&#8217; in front of the regular expression. The &#8217;<span 
class="ectt-1095">HOST:</span>&#8217; regular expressions are
     matched only against the Web-server part of the URL while the &#8217;<span 
class="ectt-1095">URL</span>&#8217; regular expressions
     are matched against the entire URL (a Perl sketch of the normalization and validation steps follows this list).</li></ul>
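<p class="indent" >   A condensed Perl sketch of the normalization and validation above, based on the standard URI module, is given below. The allow/exclude patterns and the length limit of 250 are stand-ins for the configuration variables allow, exclude, and maxUrlLength, not their defaults.
<pre class="verbatim">
# Sketch of URL normalization and validation (patterns are examples only).
use strict;
use warnings;
use URI;

my @allow   = (qr{^https?://}i);            # stands in for 'allow'
my @exclude = (qr{\.(?:exe|zip|jpg)$}i);    # stands in for 'exclude'

sub normalize_url {
    my $uri = URI->new(shift)->canonical;   # lowercase host, drop default port
    $uri->fragment(undef);                  # remove '#' and everything after it
    return $uri->as_string;
}

sub url_ok {
    my $url = shift;
    return 0 if length($url) > 250;                  # maxUrlLength
    return 0 unless grep { $url =~ $_ } @allow;      # allow test
    return 0 if     grep { $url =~ $_ } @exclude;    # exclude test
    return 1;
}

print url_ok(normalize_url('HTTP://Example.ORG:80/a#frag')), "\n";   # prints 1
</pre>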
<!--l. 238--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.4   </span> <a 
 id="x19-290004.4"></a>Crawling strategy</h4>
<!--l. 240--><p class="noindent" >The crawler is designed to run continuously in order to keep crawled databases as up-to-date as
possible. Starting and halting crawling is done manually. The configuration variable
AutoRecycleLinks (section <a 
href="DocMainse9.html#x42-730009.1.2">9.1.2<!--tex4ht:ref: AutoRecycleLinks --></a>) determines if the crawler should follow all valid new links or
just take those that already are marked for crawling.
<!--l. 247--><p class="indent" >   All links from a relevant document are extracted, normalized and stored in the structured
record. Those links that pass the selection/validation criteria outlined below are marked for
crawling.
<!--l. 251--><p class="indent" >To mark a URL for crawling requires:
     <ul class="itemize1">
     <li class="itemize">The URL should be from a page that is relevant (i.e. pass the focus filter).

     </li>
     <li class="itemize">The URL scheme must be one of HTTP, HTTPS, or FTP.
     </li>
     <li class="itemize">The  URL  must  not  exceed  the  maximum  length  (configurable,  default  250
     characters).
     </li>
     <li class="itemize">It should pass the &#8217;allow&#8217; test (configurable, default all URLs passes).
     </li>
     <li class="itemize">It should pass the &#8217;exclude&#8217; test (configurable, default excludes malformed URLs,
     some CGI pages, and URLs with file-extensions for binary formats).</li></ul>
<!--l. 260--><p class="indent" >   At each scheduling point one URL from each available (unlocked) host is selected to generate
a ready queue, which is then processed completely before the next scheduling round. Each selected
URL in the ready queue thus fulfills these requirements:
     <ul class="itemize1">
     <li class="itemize">URL must be marked for crawling (see above).
     </li>
     <li class="itemize">URL must be unlocked (each successful access to a URL locks it for a configurable
     time WaitIntervalHarvesterLockSuccess (section <a 
href="DocMainse9.html#x42-1080009.1.37">9.1.37<!--tex4ht:ref: WaitIntervalHarvesterLockSuccess --></a>)).
     </li>
     <li class="itemize">Host of the URL must be unlocked (each access to a host locks it for a configurable
     time WaitIntervalHost (section <a 
href="DocMainse9.html#x42-1100009.1.39">9.1.39<!--tex4ht:ref: WaitIntervalHost --></a>)).</li></ul>
<!--l. 271--><p class="indent" >   This implements a variant of BreathFirst crawling where a page is fetched if and
only if a certain time threshold is exceeded since the last access to the server of that
page.
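<p class="indent" >   The scheduling point can be pictured with the toy Perl sketch below: at most one unlocked URL per unlocked host is moved to the ready queue. The data structures are invented for illustration; Combine keeps this state in its internal database.
<pre class="verbatim">
# Toy sketch of a scheduling point (data structures are illustrative).
use strict;
use warnings;

my $now       = time;
my %host_lock = ();   # host => epoch when the host becomes unlocked
my %url_lock  = ();   # URL  => epoch when the URL becomes unlocked
my %frontier  = (     # host => URLs marked for crawling
    'example.org' => ['http://example.org/a'],
    'example.net' => ['http://example.net/b'],
);

my @ready;
for my $host (keys %frontier) {
    next if ($host_lock{$host} // 0) > $now;    # host still locked
    my ($url) = grep { ($url_lock{$_} // 0) &lt;= $now } @{ $frontier{$host} };
    push @ready, $url if defined $url;          # at most one URL per host
}
# @ready is then processed completely before the next scheduling point.
</pre>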
<!--l. 275--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.5   </span> <a 
 id="x19-300004.5"></a>Built-in topic filter &#8211; automated subject classification using string matching</h4>
<!--l. 277--><p class="noindent" >The built-in topic filter is an approach to automated classification, that uses a topic definition
with a pre-defined controlled vocabulary of topical terms, to determine relevance judgement.
Thus it does not rely on a particular set of seed pages, or a collection of pre-classified example
pages to learn from. It does require that some of the seed pages are relevant and contain links
into the topical area. One simple way of creating a set of seed pages would be to use terms from
the controlled vocabulary as queries for a general-purpose search engine and take the result as
seed pages.
<!--l. 287--><p class="indent" >   The system for automated topic classification (overview in Figure <a 
href="#x19-300013">3<!--tex4ht:ref: topicfilter --></a>), that determines topical
relevance in the topical filter, is based on matching subject terms from a controlled vocabulary in
a topic definition with the text of the document to be classified <span class="cite">[<a 
href="DocMainli2.html#Xardo99:_online99">3</a>]</span>. The topic definition uses
subject classes in a hierarchical classification system (corresponding to topics) and terms
associated with each subject class. Terms can be single words, phrases, or Boolean
AND-expressions connecting terms. Boolean OR-expressions are implicitly handled
by having several different terms associated with the same subject class (see section
<a 
href="#x19-310004.5.1">4.5.1<!--tex4ht:ref: termlist --></a>).

<!--l. 299--><p class="indent" >   The algorithm works by string-to-string matching of terms and text in documents. Each time
a match is found the document is awarded points based on which term is matched and in which
structural part of the document (location) the match is found <span class="cite">[<a 
href="DocMainli2.html#Xardo05:_ECDL">10</a>]</span>. The points are summed to
make the final relevance score of the document. If the score is above a cut-off value the
document is saved in the database together with a (list of) subject classification(s) and
term(s).
<!--l. 308--><p class="indent" >   <hr class="figure"><div class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>

<a 
 id="x19-300013"></a>

<div class="center" 
>
<!--l. 309--><p class="noindent" >
<!--l. 312--><p class="noindent" ><img 
src="DocMain2x.png" alt="PIC" class="graphics" width="426.79134pt" height="219.97188pt" ><!--tex4ht:graphics  
name="DocMain2x.png" src="TopicFilter.xfig.eps"  
--></div>
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure&#x00A0;3: </td><td  
class="content">Overview of the automated topic classification algorithm</td></tr></table><!--tex4ht:label?: x19-300013 -->

<!--l. 316--><p class="indent" >   </td></tr></table></div><hr class="endfigure">
<!--l. 318--><p class="indent" >   By providing a list of known relevant sites in the configuration file <span 
class="ectt-1095">sitesOK.txt </span>(located in
the job-specific configuration directory) the above test can be bypassed. It works by checking
the host part of the URL against the list of known relevant sites; if a match is
found, the page is validated and saved in the database regardless of the outcome of the
algorithm.
   <h5 class="subsubsectionHead"><span class="titlemark">4.5.1   </span> <a 
 id="x19-310004.5.1"></a>Topic definition</h5>
<!--l. 326--><p class="noindent" >Located in <span 
class="ectt-1095">/etc/combine/&#x003C;jobname&#x003E;/topicdefinition.txt</span>. Topic definitions use triplets
(term, relevance weight, topic-classes) as its basic entities. Weights are signed integers and
indicate the relevance of the term with respect to the topic-classes. Higher values indicate more
relevant terms. A large negative value can be used to exclude documents containing that term.
Terms should be all lowercase.
<!--l. 334--><p class="indent" >   Terms can be:
     <ul class="itemize1">
     <li class="itemize">single words
     </li>
     <li class="itemize">a phrase (i.e. all words in exact order)
     </li>
     <li class="itemize">a Boolean AND-expression connecting terms (i.e. all terms must be present but in
     any order). The Boolean AND operator is encoded as &#8217;@and&#8217;.</li></ul>
<!--l. 341--><p class="noindent" >A Boolean OR-expression has to be entered as separate term triplets. The Boolean expression
&#8220;<span 
class="ectt-1095">polymer AND (atactic OR syndiotactic)</span>&#8221; thus has to be translated into two separate
triplets, one containing the term &#8220;<span 
class="ectt-1095">polymer @and atactic</span>&#8221;, and another with &#8220;<span 
class="ectt-1095">polymer @and</span>
<span 
class="ectt-1095">syndiotactic</span>&#8221;.
<!--l. 347--><p class="indent" >   Terms can include (Perl) regular expressions like:
     <ul class="itemize1">
     <li class="itemize">a &#8217;<span 
class="ectt-1095">?</span>&#8217; makes the character immediately preceding optional, i.e. the term &#8220;<span 
class="ectt-1095">coins?</span>&#8221; will
     match both &#8220;<span 
class="ectt-1095">coin</span>&#8221; and &#8220;<span 
class="ectt-1095">coins</span>&#8221;
     </li>
     <li class="itemize">a &#8220;<span 
class="ectt-1095">[</span><img 
src="DocMain3x.png" alt="&#x02C6;  "  class="circ" ><span 
class="cmsy-10x-x-109">\</span><span 
class="ectt-1095">s]*</span>&#8221; is truncation (matches all character sequences except space &#8217; &#8217;),<br 
class="newline" />&#8220;<span 
class="ectt-1095">glass art[</span><img 
src="DocMain4x.png" alt="&#x02C6;  "  class="circ" ><span 
class="cmsy-10x-x-109">\</span><span 
class="ectt-1095">s]*</span>&#8221;  will  match  &#8220;<span 
class="ectt-1095">glass art</span>&#8221;,  &#8220;<span 
class="ectt-1095">glass arts</span>&#8221;,  &#8220;<span 
class="ectt-1095">glass artists</span>&#8221;,
     &#8220;<span 
class="ectt-1095">glass articles</span>&#8221;, and so on.</li></ul>
<!--l. 358--><p class="indent" >   It is important to understand that each triplet in the topic definition is considered by itself
without any context, so they must <span 
class="ecbx-1095">each </span>be topic- or sub-class specific in order to be
useful. Subject-neutral terms like &#8220;use&#8221;, &#8220;test&#8221;, &#8220;history&#8221; should not be used. If really
needed, they have to be qualified so that they become topic-specific (see the examples
below).
<!--l. 366--><p class="indent" >Simple guidelines for creating the triplets and assigning weights are:
     <ul class="itemize1">
     <li class="itemize">Phrases or unique, topic-specific terms, should be used if possible, and assigned the
     highest weights, since they normally are most discriminatory.

     </li>
     <li class="itemize">Boolean AND-expressions are the next best.
     </li>
     <li class="itemize">Single words can be too general and/or have several meanings or uses that make
     them less specific and those should thus be assigned a small weights.
     </li>
     <li class="itemize">Acronyms can be used as terms if they are unique.
     </li>
     <li class="itemize">Negative weights should be used in order to exclude concepts.</li></ul>
<!--l. 380--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">4.5.2   </span> <a 
 id="x19-320004.5.2"></a>Topic definition (term triplets) BNF grammar</h5>
<!--l. 381--><p class="noindent" >TERM-LIST :== TERM-ROW &#8217;<span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">cr</span><span 
class="cmmi-10x-x-109">&#x003E;</span>&#8217; <span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">||</span></span></span> &#8217;<span 
class="ecbx-1095">#</span>&#8217; <span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">char</span><span 
class="cmmi-10x-x-109">&#x003E;</span>+ &#8217;<span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">cr</span><span 
class="cmmi-10x-x-109">&#x003E;</span>&#8217; <span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">||</span></span></span> &#8217;<span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">cr </span><span 
class="cmmi-10x-x-109">&#x003E;</span>&#8217; <br 
class="newline" />TERM-ROW :== WEIGHT &#8217;<span 
class="ecbx-1095">: </span>&#8217; TERMS &#8217;<span 
class="ecbx-1095">=</span>&#8217; CLASS-LIST <br 
class="newline" />WEIGHT :== [&#8217;<span 
class="ecbx-1095">-</span>&#8217;]<span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">integer</span><span 
class="cmmi-10x-x-109">&#x003E; </span><br 
class="newline" />TERMS :== TERM [&#8217; <span 
class="ecbx-1095">@and </span>&#8217; TERMS]* <br 
class="newline" />TERM :== WORD &#8217; &#8217; [WORD]* <br 
class="newline" />WORD :== <span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">char</span><span 
class="cmmi-10x-x-109">&#x003E;</span>+<span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">||</span></span></span><span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">char</span><span 
class="cmmi-10x-x-109">&#x003E;</span>+<span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">perl-reg-exp</span><span 
class="cmmi-10x-x-109">&#x003E; </span><br 
class="newline" />CLASS-LIST :== CLASSID [&#8217;<span 
class="ecbx-1095">,</span>&#8217; CLASS-LIST] <br 
class="newline" />CLASSID :== <span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">char</span><span 
class="cmmi-10x-x-109">&#x003E;</span>+ <br 
class="newline" />
<!--l. 391--><p class="indent" >   A line that starts with &#8217;#&#8217; is ignored and so are empty lines.
<!--l. 393--><p class="indent" >   <span 
class="cmmi-10x-x-109">&#x003C;</span><span 
class="ecbx-1095">perl-reg-exp</span><span 
class="cmmi-10x-x-109">&#x003E; </span>is only supported by the plain matching algorithm described in section
<a 
href="#x19-340004.5.4">4.5.4<!--tex4ht:ref: std --></a>.
<!--l. 396--><p class="indent" >   &#8220;CLASSID&#8221; is a topic (sub-)class specifier, often from a hierarchical classification system like Engineering
Index<span class="footnote-mark"><a 
href="DocMain23.html#fn16x0"><sup class="textsuperscript">16</sup></a></span><a 
 id="x19-32001f16"></a>.
   <h5 class="subsubsectionHead"><span class="titlemark">4.5.3   </span> <a 
 id="x19-330004.5.3"></a>Term triplet examples</h5>

   <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
50:&#x00A0;optical&#x00A0;glass=A.14.5,&#x00A0;D.2.2
&#x00A0;<br />30:&#x00A0;glass&#x00A0;@and&#x00A0;fiberoptics=D.2.2.8
&#x00A0;<br />50:&#x00A0;glass&#x00A0;@and&#x00A0;technical&#x00A0;@and&#x00A0;history=D.2
&#x00A0;<br />50:&#x00A0;ceramic&#x00A0;materials&#x00A0;@and&#x00A0;glass=D.2.1.7
&#x00A0;<br />-10000:&#x00A0;glass&#x00A0;@and&#x00A0;art=A
</div>
</td></tr></table>
<!--l. 407--><p class="nopar" >
<!--l. 409--><p class="indent" >   The first line says that a document containing the term &#8220;<span 
class="ectt-1095">optical glass</span>&#8221; should be awarded
50 points for each of the two classes A.14.5 and D.2.2.
<!--l. 413--><p class="indent" >   &#8220;<span 
class="ectt-1095">glass</span>&#8221; as a single term is probably too general, qualify it with more terms like: &#8220;<span 
class="ectt-1095">glass @and</span>
<span 
class="ectt-1095">fiberoptics</span>&#8221; , or &#8220;<span 
class="ectt-1095">glass @and technical @and history</span>&#8221; or use a phrase like &#8220;<span 
class="ectt-1095">glass fiber</span>&#8221;
or &#8220;<span 
class="ectt-1095">optical glass</span>&#8221;.
<!--l. 417--><p class="indent" >   In order to exclude documents about artistic use of glass the term &#8220;<span 
class="ectt-1095">glass @and art</span>&#8221; can be
used with a (high) negative score.
<!--l. 420--><p class="indent" >   An example from the topic definition for &#8217;Carnivorous Plants&#8217; using regular expressions is
given below:

   <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
#This&#x00A0;is&#x00A0;a&#x00A0;comment
&#x00A0;<br />75:&#x00A0;d\.?\s*californica=CP.Drosophyllum
&#x00A0;<br />10:&#x00A0;pitcher[^\s]*=CP
&#x00A0;<br />-10:&#x00A0;pitcher[^\s]*&#x00A0;@and&#x00A0;baseball=CP
</div>
</td></tr></table>
<!--l. 427--><p class="nopar" > The term &#8220;<span 
class="ectt-1095">d</span><span 
class="cmsy-10x-x-109">\</span><span 
class="ectt-1095">.?</span><span 
class="cmsy-10x-x-109">\</span><span 
class="ectt-1095">s*californica</span>&#8221; will match <span 
class="ectt-1095">D californica, D. californica,</span>
<span 
class="ectt-1095">D.californica </span>etc.
<!--l. 431--><p class="indent" >   The last two lines assure that a document containing &#8220;<span 
class="ectt-1095">pitcher</span>&#8221; gets 10 points but if the
document also contains &#8220;<span 
class="ectt-1095">baseball</span>&#8221; the points are removed.
<!--l. 434--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">4.5.4   </span> <a 
 id="x19-340004.5.4"></a>Algorithm 1: plain matching</h5>
<!--l. 437--><p class="noindent" >This algorithm is selected by setting the configuration parameter<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;classifyPlugIn</span><span 
class="ectt-1095">&#x00A0;=</span><span 
class="ectt-1095">&#x00A0;Combine::Check_record</span></span></span>
<!--l. 440--><p class="indent" >   The algorithm produces a list of suggested topic-classes (subject classifications) and
corresponding relevance scores using the algorithm:<center class="par-math-display" >
Relevance_score = &#x2211;<sub>all locations j</sub> &#x2211;<sub>all terms i</sub> ( hits[location<sub>j</sub>][term<sub>i</sub>] &#x00D7; weight[term<sub>i</sub>] &#x00D7; weight[location<sub>j</sub>] )
</center>
<!--l. 446--><p class="nopar" >
<!--l. 448--><p class="indent" >
     <dl class="description"><dt class="description">
<span 
class="ecbx-1095">term weight</span> </dt><dd 
class="description">(<span 
class="cmmi-10x-x-109">weight</span><span 
class="cmr-10x-x-109">[</span>term<sub><span 
class="cmmi-8">i</span></sub><span 
class="cmr-10x-x-109">]</span>) is taken from the topic definition triplets.
     </dd><dt class="description">
<span 
class="ecbx-1095">location weight</span> </dt><dd 
class="description">(<span 
class="cmmi-10x-x-109">weight</span><span 
class="cmr-10x-x-109">[</span>location<sub><span 
class="cmmi-8">j</span></sub><span 
class="cmr-10x-x-109">]</span>) are defined ad hoc for locations like title, metadata,
     HTML headings, and plain text. However, the exact values of these weights do not
     seem to play a large role in the precision of the algorithm <span class="cite">[<a 
href="DocMainli2.html#Xardo05:_ECDL">10</a>]</span>.
     </dd><dt class="description">
<span 
class="ecbx-1095">hits</span> </dt><dd 
class="description">(<span 
class="cmmi-10x-x-109">hits</span><span 
class="cmr-10x-x-109">[</span>location<sub><span 
class="cmmi-8">j</span></sub><span 
class="cmr-10x-x-109">][</span>term<sub><span 
class="cmmi-8">i</span></sub><span 
class="cmr-10x-x-109">]</span>) is the number of times term<sub><span 
class="cmmi-8">i</span></sub> occur in the text of location<sub><span 
class="cmmi-8">j</span></sub></dd></dl>
<!--l. 459--><p class="indent" >   The summed relevance score might, for certain applications, have to be normalized with
respect to the text size of the document.
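<p class="indent" >   A self-contained Perl sketch of this scoring loop is given below. The terms, weights, location weights, and texts are invented; the authoritative implementation is Combine::Check_record.
<pre class="verbatim">
# Sketch of Algorithm 1 (plain matching); all data is made up.
use strict;
use warnings;

my %term_weight     = ('optical glass' => 50, 'fiberoptics' => 30);
my %location_weight = (title => 10, text => 1);   # ad hoc location weights
my %location_text   = (
    title => 'optical glass research',
    text  => 'optical glass and fiberoptics in practice',
);

my $score = 0;
for my $loc (keys %location_text) {
    for my $term (keys %term_weight) {
        # hits[location][term]: occurrences of the term in this location
        my $hits = () = $location_text{$loc} =~ /\Q$term\E/gi;
        $score += $hits * $term_weight{$term} * $location_weight{$loc};
    }
}
print "Relevance score: $score\n";   # 50*10 + 50*1 + 30*1 = 580
</pre>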

<!--l. 462--><p class="indent" >   One problem with this algorithm is that a term that is found in the beginning of the text
contributes as much as a term that is found at the end of a large document. Another
problem is the distance and thus the coupling between two terms in a Boolean expression
might be very large in a big document and this is not taken into account by the above
algorithm.
<!--l. 469--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">4.5.5   </span> <a 
 id="x19-350004.5.5"></a>Algorithm 2: position weighted matching</h5>
<!--l. 471--><p class="noindent" >This algorithm is selected by setting the configuration parameter<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;classifyPlugIn</span><span 
class="ectt-1095">&#x00A0;=</span><span 
class="ectt-1095">&#x00A0;Combine::PosCheck_record</span></span></span>
<!--l. 474--><p class="indent" >   In response to the problems cited above we developed a modified version of the algorithm
that takes into account word position in the text and proximity for Boolean terms. It also
eliminates the need to assign ad hoc weights to locations. The new algorithm works as
follows.
<!--l. 480--><p class="indent" >   First all text from all locations are concatenated (in the natural importance order title,
metadata, text) into one chunk of text. Matching of terms is done against this chunk. Relevance
score is calculated as<center class="par-math-display" >
Relevance_score = &#x2211;<sub>all terms i</sub> &#x2211;<sub>all matches j</sub> ( weight[term<sub>i</sub>] / ( log(k &#x00D7; position[term<sub>i</sub>][match<sub>j</sub>]) &#x00D7; proximity[term<sub>i</sub>][match<sub>j</sub>] ) )
</center>
<!--l. 487--><p class="nopar" >
<!--l. 489--><p class="indent" >
     <dl class="description"><dt class="description">
<span 
class="ecbx-1095">term weight</span> </dt><dd 
class="description">(<span 
class="cmmi-10x-x-109">weight</span><span 
class="cmr-10x-x-109">[</span>term<sub><span 
class="cmmi-8">i</span></sub><span 
class="cmr-10x-x-109">]</span>) is taken from the topic definition triplets
     </dd><dt class="description">
<span 
class="ecbx-1095">position</span> </dt><dd 
class="description">(<span 
class="cmmi-10x-x-109">position</span><span 
class="cmr-10x-x-109">[</span>term<sub><span 
class="cmmi-8">i</span></sub><span 
class="cmr-10x-x-109">][</span>match<sub><span 
class="cmmi-8">j</span></sub><span 
class="cmr-10x-x-109">]</span>) is the position in the text (starting from 1) for match<sub><span 
class="cmmi-8">j</span></sub>
     of term<sub><span 
class="cmmi-8">i</span></sub>. The constant factor <span 
class="cmmi-10x-x-109">k </span>is normally <span 
class="cmr-10x-x-109">0</span><span 
class="cmmi-10x-x-109">.</span><span 
class="cmr-10x-x-109">5</span>
     </dd><dt class="description">
<span 
class="ecbx-1095">proximity</span> </dt><dd 
class="description">(<span 
class="cmmi-10x-x-109">proximity</span><span 
class="cmr-10x-x-109">[</span>term<sub><span 
class="cmmi-8">i</span></sub><span 
class="cmr-10x-x-109">][</span>match<sub><span 
class="cmmi-8">j</span></sub><span 
class="cmr-10x-x-109">]</span>) is
     <div class="tabular"> <table class="tabular" 
cellspacing="0" cellpadding="0"  
><colgroup id="TBL-2-1g"><col 
id="TBL-2-1"><col 
id="TBL-2-2"></colgroup><tr  
 style="vertical-align:baseline;" id="TBL-2-1-"><td  style="white-space:nowrap; text-align:center;" id="TBL-2-1-1"  
class="td11">               1                       </td><td  style="white-space:nowrap; text-align:left;" id="TBL-2-1-2"  
class="td11">for non Boolean terms</td>
</tr><tr  
 style="vertical-align:baseline;" id="TBL-2-2-"><td  style="white-space:nowrap; text-align:center;" id="TBL-2-2-1"  
class="td11"><span 
class="cmr-10x-x-109">log</span><span 
class="cmr-10x-x-109">(</span><span 
class="cmmi-10x-x-109">distance</span>_<span 
class="cmmi-10x-x-109">between</span>_<span 
class="cmmi-10x-x-109">components</span><span 
class="cmr-10x-x-109">)</span></td><td  style="white-space:nowrap; text-align:left;" id="TBL-2-2-2"  
class="td11">for Boolean terms      </td>
</tr></table></div></dd></dl>
<!--l. 505--><p class="indent" >   In this algorithm a matched term close to the start of text contributes more to the relevance
score than a match towards the end of the text. And for Boolean terms the closer the
components are the higher the contribution to the relevance score.
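<p class="indent" >   As a worked example (assuming natural logarithms and k = 0.5): a term with weight 30 matched at position 10 contributes 30 / log(0.5 &#x00D7; 10) = 30 / log 5 &#x2248; 18.6, while the same term matched at position 200 contributes only 30 / log 100 &#x2248; 6.5. If the term is a Boolean term whose components are 7 words apart, the first contribution is further divided by log 7 &#x2248; 1.95, giving roughly 9.6.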

<!--l. 510--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.6   </span> <a 
 id="x19-360004.6"></a>Built-in topic filter &#8211; automated subject classification using SVM</h4>
<!--l. 511--><p class="noindent" >Topic filetring using SVM (Support Vector Machines) classifiers are supported using the SVMLight
package<span class="footnote-mark"><a 
href="DocMain24.html#fn17x0"><sup class="textsuperscript">17</sup></a></span><a 
 id="x19-36001f17"></a>.
This package has to be installed manually together with the
Algorithm::SVMLight Perl module. For installation hints see CPAN SVMLight
README<span class="footnote-mark"><a 
href="DocMain25.html#fn18x0"><sup class="textsuperscript">18</sup></a></span><a 
 id="x19-36002f18"></a> or
&#8217;installing-algorithm-svmlight-linux-ubuntu<span class="footnote-mark"><a 
href="DocMain26.html#fn19x0"><sup class="textsuperscript">19</sup></a></span><a 
 id="x19-36003f19"></a>&#8217;
<!--l. 517--><p class="indent" >   SVM classifiers need a trained model before they can be used.
<!--l. 519--><p class="indent" >   The procedure to get started is as follows:
     <ul class="itemize1">
     <li class="itemize">Make sure that Algorithm::SVMLight<span class="footnote-mark"><a 
href="DocMain27.html#fn20x0"><sup class="textsuperscript">20</sup></a></span><a 
 id="x19-36004f20"></a>
     and SVMLight<span class="footnote-mark"><a 
href="DocMain28.html#fn21x0"><sup class="textsuperscript">21</sup></a></span><a 
 id="x19-36005f21"></a>
     are installed.
     </li>
     <li class="itemize">Collect examples of good and bad URLs that defines your topic (the more the better).
     </li>
     <li class="itemize">Generate a SVM model with the program <span 
class="ectt-1095">combineSVM</span>.
     </li>
     <li class="itemize">Initialize a new job with <span 
class="ectt-1095">combineINIT</span>.
     </li>
     <li class="itemize">Copy     the     SVM     model     to     the     job&#8217;s     configuration     directory
     <span 
class="ectt-1095">/etc/combine/&#x003C;jobname&#x003E;/SVMmodel.txt</span>.
     </li>
     <li class="itemize">Edit the configuration file <span 
class="ectt-1095">/etc/combine/&#x003C;jobname&#x003E;/combine.cfg </span>and add the
     following:

     <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
     doCheckRecord&#x00A0;=&#x00A0;1
     &#x00A0;<br />classifyPlugIn&#x00A0;=&#x00A0;Combine::classifySVM
     &#x00A0;<br />SVMmodel&#x00A0;=&#x00A0;SVMmodel.txt
</div>
     </td></tr></table>
     <!--l. 531--><p class="nopar" >
     </li>
     <li class="itemize">Then proceed with crawling as normal.</li></ul>
   <h4 class="subsectionHead"><span class="titlemark">4.7   </span> <a 
 id="x19-370004.7"></a>Topic filter Plug-In API</h4>
<!--l. 536--><p class="noindent" >The configuration variable classifyPlugIn (section <a 
href="DocMainse9.html#x42-750009.1.4">9.1.4<!--tex4ht:ref: classifyPlugIn --></a>) is used to find the Perl module that
implements the desired topic filter. The value should be formatted as a valid Perl module
identifier (i.e. the module must be somewhere in the Perl module search path). Combine will call
a subroutine named &#8217;<span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">classify</span></span></span>&#8217; in this module, providing an XWI-object as in parameter. An
XWI-object is a structured object holding all information from parsing a Web-page. The
subroutine must return either 0 or 1, where<br 
class="newline" />&#x00A0;     0: means record fails to meet the classification criteria, i.e. ignore this record<br 
class="newline" />&#x00A0;     1: means record is OK, store it in the database, and follow the links
<!--l. 548--><p class="indent" >   More details on how to write a Plug-In can be found in the example classifyPlugInTemplate.pm
(see Appendix <a 
href="DocMainse11.html#x45-196000A.2">A.2<!--tex4ht:ref: classifyPlugInTemplate --></a>).
<!--l. 551--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.8   </span> <a 
 id="x19-380004.8"></a>Analysis</h4>
<!--l. 553--><p class="noindent" >Extra analysis, enabled by the configuration variable doAnalyse (section <a 
href="DocMainse9.html#x42-770009.1.6">9.1.6<!--tex4ht:ref: doAnalyse --></a>), tries to
determine the language of the content and the country of the Web-server. Both are stored in the
internal database.
<!--l. 558--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.9   </span> <a 
 id="x19-390004.9"></a>Duplicate detection</h4>
<!--l. 559--><p class="noindent" >Duplicates of crawled documents are automatically detected with the aid of a MD5-checksum
calculated on the contents of the document.
<!--l. 562--><p class="indent" >   The MD5-checksum is used as the master record key in the internal database thus preventing
pollution with duplicate pages. All URLs for a page are stored in the record, and a page is not
deleted from the database until the crawler has verified that it&#8217;s unavailable from all the saved
URLs.
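<p class="indent" >   The mechanism can be pictured with a few lines of Perl using the core Digest::MD5 module (the bookkeeping hash is invented; Combine keeps this state in its internal database):
<pre class="verbatim">
# Content-based duplicate detection with an MD5 checksum (sketch).
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

my %seen;   # checksum => first URL seen with this content
sub is_duplicate {
    my ($url, $content) = @_;
    my $key = md5_hex($content);      # plays the role of the master record key
    return 1 if exists $seen{$key};   # same content already recorded
    $seen{$key} = $url;
    return 0;
}
</pre>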

<!--l. 568--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.10   </span> <a 
 id="x19-400004.10"></a>URL recycling</h4>
<!--l. 569--><p class="noindent" >URLs for recycling come from 3 sources:
     <ul class="itemize1">
     <li class="itemize">Links extracted during HTML parsing.
     </li>
     <li class="itemize">Redirects (unless configuration variable UserAgentFollowRedirects (section <a 
href="DocMainse9.html#x42-1010009.1.30">9.1.30<!--tex4ht:ref: UserAgentFollowRedirects --></a>) is
     set).
     </li>
     <li class="itemize">URLs   extracted   from   plain   text   (enabled   by   the   configuration   variable
     extractLinksFromText (section <a 
href="DocMainse9.html#x42-800009.1.9">9.1.9<!--tex4ht:ref: extractLinksFromText --></a>)).</li></ul>
<!--l. 577--><p class="indent" >   Automatic recycling of URLs is enabled by the configuration variable AutoRecycleLinks
(section <a 
href="DocMainse9.html#x42-730009.1.2">9.1.2<!--tex4ht:ref: AutoRecycleLinks --></a>). It can also be done manually with the command<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">combineCtrl</span><span 
class="ectt-1095">&#x00A0;--jobname</span><span 
class="ectt-1095">&#x00A0;XXXX</span><span 
class="ectt-1095">&#x00A0;recyclelinks</span></span></span>
<!--l. 582--><p class="indent" >   The command <span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">combineCtrl</span><span 
class="ectt-1095">&#x00A0;--jobname</span><span 
class="ectt-1095">&#x00A0;XXXX</span><span 
class="ectt-1095">&#x00A0;reharvest</span></span></span> marks all pages in the database for
harvesting again.
<!--l. 585--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.11   </span> <a 
 id="x19-410004.11"></a>Database cleaning</h4>
<!--l. 587--><p class="noindent" >The tool <span 
class="ectt-1095">combineUtil </span>implements functionality for cleaning the database.
<!--l. 589--><p class="indent" >
     <dl class="description"><dt class="description">
<span 
class="ecbx-1095">sanity/restoreSanity</span> </dt><dd 
class="description">checks respectively restore consistency of the internal database.
     </dd><dt class="description">
<span 
class="ecbx-1095">deleteNetLoc/deletePath/deleteMD5/deleteRecordid</span> </dt><dd 
class="description">deletes  records  from  the
     database based on supplied parameters.
     </dd><dt class="description">
<span 
class="ecbx-1095">serverAlias</span> </dt><dd 
class="description">detects Web-server aliases in the database. All detected alias groups are
     added to the serveralias configuration (section <a 
href="DocMainse9.html#x42-1210009.2.5">9.2.5<!--tex4ht:ref: serveralias --></a>). Records from aliased servers
     (except for the first Web-server) will be deleted.</dd></dl>
<!--l. 601--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">4.12   </span> <a 
 id="x19-420004.12"></a>Complete application &#8211; SearchEngine in a Box</h4>
<!--l. 603--><p class="noindent" >The SearchEngine-in-a-Box<span class="footnote-mark"><a 
href="DocMain29.html#fn22x0"><sup class="textsuperscript">22</sup></a></span><a 
 id="x19-42001f22"></a>
system is based on two systems: the Combine Focused Crawler and the Zebra text indexing and retrieval
engine<span class="footnote-mark"><a 
href="DocMain30.html#fn23x0"><sup class="textsuperscript">23</sup></a></span><a 
 id="x19-42002f23"></a>.

This system allows you to build a vertical search engine for your favorite topic in a few easy
steps.
<!--l. 611--><p class="indent" >   The SearchEngine-in-a-Box Web-site contains instructions and downloads to make this
happen. Basically it makes use of the ZebraHost (see section <a 
href="DocMainse9.html#x42-1150009.1.44">9.1.44<!--tex4ht:ref: ZebraHost --></a>) configuration variable which
enables direct communication between the crawler and the database system and thus
indexes records as soon as they are crawled. This also means that they are directly
searchable.

   <!--l. 1--><div class="crosslinks"><p class="noindent">[<a 
href="DocMainse3.html" >prev</a>] [<a 
href="DocMainse3.html#tailDocMainse3.html" >prev-tail</a>] [<a 
href="DocMainse4.html" >front</a>] [<a 
href="DocMainpa1.html# " >up</a>] </p></div>
<!--l. 1--><p class="indent" >   <a 
 id="tailDocMainse4.html"></a>   
</body></html>