<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"  
  "http://www.w3.org/TR/html4/loose.dtd">  
<html > 
<head><title>Introduction</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
<meta name="generator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<meta name="originator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<!-- html,2 --> 
<meta name="src" content="DocMain.tex"> 
<meta name="date" content="2009-06-16 09:20:00"> 
<link rel="stylesheet" type="text/css" href="DocMain.css"> 
</head><body 
>
   <!--l. 1--><div class="crosslinks"><p class="noindent">[<a 
href="DocMainse5.html" >next</a>] [<a 
href="#tailDocMainse1.html">tail</a>] [<a 
href="DocMainpa1.html#DocMainse1.html" >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">1   </span> <a 
 id="x4-30001"></a>Introduction</h3>
<!--l. 9--><p class="noindent" >The Combine system is an open, free, and highly configurable system for focused crawling of
Internet resources. It aims at providing a robust and efficient tool for creating topic-specific
moderate sized databases (up to a few million records). Crawling speed is around 200
URLs per minute and a complete structured record takes up an average of 25 kilobytes
disk-space.
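<p class="indent" >   As a rough back-of-the-envelope estimate based on these figures, a full database of one
million records would take about 5,000 minutes (roughly three and a half days) of continuous
crawling to build and would occupy on the order of 25 gigabytes of disk space.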
<!--l. 16--><p class="indent" >   <hr class="figure"><div class="figure" 
><table class="figure"><tr class="figure"><td class="figure" 
>

<a 
 id="x4-30011"></a>

<div class="center" 
>
<!--l. 17--><p class="noindent" >
<!--l. 18--><p class="noindent" ><img 
src="DocMain0x.png" alt="PIC" class="graphics" width="256.08429pt" height="196.32939pt" ><!--tex4ht:graphics  
name="DocMain0x.png" src="focusedrobot.eps"  
--></div>
<br /> <table class="caption" 
><tr style="vertical-align:baseline;" class="caption"><td class="id">Figure&#x00A0;1: </td><td  
class="content">Overview of the Combine focused crawler.</td></tr></table><!--tex4ht:label?: x4-30011 -->

<!--l. 22--><p class="indent" >   </td></tr></table></div><hr class="endfigure">
<!--l. 24--><p class="indent" >Main features include:
     <ul class="itemize1">
     <li class="itemize">part of the SearchEngine-in-a-Box<span class="footnote-mark"><a 
href="DocMain5.html#fn1x0"><sup class="textsuperscript">1</sup></a></span><a 
 id="x4-3002f1"></a>
     system
     </li>
     <li class="itemize">extensive configuration possibilities
     </li>
     <li class="itemize">integrated topic filter (automated topic classifier) for focused crawling mode
     </li>
     <li class="itemize">possibility to use any topic filter (if provided as a Perl Plug-In module<span class="footnote-mark"><a 
href="DocMain6.html#fn2x0"><sup class="textsuperscript">2</sup></a></span><a 
 id="x4-3003f2"></a>)
     in focused crawling mode
     </li>
     <li class="itemize">crawl limitations based on regular expression on URLs - both include and exclude
     rules (URL focus filter)
     </li>
     <li class="itemize">character set detection/normalization
     </li>
     <li class="itemize">language detection
     </li>
     <li class="itemize">HTML cleaning
     </li>
     <li class="itemize">metadata extraction
     </li>
     <li class="itemize">duplicate detection
     </li>
     <li class="itemize">HTML parsing to provide structured records for each crawled page
     </li>
     <li class="itemize">support  for  many  document  formats  (text,  HTML,  PDF,  PostScript,  MSWord,
     MSPowerPoint, MSExcel, RTF, TeX, images)
     </li>
     <li class="itemize">SQL database for data storage and administration</li></ul>
<!--l. 42--><p class="indent" >   Naturally it obeys the Robots Exclusion
Protocol<span class="footnote-mark"><a 
href="DocMain7.html#fn3x0"><sup class="textsuperscript">3</sup></a></span><a 
 id="x4-3004f3"></a>

and behaves politely towards Web servers. Besides focused crawls (generating topic-specific databases),
Combine supports configurable rules governing what is crawled, based on regular expressions on URLs
(the URL focus filter). The crawler is designed to run continuously in order to keep crawled
databases as up to date as possible. It can be stopped and restarted at any time without losing
any state or information.
<!--l. 51--><p class="indent" >   The operation of Combine (overview in Figure <a 
href="#x4-30011">1<!--tex4ht:ref: overview --></a>) as a focused crawler is based on a
combination of a general Web crawler and an automated subject classifier. The topic focus is
provided by a focus filter using a topic definition implemented as a thesaurus, where each term is
connected to a topic class.
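<p class="indent" >   A thesaurus-based topic definition of this kind can be thought of as a weighted term list
in which each term points to a topic class: terms matched in a page contribute their weights to
the score of their class. The following Perl fragment is only a conceptual sketch of that idea,
with made-up terms, weights, and classes; it is not Combine&#8217;s actual topic-filter code or
term-list format, which is documented in the &#8217;Gory details&#8217; part.
<pre class="verbatim">
#!/usr/bin/perl
use strict;
use warnings;

# Made-up thesaurus entries: each term carries a weight and a topic class.
my %thesaurus = (
    'rock mechanics'     => { weight => 75, class => 'CE.H' },
    'tunnel boring'      => { weight => 50, class => 'CE.H' },
    'soil stabilization' => { weight => 40, class => 'CE.G' },
);

# Sum the weights of all matching terms per topic class.
sub score_text {
    my ($text) = @_;
    my %score;
    while ( my ($term, $info) = each %thesaurus ) {
        $score{ $info->{class} } += $info->{weight}
            if $text =~ /\Q$term\E/i;
    }
    return %score;
}

# Example: a page mentioning two CE.H terms scores 125 for that class.
my %score = score_text('Recent advances in rock mechanics and tunnel boring machines');
print "$_: $score{$_}\n" for keys %score;
</pre>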
<!--l. 57--><p class="indent" >   Crawled data are stored as a structured records in a local relational database.
<!--l. 61--><p class="indent" >   Section <a 
href="DocMainse2.html#x8-40002">2<!--tex4ht:ref: distr --></a> outlines how to download, install and test the Combine system and includes use
scenarios &#8211; useful in order to get a jump start at using the system.
<!--l. 64--><p class="indent" >   Section <a 
href="DocMainse3.html#x18-190003">3<!--tex4ht:ref: configuration --></a> discusses configuration structure and highlights a few important configuration
variables.
<!--l. 66--><p class="indent" >   Section <a 
href="DocMainse4.html#x19-250004">4<!--tex4ht:ref: operation --></a> describes policies and methods used by the crawler.
<!--l. 68--><p class="indent" >   Evaluation and performance are treated in sections <a 
href="DocMainse5.html#x31-430005">5<!--tex4ht:ref: autoclasseval --></a> and <a 
href="DocMainse6.html#x34-560006">6<!--tex4ht:ref: performance --></a>.
<!--l. 73--><p class="indent" >   The system has a number of components (see section <a 
href="DocMainse7.html#x35-600007">7<!--tex4ht:ref: comp --></a>), the main ones visible to the user
being <span 
class="ectt-1095">combineCtrl </span>which is used to start and stop crawling and view crawler status, and
<span 
class="ectt-1095">combineExport </span>that extracts crawled data from the internal database and exports them as XML
records.
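<p class="indent" >   As an illustration only &#8211; the exact command-line syntax is given in the use scenarios of
Section <a 
href="DocMainse2.html#x8-40002">2</a> &#8211; a crawl job might be started with something like <span 
class="ectt-1095">combineCtrl start --jobname myjob</span> and its records later written out with <span 
class="ectt-1095">combineExport --jobname myjob</span>, where <span 
class="ectt-1095">myjob</span> is a hypothetical job name.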
<!--l. 79--><p class="indent" >   Further details (lots and lots of them) can be found in part <a 
href="DocMainpa2.html#x40-68000II">II<!--tex4ht:ref: gory --></a> &#8217;Gory details&#8217; and in Appendix
<a 
href="DocMainse11.html#x45-193000A">A<!--tex4ht:ref: appendix --></a>.

   <!--l. 1--><div class="crosslinks"><p class="noindent">[<a 
href="DocMainse5.html" >next</a>] [<a 
href="DocMainse1.html" >front</a>] [<a 
href="DocMainpa1.html#DocMainse1.html" >up</a>] </p></div>
<!--l. 1--><p class="indent" >   <a 
 id="tailDocMainse1.html"></a>  
</body></html>