<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"  
  "http://www.w3.org/TR/html4/loose.dtd">  
<html > 
<head><title>Open source distribution, installation</title> 
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> 
<meta name="generator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<meta name="originator" content="TeX4ht (http://www.cse.ohio-state.edu/~gurari/TeX4ht/)"> 
<!-- html,2 --> 
<meta name="src" content="DocMain.tex"> 
<meta name="date" content="2009-06-16 09:20:00"> 
<link rel="stylesheet" type="text/css" href="DocMain.css"> 
</head><body 
>
   <!--l. 1--><div class="crosslinks"><p class="noindent">[<a 
href="DocMainse1.html" >prev</a>] [<a 
href="DocMainse1.html#tailDocMainse1.html" >prev-tail</a>] [<a 
href="#tailDocMainse2.html">tail</a>] [<a 
href="DocMainpa1.html#DocMainse5.html" >up</a>] </p></div>
   <h3 class="sectionHead"><span class="titlemark">2   </span> <a 
 id="x8-40002"></a>Open source distribution, installation</h3>
<!--l. 3--><p class="noindent" >The focused crawler has been restructured and packaged as a Debian package to ease
distribution and installation. The package contains dependency information to ensure
that all software needed to run the crawler is installed at the same time. In
connection with this we have also packaged a number of required Perl modules as Debian
packages.
<!--l. 10--><p class="indent" >All software and packages are available from a number of places:
     <ul class="itemize1">
     <li class="itemize">the Combine focused crawler Web-site<span class="footnote-mark"><a 
href="DocMain9.html#fn4x0"><sup class="textsuperscript">4</sup></a></span><a 
 id="x8-4001f4"></a>
     </li>
     <li class="itemize">the Comprehensive Perl Archive Network - CPAN<span class="footnote-mark"><a 
href="DocMain10.html#fn5x0"><sup class="textsuperscript">5</sup></a></span><a 
 id="x8-4002f5"></a>
     </li>
     <li class="itemize">SourceForge project &#8220;Combine focused crawler&#8221;<span class="footnote-mark"><a 
href="DocMain11.html#fn6x0"><sup class="textsuperscript">6</sup></a></span><a 
 id="x8-4003f6"></a></li></ul>
<!--l. 17--><p class="indent" >   In addition to the distribution sites there is a public discussion list at
SourceForge<span class="footnote-mark"><a 
href="DocMain12.html#fn7x0"><sup class="textsuperscript">7</sup></a></span><a 
 id="x8-4004f7"></a>.
   <h4 class="subsectionHead"><span class="titlemark">2.1   </span> <a 
 id="x8-50002.1"></a>Installation</h4>
<!--l. 21--><p class="noindent" >This distribution is developed and tested on Linux systems. It is implemented entirely in Perl and uses
the MySQL<span class="footnote-mark"><a 
href="DocMain13.html#fn8x0"><sup class="textsuperscript">8</sup></a></span><a 
 id="x8-5001f8"></a>
database system, both of which are supported on many other operating systems. Porting to other
UNIX dialects should be easy.
<!--l. 26--><p class="indent" >   The system is distributed either as source or as a Debian package.
   <h5 class="subsubsectionHead"><span class="titlemark">2.1.1   </span> <a 
 id="x8-60002.1.1"></a>Installation from source for the impatient</h5>
<!--l. 29--><p class="noindent" >Unless you are on a system supporting Debian packages (in which case look at Automated
installation (section <a 
href="#x8-80002.1.3">2.1.3<!--tex4ht:ref: debian --></a>)), you should download and unpack the source. The following
command sequence will then install Combine:

   <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
perl&#x00A0;Makefile.PL
&#x00A0;<br />make
&#x00A0;<br />make&#x00A0;test
&#x00A0;<br />make&#x00A0;install
&#x00A0;<br />mkdir&#x00A0;/etc/combine
&#x00A0;<br />cp&#x00A0;conf/*&#x00A0;/etc/combine/
&#x00A0;<br />mkdir&#x00A0;/var/run/combine
</div>
</td></tr></table>
<!--l. 40--><p class="nopar" >
<!--l. 42--><p class="indent" >   Test that it all works (run as root)<br 
class="newline" /><span 
class="ectt-1095">./doc/InstallationTest.pl</span>
<!--l. 45--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.1.2   </span> <a 
 id="x8-70002.1.2"></a>Porting to unsupported operating systems &#8211; dependencies</h5>
<!--l. 46--><p class="noindent" >In order to port the system to another platform, you have to verify that the two main
systems are available for that platform:
     <ul class="itemize1">
     <li class="itemize">Perl<span class="footnote-mark"><a 
href="DocMain14.html#fn9x0"><sup class="textsuperscript">9</sup></a></span><a 
 id="x8-7001f9"></a>
     </li>
     <li class="itemize">MySQL version <span 
class="cmsy-10x-x-109">&#x2265; </span>4.1<span class="footnote-mark"><a 
href="DocMain15.html#fn10x0"><sup class="textsuperscript">10</sup></a></span><a 
 id="x8-7002f10"></a></li></ul>
<!--l. 52--><p class="noindent" >If both are supported, you stand a good chance of porting the system.
<!--l. 54--><p class="indent" >   Furthermore, the external Perl modules (listed in <a 
href="DocMainse10.html#x43-19100010.3">10.3<!--tex4ht:ref: extmods --></a>) should be verified to work on the new
platform.
<!--l. 73--><p class="indent" >   Perl modules are most easily installed using the Perl CPAN automated system<br 
class="newline" />(<span 
class="ectt-1095">perl -MCPAN -e shell</span>).
<!--l. 77--><p class="indent" >The following external programs are optional; they will be used automatically if they are
installed on your system:
     <ul class="itemize1">
     <li class="itemize">antiword (parsing MSWord files)
     </li>
     <li class="itemize">detex (parsing TeX files)
     </li>
     <li class="itemize">pdftohtml (parsing PDF files)
     </li>
     <li class="itemize">pstotext (parsing PS and PDF files, needs ghostview)

     </li>
     <li class="itemize">xlhtml (parsing MSExcel files)
     </li>
     <li class="itemize">ppthtml (parsing MSPowerPoint files)
     </li>
     <li class="itemize">unrtf (parsing RTF files)
     </li>
     <li class="itemize">tth (parsing TeX files)
     </li>
     <li class="itemize">untex (parsing TeX files)</li></ul>
<!--l. 91--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.1.3   </span> <a 
 id="x8-80002.1.3"></a>Automated Debian/Ubuntu installation</h5>
     <ul class="itemize1">
     <li class="itemize">Add the following line to your /etc/apt/sources.list:<br 
class="newline" /><span 
class="ectt-1095">deb http://combine.it.lth.se/ debian/</span>
     </li>
     <li class="itemize">Give the commands:<br 
class="newline" /><span 
class="ectt-1095">apt-get update</span><br 
class="newline" /><span 
class="ectt-1095">apt-get install combine</span></li></ul>
<!--l. 100--><p class="noindent" >This also installs all dependencies, such as MySQL and a number of required Perl modules.
<!--l. 103--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.1.4   </span> <a 
 id="x8-90002.1.4"></a>Manual installation</h5>
<!--l. 105--><p class="noindent" >Download the latest distribution<span class="footnote-mark"><a 
href="DocMain16.html#fn11x0"><sup class="textsuperscript">11</sup></a></span><a 
 id="x8-9001f11"></a>.
<!--l. 107--><p class="indent" >   Install all software that Combine depends on (see above).
<!--l. 109--><p class="indent" >   Unpack the archive with <span 
class="ectt-1095">tar zxf </span><br 
class="newline" />This will create a directory named <span 
class="ectt-1095">combine-XX </span>with a number of subdirectories including <span 
class="ectt-1095">bin,</span>
<span 
class="ectt-1095">Combine, doc, and conf</span>.
<!--l. 113--><p class="indent" >   &#8217;<span 
class="ectt-1095">bin</span>&#8217; contains the executable programs.
<!--l. 115--><p class="indent" >   &#8217;<span 
class="ectt-1095">Combine</span>&#8217; contains needed Perl modules. They should be copied to where Perl will find them,
typically <span 
class="ectt-1095">/usr/share/perl5/Combine/</span>.
<!--l. 118--><p class="indent" >   &#8217;<span 
class="ectt-1095">conf</span>&#8217; contains the default configuration files. Combine looks for them in <span 
class="ectt-1095">/etc/combine/ </span>so
they need to be copied there.
<!--l. 121--><p class="indent" >   &#8217;<span 
class="ectt-1095">doc</span>&#8217; contains documentation.
<!--l. 123--><p class="indent" >   The following command sequence will install Combine:

   <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
perl&#x00A0;Makefile.PL
&#x00A0;<br />make
&#x00A0;<br />make&#x00A0;test
&#x00A0;<br />make&#x00A0;install
&#x00A0;<br />mkdir&#x00A0;/etc/combine
&#x00A0;<br />cp&#x00A0;conf/*&#x00A0;/etc/combine/
&#x00A0;<br />mkdir&#x00A0;/var/run/combine
</div>
</td></tr></table>
<!--l. 132--><p class="nopar" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.1.5   </span> <a 
 id="x8-100002.1.5"></a>Out-of-the-box installation test</h5>
<!--l. 135--><p class="noindent" >A simple way to test your newly installed Combine system is to crawl just one Web-page and
export it as an XML-document. This exercises much of the code and verifies that basic
focused crawling works.
     <ul class="itemize1">
     <li class="itemize">Initialize a crawl-job named aatest. This will create and populate the job-specific
     configuration directory and create the MySQL database that will hold the records:</li></ul>

<table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
sudo&#x00A0;combineINIT&#x00A0;--jobname&#x00A0;aatest&#x00A0;--topic&#x00A0;/etc/combine/Topic_carnivor.txt
</div>
</td></tr></table>
<!--l. 146--><p class="nopar" >
     <ul class="itemize1">
     <li class="itemize">Harvest the test URL by:</li></ul>

<table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
combine&#x00A0;--jobname&#x00A0;aatest
&#x00A0;<br />&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;&#x00A0;--harvest&#x00A0;http://combine.it.lth.se/CombineTests/InstallationTest.html
</div>
</td></tr></table>
<!--l. 153--><p class="nopar" >
     <ul class="itemize1">
     <li class="itemize">Export a structured Dublin Core record by:</li></ul>

<table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
combineExport&#x00A0;--jobname&#x00A0;aatest&#x00A0;--profile&#x00A0;dc
</div>
</td></tr></table>
<!--l. 159--><p class="nopar" >
     <ul class="itemize1">
     <li class="itemize">and verify that the output, except for dates and order, looks like:</li></ul>

<table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
&#x003C;?xml&#x00A0;version="1.0"&#x00A0;encoding="UTF-8"?&#x003E;
&#x00A0;<br />&#x003C;documentCollection&#x00A0;version="1.1"&#x00A0;xmlns:dc="http://purl.org/dc/elements/1.1/"&#x003E;
&#x00A0;<br />&#x003C;metadata&#x00A0;xmlns:dc="http://purl.org/dc/elements/1.1/"&#x003E;
&#x00A0;<br />&#x003C;dc:format&#x003E;text/html&#x003C;/dc:format&#x003E;
&#x00A0;<br />&#x003C;dc:format&#x003E;text/html;&#x00A0;charset=iso-8859-1&#x003C;/dc:format&#x003E;
&#x00A0;<br />&#x003C;dc:subject&#x003E;Carnivorous&#x00A0;plants&#x003C;/dc:subject&#x003E;
&#x00A0;<br />&#x003C;dc:subject&#x003E;Drosera&#x003C;/dc:subject&#x003E;
&#x00A0;<br />&#x003C;dc:subject&#x003E;Nepenthes&#x003C;/dc:subject&#x003E;
&#x00A0;<br />&#x003C;dc:title&#x00A0;transl="yes"&#x003E;Installation&#x00A0;test&#x00A0;for&#x00A0;Combine&#x003C;/dc:title&#x003E;
&#x00A0;<br />&#x003C;dc:description&#x003E;&#x003C;/dc:description&#x003E;
&#x00A0;<br />&#x003C;dc:date&#x003E;2006-05-19&#x00A0;9:57:03&#x003C;/dc:date&#x003E;
&#x00A0;<br />&#x003C;dc:identifier&#x003E;http://combine.it.lth.se/CombineTests/InstallationTest.html&#x003C;/dc:identifier&#x003E;
&#x00A0;<br />&#x003C;dc:language&#x003E;en&#x003C;/dc:language&#x003E;
&#x00A0;<br />&#x003C;/metadata&#x003E;
</div>
</td></tr></table>
<!--l. 178--><p class="nopar" >
<!--l. 180--><p class="indent" >   Or run &#8211; as root &#8211; the script <span 
class="ectt-1095">./doc/InstallationTest.pl </span>(see <a 
href="DocMainse11.html#x45-194000A.1">A.1<!--tex4ht:ref: InstTest --></a> in the Appendix) which
essentially does the same thing.
<!--l. 185--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">2.2   </span> <a 
 id="x8-110002.2"></a>Getting started</h4>
<!--l. 187--><p class="noindent" >A simple example work-flow for a trivial crawl job named &#8217;aatest&#8217; might look like:
<!--l. 189--><p class="indent" >
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x8-11002x1">Initialize database and configuration (needs root privileges)<br 
class="newline" /><span 
class="ectt-1095">sudo combineINIT </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest</span>
     </li>
     <li 
  class="enumerate" id="x8-11004x2"><a 
 id="x8-110032"></a> Load some seed URLs (you can repeat this command with different URLs as
     many times as you wish)<br 
class="newline" /><span 
class="ectt-1095">echo &#8217;http://combine.it.lth.se/&#8217; | combineCtrl load </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest</span>
     </li>
     <li 
  class="enumerate" id="x8-11006x3"><a 
 id="x8-110053"></a> Start 2 harvesting processes<br 
class="newline" /><span 
class="ectt-1095">combineCtrl start </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">harvesters 2</span>
     </li>
     <li 
  class="enumerate" id="x8-11008x4">Let it run for some time. Status and progress can be checked using the program
     &#8217;<span 
class="ectt-1095">combineCtrl </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest</span>&#8217; with various parameters.
     </li>
     <li 
  class="enumerate" id="x8-11010x5">When satisfied, kill the crawlers<br 
class="newline" /><span 
class="ectt-1095">combineCtrl kill </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest</span>

     </li>
     <li 
  class="enumerate" id="x8-11012x6">Export data records in the ALVIS XML format<br 
class="newline" /><span 
class="ectt-1095">combineExport </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">profile alvis</span>
     </li>
     <li 
  class="enumerate" id="x8-11014x7">If you want to schedule a recheck for all the crawled pages stored in the database do<br 
class="newline" /><span 
class="ectt-1095">combineCtrl reharvest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname aatest</span>
     </li>
     <li 
  class="enumerate" id="x8-11016x8">Go back to <a 
href="#x8-110053">3<!--tex4ht:ref: crawl --></a> for continuous operation.</li></ol>
<!--l. 211--><p class="indent" >   Once a job is initialized it is controlled using <span 
class="ectt-1095">combineCtrl</span>. Crawled data is exported using
<span 
class="ectt-1095">combineExport</span>.
<!--l. 214--><p class="noindent" >
   <h4 class="subsectionHead"><span class="titlemark">2.3   </span> <a 
 id="x8-120002.3"></a>Online documentation</h4>
<!--l. 215--><p class="noindent" >The latest and most detailed documentation is always available
online<span class="footnote-mark"><a 
href="DocMain17.html#fn12x0"><sup class="textsuperscript">12</sup></a></span><a 
 id="x8-12001f12"></a>.
   <h4 class="subsectionHead"><span class="titlemark">2.4   </span> <a 
 id="x8-130002.4"></a>Use scenarios</h4>
<!--l. 219--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.4.1   </span> <a 
 id="x8-140002.4.1"></a>General crawling without restrictions</h5>
<!--l. 220--><p class="noindent" >Use the same procedure as in section <a 
href="#x8-110002.2">2.2<!--tex4ht:ref: gettingstarted --></a>. This way of crawling is not recommended for the
Combine system, since it will generate very large databases without any focus.
<!--l. 224--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.4.2   </span> <a 
 id="x8-150002.4.2"></a>Focused crawling &#8211; domain restrictions</h5>
<!--l. 226--><p class="noindent" >Create a focused database with all pages from a Web-site. In this use scenario we will crawl the
Combine site and the ALVIS site. The database is to be continuously updated, i.e. all pages have
to be regularly tested for changes, deleted pages should be removed from the database, and
newly created pages added.
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x8-15002x1">Initialize database and configuration<br 
class="newline" /><span 
class="ectt-1095">sudo combineINIT </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname focustest</span>
     </li>
     <li 
  class="enumerate" id="x8-15004x2">Edit the configuration to provide the desired focus<br 
class="newline" />Change the <span 
class="ectt-1095">&#x003C;allow&#x003E; </span>part in <span 
class="ectt-1095">/etc/combine/focustest/combine.cfg </span>from

     <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
     #use&#x00A0;either&#x00A0;URL&#x00A0;or&#x00A0;HOST:&#x00A0;(obs&#x00A0;&#8217;:&#8217;)&#x00A0;to&#x00A0;match&#x00A0;regular&#x00A0;expressions&#x00A0;to&#x00A0;either&#x00A0;the
     &#x00A0;<br />#full&#x00A0;URL&#x00A0;or&#x00A0;the&#x00A0;HOST&#x00A0;part&#x00A0;of&#x00A0;a&#x00A0;URL.
     &#x00A0;<br />&#x003C;allow&#x003E;
     &#x00A0;<br />#Allow&#x00A0;crawl&#x00A0;of&#x00A0;URLs&#x00A0;or&#x00A0;hostnames&#x00A0;that&#x00A0;matches&#x00A0;these&#x00A0;regular&#x00A0;expressions
     &#x00A0;<br />HOST:&#x00A0;.*$
     &#x00A0;<br />&#x003C;/allow&#x003E;
</div>
     </td></tr></table>
     <!--l. 244--><p class="nopar" > to<br 
class="newline" />

     <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
     #use&#x00A0;either&#x00A0;URL&#x00A0;or&#x00A0;HOST:&#x00A0;(obs&#x00A0;&#8217;:&#8217;)&#x00A0;to&#x00A0;match&#x00A0;regular&#x00A0;expressions&#x00A0;to&#x00A0;either&#x00A0;the
     &#x00A0;<br />#full&#x00A0;URL&#x00A0;or&#x00A0;the&#x00A0;HOST&#x00A0;part&#x00A0;of&#x00A0;a&#x00A0;URL.
     &#x00A0;<br />&#x003C;allow&#x003E;
     &#x00A0;<br />#Allow&#x00A0;crawl&#x00A0;of&#x00A0;URLs&#x00A0;or&#x00A0;hostnames&#x00A0;that&#x00A0;matches&#x00A0;these&#x00A0;regular&#x00A0;expressions
     &#x00A0;<br />HOST:&#x00A0;www\.alvis\.info$
     &#x00A0;<br />HOST:&#x00A0;combine\.it\.lth\.se$
     &#x00A0;<br />&#x003C;/allow&#x003E;
</div>
     </td></tr></table>
     <!--l. 254--><p class="nopar" > The escaping of &#8217;.&#8217; by writing &#8217;<span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">\.</span></span></span>&#8217; is necessary since the patterns actually are Perl regular
     expressions. Similarly the ending &#8217;$&#8217; indicates that the host string should end here, so
     for example a Web server on <span 
class="ectt-1095">www.alvis.info.com </span>(if such exists) will not be
     crawled.
     </li>
     <li 
  class="enumerate" id="x8-15006x3">Load seed URLs<br 
class="newline" /><span 
class="ectt-1095">echo &#8217;http://combine.it.lth.se/&#8217; | combineCtrl load </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname</span><span 
class="ectt-1095">&#x00A0;focustest</span><br 
class="newline" /><span 
class="ectt-1095">echo &#8217;http://www.alvis.info/&#8217; | combineCtrl load </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname focustest</span>
     </li>
     <li 
  class="enumerate" id="x8-15008x4">Start 1 harvesting process<br 
class="newline" /><span 
class="ectt-1095">combineCtrl start </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname focustest</span>
     </li>
     <li 
  class="enumerate" id="x8-15010x5">Daily export all data records in the ALVIS XML format<br 
class="newline" /><span 
class="ectt-1095">combineExport </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname focustest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">profile alvis</span><br 
class="newline" />and schedule all pages for re-harvesting<br 
class="newline" /><span 
class="ectt-1095">combineCtrl reharvest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname focustest</span></li></ol>
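<p class="indent" >   The effect of such &#x003C;allow&#x003E; patterns can be checked outside the crawler. The sketch below uses grep -E as a stand-in for Perl's regular-expression matching (the result is the same for these particular patterns): the escaped dots and the trailing &#8217;$&#8217; accept the two configured hosts but reject a look-alike host.

```shell
# Which host names do the two <allow> patterns accept?
# grep -E stands in for Perl's regex engine; same result for these patterns.
for host in www.alvis.info combine.it.lth.se www.alvis.info.com; do
  if echo "$host" | grep -qE 'www\.alvis\.info$|combine\.it\.lth\.se$'; then
    echo "$host: allowed"
  else
    echo "$host: rejected"
  fi
done
# prints:
#   www.alvis.info: allowed
#   combine.it.lth.se: allowed
#   www.alvis.info.com: rejected
```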
<!--l. 275--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.4.3   </span> <a 
 id="x8-160002.4.3"></a>Focused crawling &#8211; topic specific</h5>
<!--l. 277--><p class="noindent" >Create and maintain a topic specific crawled database for the topic &#8217;Carnivorous plants&#8217;.
<!--l. 279--><p class="indent" >
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x8-16002x1">Create a topic definition (see section <a 
href="DocMainse4.html#x19-310004.5.1">4.5.1<!--tex4ht:ref: topicdef --></a>) in a local file named <span 
class="ectt-1095">cpTopic.txt</span>. (This can
     be done by copying <span 
class="ectt-1095">/etc/combine/Topic_carnivor.txt</span>, since it happens to be just
     that.)
     </li>
     <li 
  class="enumerate" id="x8-16004x2">Create a file named <span 
class="ectt-1095">cpSeedURLs.txt </span>with seed URLs for this topic, containing the
     URLs:

     <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
     http://www.sarracenia.com/faq.html
     &#x00A0;<br />http://dmoz.org/Home/Gardening/Plants/Carnivorous_Plants/
     &#x00A0;<br />http://www.omnisterra.com/bot/cp_home.cgi
     &#x00A0;<br />http://www.vcps.au.com/
     &#x00A0;<br />http://www.murevarn.se/links.html
</div>
     </td></tr></table>
     <!--l. 290--><p class="nopar" >
     </li>
     <li 
  class="enumerate" id="x8-16006x3">Initialization<br 
class="newline" /><span 
class="ectt-1095">sudo combineINIT </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname cptest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">topic cpTopic.txt</span>
     <!--l. 296--><p class="noindent" >This enables topic checking and focused crawl mode by setting configuration
     variable <span 
class="ectt-1095">doCheckRecord = 1 </span>and copying a topic definition file (<span 
class="ectt-1095">cpTopic.txt</span>)
     to<br 
class="newline" /><span 
class="ectt-1095">/etc/combine/cptest/topicdefinition.txt</span>.
     </li>
     <li 
  class="enumerate" id="x8-16008x4">Load seed URLs<br 
class="newline" /><span 
class="ectt-1095">combineCtrl load </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname cptest &#x003C; cpSeedURLs.txt</span>
     </li>
     <li 
  class="enumerate" id="x8-16010x5">Start 3 harvesting processes<br 
class="newline" /><span 
class="ectt-1095">combineCtrl start </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname cptest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">harvesters 3</span>
     </li>
     <li 
  class="enumerate" id="x8-16012x6">Regularly export all data records in the ALVIS XML format<br 
class="newline" /><span 
class="ectt-1095">combineExport </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">jobname cptest </span><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">--</span></span></span><span 
class="ectt-1095">profile alvis</span><br 
class="newline" />
     </li></ol>
<!--l. 313--><p class="indent" >   Running this crawler for an extended period will result in more than 200&#x00A0;000
records.
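<p class="indent" >   The regular export and re-harvesting could be automated from cron. A possible crontab fragment (the schedule and output path are assumptions, not taken from this manual; note that &#8217;%&#8217; must be escaped in crontab entries):

```
# m  h  dom mon dow  command
0   2  *   *   *    combineExport --jobname cptest --profile alvis > /var/tmp/cptest-$(date +\%F).xml
30  2  *   *   *    combineCtrl reharvest --jobname cptest
```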
<!--l. 316--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.4.4   </span> <a 
 id="x8-170002.4.4"></a>Focused crawling in an Alvis system</h5>
<!--l. 317--><p class="noindent" >Use the same procedure as in section <a 
href="#x8-160002.4.3">2.4.3<!--tex4ht:ref: topicfocus --></a> (Focused crawling &#8211; topic specific) except for the last
point. Exporting should be done incrementally into an Alvis pipeline (in this example listening at
port 3333 on the machine nlp.alvis.info):<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">combineExport</span><span 
class="ectt-1095">&#x00A0;--jobname</span><span 
class="ectt-1095">&#x00A0;cptest</span><span 
class="ectt-1095">&#x00A0;--pipehost</span><span 
class="ectt-1095">&#x00A0;nlp.alvis.info</span><span 
class="ectt-1095">&#x00A0;--pipeport</span><span 
class="ectt-1095">&#x00A0;3333</span><span 
class="ectt-1095">&#x00A0;--incremental</span></span></span>

<!--l. 322--><p class="noindent" >
   <h5 class="subsubsectionHead"><span class="titlemark">2.4.5   </span> <a 
 id="x8-180002.4.5"></a>Crawl one entire site and its outlinks</h5>
<!--l. 323--><p class="noindent" >This scenario requires the crawler to:
     <ul class="itemize1">
     <li class="itemize">crawl an entire target site
     </li>
     <li class="itemize">crawl all the outlinks from the site
     </li>
     <li class="itemize">crawl no other site or URL apart from external URLs mentioned on the one target
     site</li></ul>
<!--l. 331--><p class="indent" >   That is, all of <span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">http://my.targetsite.com/*</span></span></span>, plus any other URL that is linked to from a page in
<span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">http://my.targetsite.com/*</span></span></span>.
<!--l. 335--><p class="indent" >
     <ol  class="enumerate1" >
     <li 
  class="enumerate" id="x8-18002x1">Configure Combine to crawl this one site only. Change the <span 
class="ectt-1095">&#x003C;allow&#x003E; </span>part in<br 
class="newline" /><span 
class="ectt-1095">/etc/combine/XXX/combine.cfg </span>to

     <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
     #use&#x00A0;either&#x00A0;URL&#x00A0;or&#x00A0;HOST:&#x00A0;(obs&#x00A0;&#8217;:&#8217;)&#x00A0;to&#x00A0;match&#x00A0;regular&#x00A0;expressions&#x00A0;to&#x00A0;either&#x00A0;the
     &#x00A0;<br />#full&#x00A0;URL&#x00A0;or&#x00A0;the&#x00A0;HOST&#x00A0;part&#x00A0;of&#x00A0;a&#x00A0;URL.
     &#x00A0;<br />&#x003C;allow&#x003E;
     &#x00A0;<br />#Allow&#x00A0;crawl&#x00A0;of&#x00A0;URLs&#x00A0;or&#x00A0;hostnames&#x00A0;that&#x00A0;matches&#x00A0;these&#x00A0;regular&#x00A0;expressions
     &#x00A0;<br />HOST:&#x00A0;my\.targetsite\.com$
     &#x00A0;<br />&#x003C;/allow&#x003E;
</div>
     </td></tr></table>
     <!--l. 346--><p class="nopar" >
     </li>
     <li 
  class="enumerate" id="x8-18004x2">Crawl until you have the entire site (if it&#8217;s a big site you might want to make the
     changes suggested in FAQ no. <a 
href="DocMainse8.html#x41-690137">7<!--tex4ht:ref: slowcrawl --></a>).
     </li>
     <li 
  class="enumerate" id="x8-18006x3">Stop crawling.
     </li>
     <li 
  class="enumerate" id="x8-18008x4">Change configuration <span 
class="ectt-1095">&#x003C;allow&#x003E; </span>back to allow crawling of any domain (which is the
     default).

     <table 
class="verbatim"><tr class="verbatim"><td 
class="verbatim"><div class="verbatim">
     &#x003C;allow&#x003E;
     &#x00A0;<br />#Allow&#x00A0;crawl&#x00A0;of&#x00A0;URLs&#x00A0;or&#x00A0;hostnames&#x00A0;that&#x00A0;matches&#x00A0;these&#x00A0;regular&#x00A0;expressions
     &#x00A0;<br />HOST:&#x00A0;.*$
     &#x00A0;<br />&#x003C;/allow&#x003E;
</div>
     </td></tr></table>
     <!--l. 361--><p class="nopar" >
     </li>
     <li 
  class="enumerate" id="x8-18010x5">Schedule all links in the database for crawling, something like (change XXX to your
     jobname)<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">echo</span><span 
class="ectt-1095">&#x00A0;&#8217;select</span><span 
class="ectt-1095">&#x00A0;urlstr</span><span 
class="ectt-1095">&#x00A0;from</span><span 
class="ectt-1095">&#x00A0;urls;&#8217;</span><span 
class="ectt-1095">&#x00A0;|</span><span 
class="ectt-1095">&#x00A0;mysql</span><span 
class="ectt-1095">&#x00A0;-u</span><span 
class="ectt-1095">&#x00A0;combine</span><span 
class="ectt-1095">&#x00A0;XXX</span></span></span><br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;</span><span 
class="ectt-1095">&#x00A0;|</span><span 
class="ectt-1095">&#x00A0;combineCtrl</span><span 
class="ectt-1095">&#x00A0;load</span><span 
class="ectt-1095">&#x00A0;--jobname</span><span 
class="ectt-1095">&#x00A0;XXX</span></span></span>
     </li>
     <li 
  class="enumerate" id="x8-18012x6">Change configuration to disable automatic recycling of links:<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">#Enable(1)/disable(0)</span><span 
class="ectt-1095">&#x00A0;automatic</span><span 
class="ectt-1095">&#x00A0;recycling</span><span 
class="ectt-1095">&#x00A0;of</span><span 
class="ectt-1095">&#x00A0;new</span><span 
class="ectt-1095">&#x00A0;links</span></span></span><br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">AutoRecycleLinks</span><span 
class="ectt-1095">&#x00A0;=</span><span 
class="ectt-1095">&#x00A0;0</span></span></span>
<!--l. 371--><p class="noindent" >and maybe (depending on your other requirements) change:<br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">#User</span><span 
class="ectt-1095">&#x00A0;agent</span><span 
class="ectt-1095">&#x00A0;handles</span><span 
class="ectt-1095">&#x00A0;redirects</span><span 
class="ectt-1095">&#x00A0;(1)</span><span 
class="ectt-1095">&#x00A0;or</span><span 
class="ectt-1095">&#x00A0;treat</span><span 
class="ectt-1095">&#x00A0;redirects</span><span 
class="ectt-1095">&#x00A0;as</span><span 
class="ectt-1095">&#x00A0;new</span><span 
class="ectt-1095">&#x00A0;links</span><span 
class="ectt-1095">&#x00A0;(0)</span></span></span><br 
class="newline" /><span class="obeylines-h"><span class="verb"><span 
class="ectt-1095">UserAgentFollowRedirects</span><span 
class="ectt-1095">&#x00A0;=</span><span 
class="ectt-1095">&#x00A0;0</span></span></span>
     </li>
     <li 
  class="enumerate" id="x8-18014x7">Start crawling and run until the queue is empty.</li></ol>
<!--l. 1--><div class="crosslinks"><p class="noindent">[<a 
href="DocMainse1.html" >prev</a>] [<a 
href="DocMainse1.html#tailDocMainse1.html" >prev-tail</a>] [<a 
href="DocMainse2.html" >front</a>] [<a 
href="DocMainpa1.html#DocMainse5.html" >up</a>] </p></div>
<!--l. 1--><p class="indent" >   <a 
 id="tailDocMainse2.html"></a>    
</body></html>