
NAME

X-Search -- Automated Web Searching and Search History Indexing

SYNOPSIS

use WWW::Search;
X-Search [optional configuration file name/path argument]

Search commands are read from a configuration file.

DESCRIPTION

X-Search reads a series of search commands from a plain text configuration file, retrieves the results from the specified search engine, and stores them in individual dated files named qid/YYYYMMDD.html. Each such file is a detailed web page record of the search results for that day's date. Summaries of each search (if you have print summaries turned on), as well as a link history to each qid/YYYYMMDD.html file, are maintained in one index.html file.

Any new search results for a 24-hour period are written to both the qid/YYYYMMDD.html and index.html files. If qid/YYYYMMDD.html already exists with previous search results for the date, newer results are appended to it in chronological order. If there is nothing new, nothing is written.
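As a rough illustration of the file naming described above, the sketch below builds a qid/YYYYMMDD.html path for the current date. The helper name dated_page_path and its arguments are hypothetical and are not taken from the X-Search source.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper (not from X-Search itself): build
# "<index dir>/<qid>/YYYYMMDD.html" for the given time, or for now.
sub dated_page_path {
    my ($index_dir, $qid, $time) = @_;
    $time = time unless defined $time;
    my $stamp = strftime('%Y%m%d', localtime($time));
    return join '/', $index_dir, $qid, "$stamp.html";
}

print dated_page_path('./xsearch', 'parallel'), "\n";
```

Opening such a file in append mode ('>>') would give the append behavior described above, since later results for the same date land at the end of the existing page.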

X-Search stores the URLs from search results in a data file so that it can track what it has already seen. This ensures that the results of subsequent searches are unique. It also lets you copy blocks of undesirable URLs into this file to prevent X-Search from recording them if they are ever encountered in a future search (filtering).
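The tracking idea can be sketched as follows. This is only an illustration: it assumes the data file holds one URL per line, which this document does not specify, and the helper names load_seen and record_new are hypothetical.

```perl
use strict;
use warnings;

# Assumption: the data file stores one URL per line.
# Load previously recorded URLs into a hash for fast lookup.
sub load_seen {
    my ($file) = @_;
    my %seen;
    return \%seen unless -e $file;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line} = 1 if length $line;
    }
    close $fh;
    return \%seen;
}

# Return only the URLs not seen before and append them to the data file.
sub record_new {
    my ($file, $seen, @urls) = @_;
    my @new = grep { !$seen->{$_}++ } @urls;
    if (@new) {
        open my $fh, '>>', $file or die "Cannot append to $file: $!";
        print {$fh} "$_\n" for @new;
        close $fh;
    }
    return @new;
}
```

With this layout, filtering amounts to pasting unwanted URLs into the file by hand: load_seen then treats them as already seen and record_new never reports them.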

X-Search is ideal for maintaining records of frequent news events and can safely be run as many times a day as desired to find new items that match the user's search requirements. For instance, you could track any number of newsgroups three times daily for new posts by passing the search option "groups=". In the option field of the configuration file you could insert |groups=alt.some.group| or, to search all groups related to perl, |groups=*perl*|

X-Search is ideal for web sites that want to present their users with detailed, dated summaries of specific, frequently changing topics from around the web. Users see the most recent additions on a subject in chronological order. X-Search also makes a good research tool for tracking and indexing the latest additions.

X-Search allows the option of verifying each URL to determine whether it is valid. Any URLs found not valid (for example moved or not found) are ignored.
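One way such a check could work is sketched below. It uses the core HTTP::Tiny module with redirects disabled, so a "moved" answer is not silently followed. Which HTTP client X-Search actually uses is not stated here, so treat the module choice and the helper names status_is_valid and check_url as assumptions.

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical rule: only a 2xx status counts as valid, so redirects
# ("moved", 3xx), "not found" (404) and connection failures are rejected.
sub status_is_valid {
    my ($status) = @_;
    return ($status >= 200 && $status < 300) ? 1 : 0;
}

# Issue a HEAD request without following redirects and classify the result.
sub check_url {
    my ($url) = @_;
    my $http = HTTP::Tiny->new(max_redirect => 0, timeout => 10);
    my $res  = $http->head($url);
    return status_is_valid($res->{status});
}
```

A caller would keep a result only when check_url returns true. Each check costs a network round trip, which is consistent with the note under $ck_url that verification can slow a search down.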

X-Search gives you a lot of flexibility for use in all sorts of neat applications.

(See the REMOTE ADMINISTRATION section for remote operation via a web browser.)

CONFIGURATION FILE

X-Search is controlled by a configuration file. This file can have any name you want. There are two ways to tell X-Search which configuration file to use and where it is.

Method 1: Simply define $qconfig in the script to point to the configuration file. By default it looks for a file called "query.ini" located in the same directory as X-Search, which should be fine for most users.

Method 2: Command line argument defining path and name of the configuration file. Example:

X-Search /home/xsearch/search.conf

X-Search would read /home/xsearch/search.conf for its search commands. This makes it easy to use multiple configuration files for different search setups.
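The two methods can be combined in a few lines: use the command line argument when one is given, otherwise fall back to the default. This is a minimal sketch and the helper name config_path is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: the first argument wins; otherwise use the
# documented default "query.ini".
sub config_path {
    my (@args) = @_;
    return (@args && length($args[0])) ? $args[0] : 'query.ini';
}

my $qconfig = config_path(@ARGV);
print "Reading search commands from $qconfig\n";
```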

This file is read to get the following user defined search commands:

  1) The WWW::Search backend to use for the search
  2) A nice name or description for the search topic, to be
     printed within the web pages. This is like a headline.
  3) The query search words for the search, separated by spaces
  4) Any search options to pass to the search engine. This is
     optional and can be left blank.
  5) Max results to return
  6) The qid (query information directory), the directory name
     to create to store dated web pages created from the search.

A typical configuration file would have one or more lines that follow this structure:

SEARCH ENGINE|SEARCH NAME|SEARCH WORDS|OPTIONS|MAX_TO_RETURN|QID|

The individual values are separated with a | and a | must be found at the end of each line. There is no limit to how many searches you can define in the configuration file, but you may want to keep it reasonable. To aid in managing multiple searches, there is the option of turning summaries in index.html on or off.

Here is a sample of what a typical configuration file should look like:

------cut------------------

HotBot|Military|tank armor|RD=DM&Domain=.mil|40|tanks|
Google|Tech News|parallel processing||200|parallel|
Excite::News|News From Home|Palm Springs California||100|myhome|
AltaVista|AZ Fishing|arizona lakes fishing||60|lakes|

--------end-----------------

The Google command line would search the engine Google, print a nice list heading titled "Tech News", search and display results pertaining to "parallel processing", with no options, return a max of 200 results and store the dated search history pages in a directory called "parallel".
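To make the line format concrete, here is a minimal sketch of how one command line could be split into its six fields. The field names and the helper parse_command are hypothetical. Only the order of the fields and the trailing | come from the description above.

```perl
use strict;
use warnings;

# Hypothetical field names, in the documented order.
my @FIELDS = qw(engine name words options max qid);

# Split one configuration line into a hash reference, or return undef
# if the line is blank, lacks the trailing '|', or has the wrong field count.
sub parse_command {
    my ($line) = @_;
    chomp $line;
    return if $line =~ /^\s*$/;
    return unless $line =~ s/\|\s*$//;       # a '|' must end each line
    my @values = split /\|/, $line, -1;      # -1 keeps empty fields
    return unless @values == @FIELDS;
    my %cmd;
    @cmd{@FIELDS} = @values;
    return \%cmd;
}

my $cmd = parse_command('Google|Tech News|parallel processing||200|parallel|');
print "$cmd->{engine}: up to $cmd->{max} results, stored under $cmd->{qid}\n";
```

The -1 limit passed to split matters here: without it, an empty options field could shift or drop values when it sits near the end of the line.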

Obviously, you want to define different qid names for your different searches so that hot dog searches don't end up mixed with apple searches. At the same time, you can deliberately merge different searches into one dated qid file by giving them the same qid. This is up to each user to decide.

Note About Options

Multiple option pairs must be separated with '&'. See the HotBot search example above.
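For illustration, an options field such as the HotBot one can be broken into key/value pairs like this. The helper parse_options is hypothetical. WWW::Search's native_query method accepts an options hash reference as its second argument, but whether X-Search hands the options over in exactly this way is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "RD=DM&Domain=.mil" into a hash reference.
# An empty or undefined field yields an empty hash.
sub parse_options {
    my ($field) = @_;
    my %opts;
    for my $pair (split /&/, defined $field ? $field : '') {
        my ($key, $value) = split /=/, $pair, 2;
        next unless defined $key && length $key;
        $opts{$key} = defined $value ? $value : '';
    }
    return \%opts;
}

my $opts = parse_options('RD=DM&Domain=.mil');
print "$_ => $opts->{$_}\n" for sort keys %$opts;
```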

Using the Administration Form built into X-Search makes all the above much easier to manage remotely from a browser.

REMOTE ADMINISTRATION

X-Search can be run and configured remotely on a server via its Administration form. This allows one to: a) remotely edit/add/remove search commands, and b) remotely execute X-Search manually in the event you do not need, or do not have access to, the cron function.

To use X-Search on a remote server you will of course need WWW::Search installed and available on that server. Before uploading X-Search to your server, set the path within the X-Search script to where the index directory will be created; this should be an absolute path under your web root directory. For example, on a RedHat system you would enter "/home/httpd/html/xsearch" (no trailing "/" slash), or some directory name other than "xsearch" if you prefer. You can then use http://myaddress.com/xsearch to access your index.html page.

Once you have uploaded it to your server, you will need to chmod X-Search.pl to 755, as well as the cgi-bin directory itself, for it to work properly under Unix. My cgi-bin directory was not 755 and it did not work right until I changed it to 755.

Win32 users can get away with doing nothing; by default X-Search will just build off the cgi-bin directory without any problems.

If you followed all the above you can then enter admin mode by typing:

http://myaddress.com/cgi-bin/X-Search.pl?admin=show

You should then be presented with the X-Search Administration page.

If you have already created a configuration file, it will be displayed in the text area of the page. If not, the text area will be blank until you add some search commands. Creating search commands for X-Search is pretty easy in admin mode, since I provide a template to fill out that adds the right syntax to the configuration file. More experienced users can use the text area to edit their configuration file directly. You can edit, add and remove command lines remotely this way.

At the bottom of the administration page is a button to run X-Search. This way you can remotely execute X-Search in a timely manner through your browser, say once a week. After the script is completed you can then navigate to the URL address of your X-Search index.html to view any new search additions.

There is also a qid maintenance button that allows for viewing and removing qid directories. Unused directories undoubtedly build up over time and this is a good way to remove them from disk.

AUTO SEARCHING

X-Search can be run from a cron job to automate searching even more.

Example to run X-Search each Monday at 3:00 AM:

    0 3 * * 1 /home/xsearch/X-Search

or if you want to specify a configuration file:

    0 3 * * 1 /home/xsearch/X-Search /home/xsearch/config.conf

CHANGING THE APPEARANCE OF THE WEB PAGES

X-Search web pages are easily customizable by simply changing the html in the subs "print_ihead" and "print_dhead". The sub print_ihead produces the html for the index.html file. You can add whatever body tags you desire, such as background colors, images, fonts, etc. The sub print_dhead controls the html that goes into the qid/YYYYMMDD.html files.

There is also a "print_footer" sub that prints a footer for all the pages, and I ask that my name and e-mail address remain intact if you decide to customize the footer as well. (Publicity is my only payment from this :-)

USER SETTINGS

There are a number of user settings, hard-coded into the script, that control the behavior of X-Search.

$verbose

This just prints messages to the screen while the script is running. This is nice for manual operation but not needed if run by cron.

$ck_url

$ck_url = "1";

$ck_url controls whether URLs are verified as good or bad. 0=No 1=Yes. Setting $ck_url to 1 can slow the search down, depending on how many bad URLs are encountered.

$iDIR

$iDIR = "c:/server/root/html/xsearch";

Full absolute directory path/name where the main index.html file is stored. qid directories will be created below this directory. For manual command line operation you can just define this as "./xsearch" to create a directory named "xsearch" under the directory where you execute X-Search.pl.

REMOTE SERVER CONSIDERATIONS

When running remotely, $iDIR should point to a directory under the web root. For example, on RedHat you should define the path as:

"/home/httpd/html/xsearch"

In this way the URL of your index.html page would be http://myaddress.com/xsearch/index.html. Of course, "xsearch" can be any name you desire for the directory.

$index_url

$index_url = "http://127.0.0.1/xsearch/";

This is optional and is used by remote administration to print a link to your index.html page, so you have a link to click after you have executed X-Search from your browser.

$print_summaries

$print_summaries = "0";

1=Yes 0=No

If you have many search events defined and running, you may want to turn off printing summary results to keep the index.html file size within reason. In that case only links to the detailed qid/YYYYMMDD.html pages will be printed. Turn it on if you want summaries to be displayed in index.html.

$oURLS

$oURLS = "urls.dat";

Defines the name of the URL record file. Without this we are lost.

$qconfig

$qconfig = "query.ini";

Defines the path and name of the query configuration file. This file stores the search commands, such as the engines to use, search string, qid directory, max to return and so forth. You MAY also pass this value as an argument, so you can run multiple configuration files by giving their name and path on the command line.

$host $port

$port = ""; $host = "";

Define a host/port if required (most users don't need to).

$sTEMP

$sTEMP = "TEMP";

This just defines a name for the temporary working file X-Search uses to build an index.html file. No need to mess with it.

CHANGES

Version 1.06

 - Added an admin qid directory maintenance function so users can
   delete unwanted qid directories or simply list what qid directories
   have been created. Minor misc. admin function tweaks, especially
   under Win32.

Version 1.04

 - Added remote administration so one can run X-Search or edit the
   configuration file through their web browser. This allows one
   to run X-Search as a CGI script on their server.

Version 1.03

 - Created a hack to track Dejanews articles properly
 - Added escaping and unescaping of ?'s in URLs because
   they would wreak havoc with my regex, leading to URLs
   being printed over and over

AUTHOR

X-Search was written entirely by Jim Smyser <jsmyser@bigfoot.com>.

BUGS

X-Search has only been tested under RedHat and NT; it is unknown whether it will function under other OSes.

COPYRIGHT

Copyright (c) 2000 by Jim Smyser. All rights reserved.

You may use this program source provided that the above copyright notice and this paragraph are duplicated in all such forms and that any documentation, advertising materials, and other materials related to your distribution of this source code and use acknowledge Jim Smyser as the author/developer.

THIS SOFTWARE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.
