Dezi::Tutorial - getting started with the Dezi search platform
Install the Dezi server from CPAN:
% cpan -i Dezi
Install the Dezi client from CPAN:
% cpan -i Dezi::Client
Start the Dezi server:
% dezi
In a separate terminal, add a small test document to the index:
% echo '<doc><title>bar</title>hello world</doc>' > test.xml
% dezi-client test.xml
Search the index to confirm your test document worked:
% dezi-client -q bar
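If the title or body of a test document comes from somewhere else, it may contain characters such as & or < that would make the XML invalid. The following is a minimal sketch, assuming the same doc/title layout as the test document above. The helper names xml_escape and make_doc are hypothetical and are not part of Dezi.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: escape the characters that are special in XML text.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Hypothetical helper: build a document shaped like the test.xml example.
sub make_doc {
    my ( $title, $body ) = @_;
    return sprintf '<doc><title>%s</title>%s</doc>',
        xml_escape($title), xml_escape($body);
}

# Write the document to test.xml in the current directory.
open my $fh, '>', 'test.xml' or die "Cannot write test.xml: $!";
print {$fh} make_doc( 'bar', 'hello world' ), "\n";
close $fh or die "Cannot close test.xml: $!";
```

The resulting test.xml can be passed to dezi-client exactly as in the quick start.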
The Intermediate tutorial details the specifics behind the Dezi demo available at http://dezi.org/demo.
The Reuters News Corpus for Text Classification (Reuters-21578) is a common document corpus used for information retrieval projects. Other document collections (e.g. the Wikipedia database) have become more popular since the Reuters corpus first appeared, but the Reuters corpus is a nice, medium-sized collection for demonstrating Dezi.
You can find the corpus in many places on the internet. The version used for the demo came from http://svn.peknet.com/search_bench/. The 2xml.pl script at that URL converts the original SGML documents to valid XML and splits them into about 21k individual documents.
Unpack the tar.gz file somewhere and run the 2xml.pl script as described in the script's comments.
As described in Dezi::Architecture, Dezi is based on Swish3 (http://swish-e.org/swish3/). You can index the Reuters corpus with the swish3 command that comes with SWISH::Prog (one of the Dezi dependencies).
First, you'll need a configuration file. Here's the one used for the Dezi demo:
DefaultContents XML*
StoreDescription XML* <text> 10000
PropertyNameAlias swishtitle title
MetaNames dates topics people places orgs author swishdocpath
PropertyNames dates topics people places orgs author dateline
FuzzyIndexingMode Stemming_en1
Save the file as swish.conf.
More details on Swish3 configuration can be found at http://swish-e.org/docs/swish-config.html.
If your Reuters docs are in a directory called reuters, you can create an index with a command like:
% swish3 -c swish.conf -F lucy -f dezi.index -i reuters
You can index all kinds of document types, not just XML, but for the purposes of this tutorial, we'll keep it simple.
Here are the contents of the demo config file, named dezi.config.pl:
{
    engine_config => {
        facets => {
            names => [qw( topics people places orgs author )]
        },
    },
    ui_class => 'Dezi::UI',
    base_uri => 'http://dezi.org/demo',
    username => 'deziuser',
    password => 'a-secret',
}
NOTE that the username/password is there to prevent unwanted modification of the index. Since Dezi supports POST, PUT and DELETE HTTP actions on an index, it's a good idea to protect an index, particularly if it is on the open internet.
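Because an unprotected index can be modified by anyone who can reach the server, it may be worth checking a config file for credentials before starting Dezi. The following is a minimal sketch, assuming the config file evaluates to a hash reference as dezi.config.pl does. The helper name missing_credentials is hypothetical and is not part of Dezi.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: load a Perl config file that evaluates to a hash
# reference and return the names of any credentials that are not set.
sub missing_credentials {
    my ($file) = @_;
    my $path   = File::Spec->rel2abs($file);    # do() needs an explicit path
    my $config = do $path;
    die "Cannot parse $file: $@" if $@;
    die "Cannot read $file: $!"  if !defined $config;
    die "$file did not return a hash reference" if ref $config ne 'HASH';
    return grep { !defined $config->{$_} || !length $config->{$_} }
        qw( username password );
}

# Check each config file named on the command line.
for my $file (@ARGV) {
    my @missing = missing_credentials($file);
    print @missing
        ? "$file: missing @missing\n"
        : "$file: username and password are set\n";
}
```

Run it as perl check_config.pl dezi.config.pl, where check_config.pl is whatever name you saved the sketch under.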
NOTE too that the Dezi::UI class is enabled. That requires a separate installation from CPAN.
% cpan -i Dezi::UI
Start the Dezi server with the demo config file:
% dezi --dezi-config dezi.config.pl
From a separate terminal, you can search the index containing text from the Reuters corpus:
% dezi-client -q 'some words'
Thanks to the Dezi::UI module, you can also search via a web browser. Assuming you are running the demo on a local machine, you can point your browser at http://localhost:5000/ui and explore the index contents graphically.
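The server can also be queried over plain HTTP from a script. The following is a minimal sketch using the core HTTP::Tiny module. It assumes the server exposes a /search path that accepts a q parameter. That path is an assumption here, so check what your server advertises at its root URL. The helper name search_url is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: build a search URL for a Dezi server.
# ASSUMPTION: the server answers GET requests at /search with a q parameter.
sub search_url {
    my ( $server, $query ) = @_;
    my $params = HTTP::Tiny->new->www_form_urlencode( { q => $query } );
    return "$server/search?$params";
}

# When query words are given on the command line, run the search and
# print the raw response body.
if (@ARGV) {
    my $url  = search_url( 'http://localhost:5000', "@ARGV" );
    my $resp = HTTP::Tiny->new->get($url);
    die "Search failed: $resp->{status} $resp->{reason}\n"
        unless $resp->{success};
    print $resp->{content};
}
```

For everyday use dezi-client or Dezi::Client is the simpler route. A raw request is mostly useful for seeing exactly what the server returns.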
You can also index documents over HTTP with a short Dezi::Client script:
% cat indexer.pl
#!/usr/bin/env perl
use strict;
use warnings;
use Dezi::Client;
use File::Find;

my $client = Dezi::Client->new( server => 'http://localhost:5000' );

find(
    {   wanted   => \&add_to_index,
        follow   => 1,
        no_chdir => 1,
    },
    @ARGV
);

my $resp = $client->commit();
print $resp->content;

sub add_to_index {
    my $file = $File::Find::name;

    # we only want .xml files
    return unless $file =~ m/\.xml$/;

    my $resp = $client->index($file);
    if ( !$resp->is_success ) {
        die "Failed to index $file: " . $resp->status_line;
    }
}
In a separate terminal:
% perl indexer.pl path/to/xml/docs
After you're done indexing, look for something:
% dezi-client -q foo
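Before pointing the indexer at a large directory tree, it can help to see which files it would pick up. The following is a minimal sketch that reuses the same File::Find options and .xml filter as indexer.pl but only lists the matches and never contacts the server. The helper name xml_files is hypothetical.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Find;

# Hypothetical helper: return the sorted list of .xml files found under
# the given directories, using the same find() options as indexer.pl.
sub xml_files {
    my (@dirs) = @_;
    my @found;
    find(
        {   wanted => sub {
                # with no_chdir, $_ holds the same path as $File::Find::name
                push @found, $File::Find::name if -f $_ && m/\.xml$/;
            },
            follow   => 1,
            no_chdir => 1,
        },
        @dirs
    );
    return sort @found;
}

# Dry run: list the files and report how many would be indexed.
if (@ARGV) {
    my @files = xml_files(@ARGV);
    print "$_\n" for @files;
    printf "%d file(s) would be indexed\n", scalar @files;
}
```

Run it with the same arguments you would give indexer.pl, for example path/to/xml/docs.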
Peter Karman, <karman at cpan.org>
Please report any bugs or feature requests to bug-dezi at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Dezi. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find this documentation with the perldoc command.
perldoc Dezi::Tutorial
You can also look for information at:
Website
http://dezi.org/
IRC
#dezisearch at freenode
Mailing list
https://groups.google.com/forum/#!forum/dezi-search
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Dezi
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Dezi
CPAN Ratings
http://cpanratings.perl.org/d/Dezi
Search CPAN
http://search.cpan.org/dist/Dezi/
Copyright 2011 Peter Karman.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
Dezi::Client, Search::OpenSearch, SWISH::3, SWISH::Prog::Lucy, Plack, Lucy