
NAME

Search::InvertedIndex - A manager for inverted index maps

SYNOPSIS

   use Search::InvertedIndex;

   my $database = Search::InvertedIndex::DB::DB_File_SplitHash->new({
             -map_name => '/www/search-engine/databases/test-maps/test',
                -multi => 4,
            -file_mode => 0644,
            -lock_mode => 'EX',
         -lock_timeout => 30,
       -blocking_locks => 0,
            -cachesize => 1000000,
        -write_through => 0,
      -read_write_mode => 'RDWR',
   });

   my $inv_map = Search::InvertedIndex->new({ -database => $database });

 ##########################################################
 # Example Update
 ##########################################################

   my $index_data = "Some scalar - complex structure refs are ok";

   my $update = Search::InvertedIndex::Update->new({
       -group => 'keywords',
       -index => 'http://www.nihongo.org/',
        -data => $index_data,
        -keys => {
                   'some' => 10,
                 'scalar' => 20,
                'complex' => 15,
              'structure' => 15,
                   'refs' => 15,
                    'are' => 15,
                     'ok' => 15,
                 },
   });
   my $result = $inv_map->update({ -update => $update });

 ##########################################################
 # Example Query
 # '-nodes' is an anon list of Search::InvertedIndex::Query
 # objects (this allows constructing complex booleans by
 # nesting).
 #
 # '-leafs' is an anon list of Search::InvertedIndex::Query::Leaf
 # objects (used for individual search terms).
 #
 ##########################################################

   my $query_leaf1 = Search::InvertedIndex::Query::Leaf->new({
          -key => 'complex',
        -group => 'keywords',
       -weight => 1,
   });

   my $query_leaf2 = Search::InvertedIndex::Query::Leaf->new({
          -key => 'structure',
        -group => 'keywords',
       -weight => 1,
   });

   my $query_leaf3 = Search::InvertedIndex::Query::Leaf->new({
          -key => 'gold',
        -group => 'keywords',
       -weight => 1,
   });

   my $query1 = Search::InvertedIndex::Query->new({
        -logic => 'and',
       -weight => 1,
        -nodes => [],
        -leafs => [$query_leaf1,$query_leaf2],
   });

   my $query2 = Search::InvertedIndex::Query->new({
        -logic => 'or',
       -weight => 1,
        -nodes => [$query1],
        -leafs => [$query_leaf3],
   });

   # Named differently from the update result above to avoid
   # redeclaring 'my $result' in the same scope.
   my $search_result = $inv_map->search({ -query => $query2 });

 ##########################################################

   $inv_map->close;

DESCRIPTION

Provides the core of an inverted map based search engine. By mapping 'keys' to 'indexes' it provides ultra-fast look ups of all 'indexes' containing specific 'keys'. This produces highly scalable behavior where thousands, or even millions of records can be searched extremely quickly.
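
To make the idea of mapping 'keys' to 'indexes' concrete, here is a minimal pure-Perl sketch of an inverted map. It is an illustration only and does not reflect how the module stores its data; the %map hash and the add_entry and lookup names are assumptions made up for this example.

```perl
use strict;
use warnings;

# Toy inverted map: key => { index => ranking }.
# Illustration only; Search::InvertedIndex stores its data differently.
my %map;

sub add_entry {
    my ($key, $index, $ranking) = @_;
    $map{$key}{$index} = $ranking;
    return;
}

# Return all indexes containing $key, highest ranking first
# (ties broken alphabetically so the order is stable).
sub lookup {
    my ($key) = @_;
    my $hits = $map{$key} || {};
    my @sorted = sort { $hits->{$b} <=> $hits->{$a} || $a cmp $b } keys %$hits;
    return @sorted;
}

add_entry('complex',   'http://www.nihongo.org/', 15);
add_entry('complex',   'http://example.org/',     40);
add_entry('structure', 'http://www.nihongo.org/', 15);

my @hits = lookup('complex');
print join("\n", @hits), "\n";
```

Because the lookup is a single hash access per key, its cost does not depend on how many records do not contain the key, which is the property the description above relies on.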

Available database drivers are:

 Search::InvertedIndex::DB::DB_File_SplitHash
 Search::InvertedIndex::DB::Mysql

Check the POD documentation for each database driver to determine initialization requirements.

CHANGES

 1.00 1999.06.16 - Initial release

 1.01 1999.06.17 - Documentation fixes and fix to 'close' method in
                   Search::InvertedIndex::DB::DB_File_SplitHash

 1.02 1999.06.18 - Major bugfix to locking system.
                   Performance tweaking. Roughly 3x improvement.

 1.03 1999.06.30 - Documentation fixes.

 1.04 1999.07.01 - Documentation fixes and caching system bugfixes.

 1.05 1999.10.20 - Altered ranking computation on search results

 1.06 1999.10.20 - Removed 'use attrs' usage to improve portability

 1.07 1999.11.09 - "Cosmetic" changes to avoid warnings in Perl 5.004

 1.08 2000.01.25 - Bugfix to 'Search::InvertedIndex::DB::DB_File_SplitHash' submodule
                   and documentation additions/fixes

 1.09 2000.03.23 - Bugfix to 'Search::InvertedIndex::DB::DB_File_SplitHash' submodule
                   to manage case where 'open' is not performed before 'close' is called.

 1.10 2000.07.05 - Delayed loading of serializer and added option to select
                   which serializer (Storable or Data::Dumper) to use at instance 'new' time.
                   This should allow module to be loaded by mod_perl via the 'PerlModule'
                   conf directive and enable use on platforms that do not support
                   'Storable' (such as Macintosh).

 1.11 2000.11.29 - Added 'Search::InvertedIndex::DB::Mysql' (authored by
                   Michael Cramer <cramer@webkist.com>) database driver
                   to package.

 1.12 2002.04.09 - Squashed bug in removal of an index from a group when the index doesn't
                   exist in that group that caused index counts for the group to be decremented
                   in error.

 1.13 2003.09.28 - Interim release. Fixed false error return from 'first_key_in_group' for a group
                   that has not yet had any keys set.  Tightened calling
                   parm parses. Tweaked performance of preload updating code.
                   Added taint fix for stringifier identifier.
                   This release was driven by the taint issue and code bug as crisis items.
                   Hopefully a 1.14 release will be in the not too distant future.

 1.14 2003.11.14 - Patch to the MySQL database driver to accommodate changes in DBD::mysql.
                   Addition of a test for MySQL functionality. Patch and test thanks to
                   Kate L Pugh.

Public API

new({ -database => $database_object [,'-search_cache_size' => 1000, -search_cache_dir => '/var/tmp/search_cache', -stringifier => ['Storable','Data::Dumper'], ] });

Provides the interface for obtaining a new Search::InvertedIndex object for manipulating an inverted database.

Example 1:

 my $database = Search::InvertedIndex::DB::DB_File_SplitHash->new({
           -map_name => '/www/databases/test-map_names/test',
              -multi => 4,
          -file_mode => 0644,
          -lock_mode => 'EX',
       -lock_timeout => 30,
     -blocking_locks => 0,
          -cachesize => 1000000,
      -write_through => 0,
    -read_write_mode => 'RDONLY',
 });

 my $inv_map = Search::InvertedIndex->new({
             '-database' => $database,
    '-search_cache_size' => 1000,
     '-search_cache_dir' => '/var/tmp/search_cache',
            -stringifier => ['Storable','Data::Dumper'],
 });

Parameter explanations:

  -database          - A database interface object. Defined database interfaces
                       are currently Search::InvertedIndex::DB::DB_File_SplitHash
                       and Search::InvertedIndex::DB::Mysql. (Required)

  -stringifier       - Declares the stringifier used to store information in the
                       underlying database. Currently defined stringifiers are
                       'Storable' and 'Data::Dumper'. The default is to use
                       'Storable' with fallback to 'Data::Dumper'. (Optional)

  -search_cache_size - Sets the number of cached searches to hold in the search cache. (Optional)

  -search_cache_dir  - Sets the directory to be used for the search cache.
                       (Required if -search_cache_size is set to something other than 0)

The -database parameter is required and must be a 'Search::InvertedIndex::DB::...' type database object. The other two parameters are optional and define the location and size of the search cache. If omitted, no search caching will be done.

The optional '-stringifier' parameter can be used to override the default use of 'Storable' (with fallback to 'Data::Dumper') as the stringifier used for storing data by the module. Specifying -stringifier => 'Data::Dumper' selects 'Data::Dumper' (only) as the stringifier, while specifying -stringifier => ['Data::Dumper','Storable'] selects 'Data::Dumper' by preference (falling back to 'Storable' if 'Data::Dumper' is not available). If a database was created using a particular serializer, the module will automatically detect it and attempt to use the correct one.
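
The preference-with-fallback behavior described for '-stringifier' can be pictured with the short sketch below. This is not the module's code: pick_stringifier is a hypothetical helper written for this example, and it only shows the general "try each module in order" pattern using the standard require mechanism.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Search::InvertedIndex): return the
# first serializer module from a preference list that can be loaded.
sub pick_stringifier {
    my @preferred = @_;
    for my $module (@preferred) {
        # Accept only plain module names before handing them to require.
        next unless $module =~ /\A\w+(?:::\w+)*\z/;
        return $module if eval "require $module; 1";
    }
    die "No usable stringifier among: @preferred\n";
}

# Prefer Storable, fall back to Data::Dumper if Storable cannot be loaded.
my $stringifier = pick_stringifier('Storable', 'Data::Dumper');
print "Using $stringifier\n";
```

Passing a single name, as in pick_stringifier('Data::Dumper'), corresponds to the "only" form described above: there is nothing to fall back to, so a load failure is fatal.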

lock({ -lock_mode => 'EX|SH|UN' [, -lock_timeout => 30] [, -blocking_locks => 0] });

Changes a lock on the underlying database.

Forces 'sync' if the state is changed from 'EX' to a lower lock state (i.e. 'SH' or 'UN'). Croaks on errors.

Example:

        $inv_map->lock({ -lock_mode => 'EX', -lock_timeout => 30, -blocking_locks => 0 });

The only _required_ parameter is the -lock_mode. The other parameters can be inherited from the object state. If the other parameters are used, they change the object state to match the new settings.

status(-open|-lock_mode);

Returns the requested status line for the database. Allowed requests are '-open' and '-lock_mode'.

Example 1: my $status = $inv_map->status(-open); # Returns either '1' or '0'

Example 2: my $status = $inv_map->status(-lock_mode); # Returns 'UN', 'SH' or 'EX'

update({ -update => $update });

Performs an update on the map. This is designed for adding, changing, or deleting a batch of related information in a single block update. It takes a Search::InvertedIndex::Update object as input. It assumes that you wish to remove all references to the specified index and replace them with a new list of references. It can also update the -data for the -index. If -data is passed and the -index does not already exist, a new index record will be created. It is a fatal error to pass a nonexistent index without a -data parameter to initialize it. It is also a fatal error to pass an update for a nonexistent -group.

Passing an empty -keys has the effect of deleting the index from the group (but not from the system).

Example:

 my $update = Search::InvertedIndex::Update->new(...);
 $inv_map->update({ -update => $update });

In most cases it is much faster to update an index using the update method than the add_entry_to_group method, because batching the changes allows for efficiency optimizations when there is more than one key.

preload_update({ -update => $update });

'preload_update' places the passed 'update' object data into a pending queue which is not reflected in the searchable database until the 'update_group' method has been called. This allows the loading process to be streamlined for maximum performance on large full updates. This method is not appropriate for incremental updates, as the 'update_group' method destroys the previous searchable data set on execution.

It also places the database effectively offline during the update, so this is not a suitable method for updating an 'online' database. Updates should happen on an 'offline' copy that is then swapped into place with the 'online' database.

Example:

 my $update = Search::InvertedIndex::Update->new(...);
 $inv_map->preload_update({ -update => $update });
                .
                .
                .
 $inv_map->update_group({ -group => 'test' });

clear_preload_update_for_group({ -group => $group });

This clears all the data from the preload area for the specified group.

update_group({ -group => $group[, -block_size => 65536] });

This clears the specified group and loads all preloaded data (updates batch loaded through the 'preload_update' method and pending finalization).

This is by far the fastest way to load a large set of data into the search system - but it is an 'all or nothing' approach. No 'incremental' updating is possible via this interface - the update_group completely erases all previously searchable data from the group and replaces it with the pending 'preload'ed data.

Examples:

  $inv_map->update_group({ -group => 'test' });

  $inv_map->update_group({ -group => 'test', -block_size => 65536 });

-block_size determines the 'chunking factor' used to limit the amount of memory the update uses (it corresponds roughly to the number of line entry items to be processed in memory at one time). Higher '-block_size's will improve performance until you run out of real memory. The default is 65536.

Since an exclusive lock should be held during the entire process, the database is essentially inaccessible until the update is complete. It is probably inadvisable to use this method of updating without keeping an 'online' and a separate 'offline' database, and copying the 'offline' database over the 'online' one after the mass update on the 'offline' database completes.
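
The 'all or nothing' relationship between the preload area and the searchable data can be modeled with a few lines of plain Perl. The sketch below is a toy model under stated assumptions, not the module's implementation: %live, %pending and both subs are invented here purely to show that preloaded data stays invisible until the group is finalized, and that finalizing discards whatever the group held before.

```perl
use strict;
use warnings;

# Toy model of the preload workflow (not the module's implementation):
# %live is what searches would see, %pending is the preload area.
my (%live, %pending);

sub preload_update {
    my ($group, $index, $keys) = @_;
    $pending{$group}{$index} = { %$keys };   # queued, not yet searchable
    return;
}

sub update_group {
    my ($group) = @_;
    # All or nothing: the previous data for the group is discarded and
    # replaced by whatever was preloaded (possibly nothing).
    $live{$group} = delete($pending{$group}) || {};
    return;
}

$live{test}{old_doc} = { stale => 1 };
preload_update('test', 'new_doc', { fresh => 10 });
print exists $live{test}{new_doc} ? "visible\n" : "not visible yet\n";

update_group('test');
print join(',', sort keys %{ $live{test} }), "\n";
```

In this model 'old_doc' disappears after update_group even though no update mentioned it, which is why the method is unsuitable for incremental changes.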

search({ -query => $query [,-cache => 1] });

Performs a query on the map and returns the results as a Search::InvertedIndex::Result object containing the keys and rankings.

Example:

 my $query = Search::InvertedIndex::Query->new(...);
 my $result = $inv_map->search({ -query => $query });

Performs a complex multi-key match search with boolean logic and optional search term weighting.

The search request is formatted as follows:

 my $result = $inv_map->search({ -query => $query });

where '$query' is a Search::InvertedIndex::Query object.

Each node in the query can be either a specific search term with an optional weighting term (a Search::InvertedIndex::Query::Leaf object) or a logic term with its own sub-branches (a Search::InvertedIndex::Query object).

The weightings are applied to the returned matches for each search term by multiplication of their base ranking before combination with the other logic terms.

This allows recursive use of search to resolve arbitrarily complex boolean searches and weight different search terms.

The optional -cache parameter instructs the database to cache the search and its results (if the -search_cache_dir and -search_cache_size initialization parameters are configured) for better performance on repeat searches. '1' means use the cache, '0' means do not.
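
The recursive evaluation of nested queries with weights can be sketched in plain Perl as follows. This is a toy model, not the module's algorithm: the %map data, the evaluate sub and the plain-hash query layout are invented for illustration, and the choice to sum child rankings is an assumption (the text above only states that weights multiply base rankings before combination; the real ranking computation may differ).

```perl
use strict;
use warnings;

# Toy data: key => { index => base ranking }. Illustration only.
my %map = (
    complex   => { doc1 => 15, doc2 => 5 },
    structure => { doc1 => 15 },
    gold      => { doc3 => 10 },
);

# Evaluate a query tree and return { index => ranking }.
# ASSUMPTION: child rankings are summed; the real module may differ.
sub evaluate {
    my ($query) = @_;
    my $weight = defined $query->{weight} ? $query->{weight} : 1;

    # Leaf: look up one key and scale its base rankings by the weight.
    if (exists $query->{key}) {
        my $hits = $map{ $query->{key} } || {};
        return { map { ( $_ => $hits->{$_} * $weight ) } keys %$hits };
    }

    # Node: evaluate sub-nodes and leaves recursively, then combine.
    my @children = map { evaluate($_) }
        ( @{ $query->{nodes} || [] }, @{ $query->{leafs} || [] } );
    my (%sum, %seen);
    for my $child (@children) {
        for my $index (keys %$child) {
            $sum{$index} += $child->{$index};
            $seen{$index}++;
        }
    }

    # 'and' keeps only indexes matched by every child; 'or' keeps all.
    my @keep = keys %sum;
    if ($query->{logic} eq 'and') {
        @keep = grep { $seen{$_} == @children } @keep;
    }
    return { map { ( $_ => $sum{$_} * $weight ) } @keep };
}

# (complex AND structure), weighted x2, OR gold
my $query = {
    logic  => 'or',
    weight => 1,
    nodes  => [ { logic => 'and', weight => 2,
                  leafs => [ { key => 'complex' }, { key => 'structure' } ] } ],
    leafs  => [ { key => 'gold', weight => 1 } ],
};
my $result = evaluate($query);
print "$_ => $result->{$_}\n" for sort keys %$result;
```

The nesting mirrors the '-nodes' and '-leafs' lists of Search::InvertedIndex::Query: a node only ever combines the already-weighted results of its children, so arbitrarily deep boolean expressions reduce to repeated applications of the same step.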

data_for_index({ -index => $index });

Returns the data record for the passed -index. Returns undef if no matching -index is in the system.

Example: my $data = $inv_map->data_for_index({ -index => $index });

clear_all;

Completely clears the contents of the database and the search cache.

clear_cache;

Completely clears the contents of the search cache.

close;

Closes the currently open -map and flushes all associated buffers.

number_of_groups;

Returns the raw number of groups in the system.

Example: my $n = $inv_map->number_of_groups;

number_of_indexes;

Returns the raw number of indexes in the system.

Example: my $n = $inv_map->number_of_indexes;

number_of_keys;

Returns the raw number of keys in the system.

Example: my $n = $inv_map->number_of_keys;

number_of_indexes_in_group({ -group => $group });

Returns the raw number of indexes in a specific group.

Example: my $n = $inv_map->number_of_indexes_in_group({ -group => $group });

number_of_keys_in_group({ -group => $group });

Returns the raw number of keys in a specific group.

Example: my $n = $inv_map->number_of_keys_in_group({ -group => $group });

add_group({ -group => $group });

Adds a new '-group' to the map. There is normally no need to call this method from outside the module. The addition of new -groups is done automatically when adding new entries.

Example: $inv_map->add_group({ -group => $group });

Croaks if unable to successfully create the group for some reason.

It silently eats attempts to create an existing group.

add_index({ -index => $index, -data => $data });

Adds an index entry to the system.

Example: $inv_map->add_index({ -index => $index, -data => $data });

If the 'index' is the same as an existing index, the '-data' for that index will be updated.

-data can be pretty much any scalar: strings and object, hash, or array references are all OK. They will be transparently serialized using Storable (preferred) or Data::Dumper.

This method should be called to set the '-data' record returned by searches to something useful. If you do not, you will have to maintain the information you want to show to users separately from the main search engine core.

The method returns the index_enum of the index.

add_index_to_group({ -group => $group, -index => $index[, -data => $data] });

Adds an index entry to a group. If the index does not already exist in the system, adds it to the system as well.

Examples:

   $inv_map->add_index_to_group({ -group => $group, '-index' => $index});

   $inv_map->add_index_to_group({ -group => $group, '-index' => $index, -data => $data});

Returns the 'index_enum' for the index record.

If the 'index' is the same as an existing index, the 'index_enum' of the existing index will be returned.

There is normally no need to call this method directly. Addition of indexes to groups is handled automatically during addition of new entries.

It cannot be used to add an index to nonexistent groups. This is a feature, not a bug.

The -data parameter is optional.

add_key_to_group({ -group => $group, -key => $key });

Adds a key entry to a group.

Example: $inv_map->add_key_to_group({ -group => $group, -key => $key });

Returns the 'key_enum' for the key record.

If the 'key' is the same as an existing key, the 'key_enum' of the existing key will be returned.

There is normally no need to call this method directly. Addition of keys to groups is handled automatically during addition of new entries.

It cannot be used to add keys to nonexistent groups. This is a feature, not a bug.

add_entry_to_group({ -group => $group, -key => $key, -index => $index, -ranking => $ranking });

Adds a reference to a particular index for a key with a ranking to a specific group.

Example: $inv_map->add_entry_to_group({ -group => $group, -key => $key, -index => $index, -ranking => $ranking });

This method cannot be used to create new -indexes or -groups. This is a feature, not a bug. It *will* create new -keys as needed.

remove_group({ -group => $group });

Remove all entries for a group from the map.

Example: $inv_map->remove_group({ -group => $group });

This removes all key and key/index entries for the group and all other group specific data from the map.

Use this method when you wish to completely delete a searchable 'group' from the map without disturbing other existing groups.

remove_entry_from_group({ -group => $group, -key => $key, -index => $index });

Remove a specific key<->index entry from the map for a group.

Example: $inv_map->remove_entry_from_group({ -group => $group, -key => $key, -index => $index });

Does not remove the -key or -index from the database or the group - only the entries mapping the two to each other.

remove_index_from_group ({ -group => $group, -index => $index });

Remove all references to a specific index for all keys for a group.

Example: $inv_map->remove_index_from_group({ -group => $group, -index => $index });

Note: This *does not* remove the index from the _system_ - just a specific group.

It is a null operation to remove an undeclared index or to remove a declared index from a group where it is not used.

remove_index_from_all ({ -index => $index });

Remove all references to a specific index from the system.

Example: $inv_map->remove_index_from_all({ -index => $index });

This *completely* removes it from all groups and the master system entries.

It is a null operation to remove an undefined index.

remove_key_from_group({ -group => $group, -key => $key });

Remove all references to a specific key for all indexes for a group.

Example: $inv_map->remove_key_from_group({ -group => $group, -key => $key });

Returns undef if the specified key was not in the database. Returns '1' if the specified key was in the database and has been successfully deleted.

Croaks on errors.

list_all_keys_in_group({ -group => $group });

Returns an anonymous array containing a list of all defined keys in the specified group.

Example: $keys = $inv_map->list_all_keys_in_group({ -group => $group });

Note: This can result in *HUGE* returned lists. If you have a lot of records in the group, you are better off using the iteration support ('first_key_in_group', 'next_key_in_group').

first_key_in_group({ -group => $group_name });

Returns the 'first' key in the -group based on hash ordering.

Returns 'undef' if there are no keys in the group.

Example: my $first_key = $inv_map->first_key_in_group({-group => $group});

next_key_in_group({ -group => $group, -key => $key });

Returns the 'next' key in the group based on hash ordering.

Returns 'undef' when there are no more keys in the group or if the passed -key is not in the group map.

Example: my $next_key = $inv_map->next_key_in_group({ -group => $group, -key => $key });
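
The two iterator methods are meant to be used together in a loop. The sketch below shows one way to write that loop; for_each_key_in_group is a hypothetical helper, and Local::FakeMap is a stand-in object invented here so the example runs without a database. With a real map you would pass your Search::InvertedIndex object instead.

```perl
use strict;
use warnings;

# Walk every key in a group using the documented iterator pair.
# $inv_map is any object providing first_key_in_group/next_key_in_group.
sub for_each_key_in_group {
    my ($inv_map, $group, $callback) = @_;
    my $key = $inv_map->first_key_in_group({ -group => $group });
    while (defined $key) {
        $callback->($key);
        $key = $inv_map->next_key_in_group({ -group => $group, -key => $key });
    }
    return;
}

# Stand-in object so the sketch runs without a database (hypothetical;
# it ignores -group and simply walks a fixed list of keys).
{
    package Local::FakeMap;
    sub new { my ($class, @keys) = @_; return bless { keys => [@keys] }, $class }
    sub first_key_in_group { return $_[0]{keys}[0] }
    sub next_key_in_group {
        my ($self, $args) = @_;
        my $keys = $self->{keys};
        for my $i (0 .. $#$keys) {
            return $keys->[ $i + 1 ] if $keys->[$i] eq $args->{-key};
        }
        return;
    }
}

my @seen;
for_each_key_in_group(Local::FakeMap->new(qw(are complex ok)), 'keywords',
                      sub { push @seen, $_[0] });
print "@seen\n";
```

Only one key is held in memory at a time, which is the advantage over list_all_keys_in_group for large groups. The same loop shape applies to the index and group iterators.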

list_all_indexes_in_group({ -group => $group });

Returns an anonymous array containing a list of all defined indexes in the group.

Example: $indexes = $inv_map->list_all_indexes_in_group({ -group => $group });

Note: This can result in *HUGE* returned lists. If you have a lot of records in the group, you are better off using the iteration support ('first_index_in_group', 'next_index_in_group').

first_index_in_group({ -group => $group });

Returns the 'first' index in the -group based on hash ordering. Returns 'undef' if there are no indexes in the group.

Example: my $first_index = $inv_map->first_index_in_group({ -group => $group });

next_index_in_group({ -group => $group, -index => $index });

Returns the 'next' index in the -group based on hash ordering. Returns 'undef' if there are no more indexes.

Example: my $next_index = $inv_map->next_index_in_group({ -group => $group, -index => $index });

list_all_indexes;

Returns an anonymous array containing a list of all defined indexes in the map.

Example: $indexes = $inv_map->list_all_indexes;

Note: This can result in *HUGE* returned lists. If you have a lot of records in the map or do not have a lot of memory, you are better off using the iteration support ('first_index', 'next_index').

first_index;

Returns the 'first' index in the system based on hash ordering. Returns 'undef' if there are no indexes.

Example: my $first_index = $inv_map->first_index;

next_index({-index => $index});

Returns the 'next' index in the system based on hash ordering. Returns 'undef' if there are no more indexes.

Example: my $next_index = $inv_map->next_index({-index => $index});

list_all_groups;

Returns an anonymous array containing a list of all defined groups in the map.

Example: $groups = $inv_map->list_all_groups;

If you have a lot of groups in the map or do not have a lot of memory, you are better off using the iteration support ('first_group', 'next_group').

first_group;

Returns the 'first' group in the system based on hash ordering. Returns 'undef' if there are no groups.

Example: my $first_group = $inv_map->first_group;

next_group ({-group => $group });

Returns the 'next' group in the system based on hash ordering. Returns 'undef' if there are no more groups.

Example: my $next_group = $inv_map->next_group({-group => $group});

VERSION

1.14

COPYRIGHT

Copyright 1999-2002, Benjamin Franz (<URL:http://www.nihongo.org/snowhare/>) and FreeRun Technologies, Inc. (<URL:http://www.freeruntech.com/>). All Rights Reserved. This software may be copied or redistributed under the same terms as Perl itself.

AUTHOR

Benjamin Franz

TODO

Integrate code and documentation patches from Kate Pugh. Separate POD into .pod files.

Concept item for evaluation: By storing a dense list of all indexed keywords, you would be able to use a regular expression or other fuzzy search matching scheme comparatively efficiently, locate possible words via a grep and then search on the possibilities. Seems to make sense to implement that as _another_ module that uses this module as a backend. 'Search::InvertedIndex::Fuzzy' perhaps.

SEE ALSO

 Search::InvertedIndex::Query  Search::InvertedIndex::Query::Leaf
 Search::InvertedIndex::Result Search::InvertedIndex::Update
 Search::InvertedIndex::DB::DB_File_SplitHash
 Search::InvertedIndex::DB::Mysql