ElasticSearch - An API for communicating with ElasticSearch
Version 0.13, tested against ElasticSearch server version 0.7.0.
ElasticSearch is an Open Source (Apache 2 license), distributed, RESTful Search Engine based on Lucene, and built for the cloud, with a JSON API.
Check out its features: http://www.elasticsearch.com/products/elasticsearch/
This module is a thin API which makes it easy to communicate with an ElasticSearch cluster.
It maintains a list of all servers/nodes in the ElasticSearch cluster, and spreads the load randomly across these nodes. If the current active node disappears, then it attempts to connect to another node in the list.
Forking a process triggers a server list refresh, and a new connection to a randomly chosen node in the list.
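The node selection and failover described above can be pictured with a small sketch. This is not the module's own code: pick_server() and the $is_alive callback are hypothetical names, and the real module contacts the nodes over HTTP rather than calling a callback.

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Hypothetical helper: try the known servers in random order and
# return the first one that the $is_alive callback accepts.
sub pick_server {
    my ( $servers, $is_alive ) = @_;
    for my $server ( shuffle @$servers ) {
        return $server if $is_alive->($server);
    }
    die "No servers available\n";
}

my @servers = ( 'es1.foo.com:9200', 'es2.foo.com:9200', 'es3.foo.com:9200' );

# Pretend that only es2 is reachable.
my $current = pick_server( \@servers, sub { $_[0] eq 'es2.foo.com:9200' } );
print "Current server: $current\n";
```

Shuffling the list first spreads connections across the nodes, and walking the shuffled list gives the failover behaviour.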
    use ElasticSearch;

    my $e = ElasticSearch->new(
        servers     => 'search.foo.com',
        debug       => 1,
        trace_calls => 'log_file',
    );

    $e->index(
        index => 'twitter',
        type  => 'tweet',
        id    => 1,
        data  => {
            user      => 'kimchy',
            post_date => '2009-11-15T14:12:12',
            message   => 'trying out Elastic Search'
        }
    );

    $data = $e->get(
        index => 'twitter',
        type  => 'tweet',
        id    => 1
    );

    $results = $e->search(
        index => 'twitter',
        type  => 'tweet',
        query => {
            term => { user => 'kimchy' },
        }
    );
You can download the latest released version of ElasticSearch from http://github.com/elasticsearch/elasticsearch/downloads.
To build the latest development version from source, you can do the following:
    cd ~
    rm -Rf elasticsearch*
    wget http://github.com/elasticsearch/elasticsearch/tarball/master
    tar -xzf elasticsearch*.gz
    cd elasticsearch*
    ./gradlew
    cd /path/where/you/want/elasticsearch
    unzip ~/elasticsearch/build/distributions/elasticsearch*
To start a test server in the foreground:
./bin/elasticsearch -f
You can start multiple servers by repeating this command - they will autodiscover each other.
More instructions are available here: http://www.elasticsearch.com/docs/elasticsearch/setup/installation
I've tried to follow the same terminology as used in the ElasticSearch docs when naming methods, so it should be easy to tie the two together.
Some methods require a specific index and a specific type, while others allow a list of indices or types, or allow you to specify all indices or types. I distinguish between them as follows:
$e->method( index => multi, type => single, ...)
multi values can be:
    index => 'twitter'          # specific index
    index => ['twitter','user'] # list of indices
    index => undef              # (or not specified) = all indices
single values must be a scalar, and are required parameters
type => 'tweet'
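As an illustration of the multi convention, here is a minimal sketch of how such a value could be turned into a list. multi_to_list() is a hypothetical helper written for this example, not a method of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a "multi" value into a flat list of names.
# undef (or a missing parameter) yields an empty list, meaning "all".
sub multi_to_list {
    my ($value) = @_;
    return ()      unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

my @one  = multi_to_list('twitter');
my @many = multi_to_list( [ 'twitter', 'user' ] );
my @all  = multi_to_list(undef);

printf "%d %d %d\n", scalar @one, scalar @many, scalar @all;
```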
Methods that query the ElasticSearch cluster return the raw data structure that the cluster returns. This may change in the future, but as these data structures are still in flux, I thought it safer not to try to interpret them.
Anything that is known to be an error throws an exception, eg trying to delete a non-existent index.
new()
    $e = ElasticSearch->new(
        servers    => '127.0.0.1:9200'            # single server
                    | ['es1.foo.com:9200',
                       'es2.foo.com:9200'],       # multiple servers
        debug      => 1 | 0,
        ua_options => { LWP::UserAgent options },
    );
servers is a required parameter and can be either a single server or an ARRAY ref with a list of servers. These servers are used to retrieve a list of all servers in the cluster, after which one is chosen at random to be the "current_server()".
See also: "debug()", "ua_options()", "refresh_servers()", "servers()", "current_server()"
index()
    $result = $e->index(
        index   => single,
        type    => single,
        id      => $document_id,        # optional, otherwise auto-generated
        data    => { key => value, ... },
        timeout => eg '1m' or '10s'     # optional
        create  => 1 | 0                # optional
    );
eg:
    $result = $e->index(
        index => 'twitter',
        type  => 'tweet',
        id    => 1,
        data  => {
            user      => 'kimchy',
            post_date => '2009-11-15T14:12:12',
            message   => 'trying out Elastic Search'
        },
    );
Used to add a document to a specific index as a specific type with a specific id. If the index/type/id combination already exists, then that document is updated, otherwise it is created.
Note:
If the id is not specified, then ElasticSearch autogenerates a unique ID and a new document is always created.
If create is true, then a new document is created, even if the same index/type/id combination already exists! create can be used to slightly increase performance when creating documents that are known not to exist in the index.
See also: http://www.elasticsearch.com/docs/elasticsearch/rest_api/index and "put_mapping()"
set()
set() is a synonym for "index()"
create()
create() is a synonym for "index()", but it creates the document instead of first checking whether the doc already exists. This speeds up the indexing process.
get()
$result = $e->get( index => single, type => single, id => single, );
Returns the document stored at index/type/id or throws an exception if the document doesn't exist.
Example:
    $e->get( index => 'twitter', type => 'tweet', id => 1)

Returns:

    {
      _id     => 1,
      _index  => "twitter",
      _source => {
                   message   => "trying out Elastic Search",
                   post_date => "2009-11-15T14:12:12",
                   user      => "kimchy",
                 },
      _type   => "tweet",
    }
See also: "KNOWN ISSUES", http://www.elasticsearch.com/docs/elasticsearch/rest_api/get
delete()
$result = $e->delete( index => single, type => single, id => single, );
Deletes the document stored at index/type/id or throws an exception if the document doesn't exist.
$e->delete( index => 'twitter', type => 'tweet', id => 1);
See also: http://www.elasticsearch.com/docs/elasticsearch/rest_api/delete
search()
    $result = $e->search(
        index         => multi,
        type          => multi,
        query         => {query},
        search_type   => $search_type              # optional
        explain       => 1 | 0                     # optional
        facets        => { facets }                # optional
        fields        => [$field_1,$field_n]       # optional
        from          => $start_from               # optional
        size          => $no_of_results            # optional
        sort          => ['score',$field_1]        # optional
        scroll        => '5m' | '30s'              # optional
        highlight     => { highlight }             # optional
        indices_boost => { index_1 => 1.5,... }    # optional
    );
Searches for all documents matching the query. Documents can be matched against multiple indices and multiple types, eg:
    $result = $e->search(
        index => undef,                 # all
        type  => ['user','tweet'],
        query => { term => { user => 'kimchy' }}
    );
For all of the options that can be included in the query parameter, see http://www.elasticsearch.com/docs/elasticsearch/rest_api/search and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
scroll()
$result = $e->scroll(scroll_id => $scroll_id );
If a search has been executed with a scroll parameter, then the returned scroll_id can be used like a cursor to scroll through the rest of the results.
Note - this doesn't seem to work correctly in version 0.6.0 of ElasticSearch.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/search/#Scrolling
count()
    $result = $e->count(
        index => multi,
        type  => multi,

        bool | constantScore | disMax | field | filtered | flt | flt_field
        | match_all | mlt | mlt_field | query_string | prefix | range
        | term | wildcard
    );
Counts the number of documents matching the query. Documents can be matched against multiple indices and multiple types, eg
    $result = $e->count(
        index => undef,                 # all
        type  => ['user','tweet'],
        term  => { user => 'kimchy' }
    );
See also "search()", http://www.elasticsearch.com/docs/elasticsearch/rest_api/count and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
delete_by_query()
    $result = $e->delete_by_query(
        index => multi,
        type  => multi,

        bool | constantScore | disMax | field | filtered | flt | flt_field
        | match_all | mlt | mlt_field | query_string | prefix | range
        | term | wildcard
    );
Deletes any documents matching the query. Documents can be matched against multiple indices and multiple types, eg
    $result = $e->delete_by_query(
        index => undef,                 # all
        type  => ['user','tweet'],
        term  => { user => 'kimchy' }
    );
See also "search()", http://www.elasticsearch.com/docs/elasticsearch/rest_api/delete_by_query and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
terms()
    $terms = $e->terms(
        index          => multi,
        fields         => 'field' | [field1..],   # required
        min_freq       => integer,                # optional
        max_freq       => integer,                # optional
        size           => integer,                # optional
        sort           => term (default) | freq   # optional

        ## A range of terms eg alpha - omega
        from           => 'first term',           # optional
        to             => 'last term',            # optional
        from_inclusive => 1 | 0                   # optional
        to_inclusive   => 1 | 0                   # optional

        prefix         => 'prefix',               # terms starting with
        regexp         => 'regexp',               # terms matching ^regexp$
    );
Get terms (from one or more indices) and the number of times those terms appear in a document. Useful for generating tag clouds or for auto-suggestion.
    $terms = $e->terms(
        index  => 'twitter',
        fields => ['tweet','reply'],
        prefix => 'arnol'
    )
See also http://www.elasticsearch.com/docs/elasticsearch/rest_api/terms and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/
mlt()
    # mlt == more_like_this

    $results = $e->mlt(
        index              => single,   # required
        type               => single,   # required
        id                 => $id,      # required

        # optional more-like-this params
        boost_terms        => float
        mlt_fields         => 'scalar' or ['scalar_1', 'scalar_n']
        max_doc_freq       => integer
        max_query_terms    => integer
        max_word_len       => integer
        min_doc_freq       => integer
        min_term_freq      => integer
        min_word_len       => integer
        pct_terms_to_match => float
        stop_words         => 'scalar' or ['scalar_1', 'scalar_n']

        # optional search params
        scroll             => '5m' | '10s'
        search_type        => "predefined_value"
        explain            => {explain}
        facets             => {facets}
        fields             => {fields}
        from               => {from}
        highlight          => {highlight}
        size               => {size}
        sort               => {sort}
    )
More-like-this (mlt) finds related/similar documents. It is possible to run a search query with a moreLikeThis clause (where you pass in the text you're trying to match), or to use this method, which uses the text of the document referred to by index/type/id.
This gets transformed into a search query, so all of the search parameters are also available.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/more_like_this/ and http://www.elasticsearch.com/docs/elasticsearch/rest_api/query_dsl/moreLikeThis_query/
index_status()
$result = $e->index_status( index => multi, );
Returns the status of one or more indices, eg:

    $result = $e->index_status();                                # all
    $result = $e->index_status( index => ['twitter','buzz'] );
    $result = $e->index_status( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/status
create_index()
    $result = $e->create_index(
        index => single,
        defn  => {...}      # optional
    );

Creates a new index, optionally setting certain parameters, eg:

    $result = $e->create_index(
        index => 'twitter',
        defn  => {
            number_of_shards   => 3,
            number_of_replicas => 2,
        }
    );
Throws an exception if the index already exists.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/create_index
delete_index()
$result = $e->delete_index( index => single );
Deletes an existing index, or throws an exception if the index doesn't exist, eg:
$result = $e->delete_index( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/delete_index
aliases()
$result = $e->aliases( actions => [actions] | {actions} )
Adds or removes an alias for an index, eg:
    $result = $e->aliases( actions => [
        { remove => { index => 'foo', alias => 'bar' }},
        { add    => { index => 'foo', alias => 'baz' }}
    ]);
actions can be a single HASH ref, or an ARRAY ref containing multiple HASH refs.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/aliases/
get_aliases()
$result = $e->get_aliases( index => multi )
Returns a hashref listing all indices and their corresponding aliases, and all aliases and their corresponding indices, eg:
    {
      aliases => {
        bar => ["foo"],
        baz => ["foo"],
      },
      indices => { foo => ["baz", "bar"] },
    }
If you pass in the optional index argument, which can be an index name or an alias name, then it will only return the indices and aliases related to that argument.
flush_index()
$result = $e->flush_index( index => multi );
Flushes one or more indices. The flush process of an index basically frees memory from the index by flushing data to the index storage and clearing the internal transaction log. By default, ElasticSearch uses memory heuristics in order to automatically trigger flush operations as required in order to clear memory.
$result = $e->flush_index( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/flush
refresh_index()
$result = $e->refresh_index( index => multi );
Explicitly refreshes one or more indices, making all operations performed since the last refresh available for search. The (near) real-time capabilities depend on the index engine used. For example, the robin one requires refresh to be called, but by default a refresh is scheduled periodically.
$result = $e->refresh_index( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/refresh
clear_cache()
$result = $e->clear_cache( index => multi );
Clears the caches for the specified indices (currently only the filter cache).
See http://github.com/elasticsearch/elasticsearch/issues/issue/101
gateway_snapshot()
$result = $e->gateway_snapshot( index => multi );
Explicitly performs a snapshot through the gateway of one or more indices (backs them up). By default, each index gateway periodically snapshots changes, though this can be disabled and controlled completely through this API.
$result = $e->gateway_snapshot( index => 'twitter' );
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/gateway_snapshot and http://www.elasticsearch.com/docs/elasticsearch/modules/gateway
snapshot_index()
snapshot_index() is a synonym for "gateway_snapshot()"
optimize_index()
    $result = $e->optimize_index(
        index        => multi,
        only_deletes => 1 | 0,  # only_expunge_deletes
        flush        => 1 | 0,  # flush after optimization
        refresh      => 1 | 0,  # refresh after optimization
    )
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/optimize
put_mapping()
    $result = $e->put_mapping(
        index            => multi,
        type             => single,
        _all             => { ... },
        properties       => { ... },      # required
        timeout          => '5m' | '10s', # optional
        ignore_conflicts => 1 | 0,        # optional
    );
A mapping is the data definition of a type. If no mapping has been specified, then ElasticSearch tries to infer the type of each field in a document by looking at its contents, eg

    'foo' => string
    123   => integer
    1.23  => float

However, these heuristics can be confused, so it is safer (and much more powerful) to specify an official mapping instead, eg:
    $result = $e->put_mapping(
        index      => ['twitter','buzz'],
        type       => 'tweet',
        properties => {
            user      => {type => "string", index => "not_analyzed"},
            message   => {type => "string", nullValue => "na"},
            post_date => {type => "date"},
            priority  => {type => "integer"},
            rank      => {type => "float"}
        }
    );
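As a rough illustration of how type inference can be confused: what the server sees depends on whether a Perl scalar is encoded as a JSON number or as a JSON string. The sketch below uses the core module JSON::PP as a stand-in for the JSON::XS object used by this module (an assumption, made so that the example runs without extra dependencies), and the field names are made up for the example.

```perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts the keys so that the output is predictable.
my $json = JSON::PP->new->canonical;

my $doc = {
    priority => 3,           # numeric literal
    rank     => 1.5,         # numeric literal
    user     => 'kimchy',    # string
    zip      => '90210',     # digits, but quoted, so still a string
};

my $encoded = $json->encode($doc);
print "$encoded\n";
```

Here zip is sent as a string even though it looks numeric, which is the kind of ambiguity that an explicit mapping removes.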
See also: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/indices/put_mapping and http://www.elasticsearch.com/docs/elasticsearch/mapping
get_mapping()
$mapping = $e->get_mapping( index => single, type => multi );
Returns the mappings for all types in an index, or the mapping for the specified type(s), eg:
    $mapping  = $e->get_mapping( index => 'twitter', type => 'tweet' );
    $mappings = $e->get_mapping( index => 'twitter', type => ['tweet','user'] );
    # { tweet => {mapping}, user => {mapping} }
The index argument must be an index name, and not an alias name.
cluster_state()
$result = $e->cluster_state();
Returns cluster state information.
See http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/state/
cluster_health()
    $result = $e->cluster_health(
        index                        => multi,
        level                        => 'cluster' | 'indices' | 'shards',
        wait_for_status              => 'red' | 'yellow' | 'green',
        | wait_for_relocating_shards => $number_of_shards,
        timeout                      => $seconds
    );
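Returns the status of the cluster, or of the index|indices or shards, where the returned status means:
red: data is not allocated
yellow: primary shards are allocated
green: all shards are allocated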
It can block to wait for a particular status (or better), or can block to wait until the specified number of shards have been relocated (where 0 means all).
If waiting, then a timeout can be specified.
For example:
$result = $e->cluster_health( wait_for_status => 'green', timeout => 10)
nodes()
    $result = $e->nodes(
        nodes    => multi,
        settings => 1 | 0        # optional
    );
Returns information about one or more nodes or servers in the cluster. If settings is true, then it includes the node settings information.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_info
nodes_stats()
$result = $e->nodes_stats( nodes => multi, );
Returns various statistics about one or more nodes in the cluster.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_stats/
shutdown()
    $result = $e->shutdown(
        nodes => multi,
        delay => '5s' | '10m'        # optional
    );
Shuts down one or more nodes (or the whole cluster if no nodes specified), optionally with a delay.
nodes can also have the values _local, _master or _all.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_shutdown/
restart()
    $result = $e->restart(
        nodes => multi,
        delay => '5s' | '10m'        # optional
    );
Restarts one or more nodes (or the whole cluster if no nodes specified), optionally with a delay.
See: http://www.elasticsearch.com/docs/elasticsearch/rest_api/admin/cluster/nodes_restart
servers()
$servers = $e->servers
Returns a list of the servers/nodes known to be in the cluster the last time that "refresh_servers()" was called.
refresh_servers()
    $e->refresh_servers( $server | [$server_1, ...$server_n])
    $e->refresh_servers()
Tries to contact each server in the list to retrieve a list of servers/nodes currently in the cluster. If it succeeds, then it updates "servers()" and randomly selects one server to be the "current_server()"
If no servers are passed in, then it uses the list from "servers()" (ie the last known good list) instead.
Throws an exception if no servers can be found.
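refresh_servers() is called from "new()", from "request()" when a request fails with a Can't connect error, and from "current_server()" when no current server is set for the current PID (eg after a fork).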
current_server()
$current_server = $e->current_server()
Returns the current server for the current PID, or if none is set, then it tries to get a new current server by calling "refresh_servers()".
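The per-PID behaviour can be sketched as a cache keyed on the process ID. This is an illustration only: current_server_for_pid() is a hypothetical helper, and it picks from a fixed list where the real module would call "refresh_servers()".

```perl
use strict;
use warnings;

my %current_server;    # PID => chosen server

# Hypothetical helper: return the server cached for this PID, choosing
# one at random from $servers the first time a process asks.
sub current_server_for_pid {
    my ($servers) = @_;
    $current_server{$$} //= $servers->[ int( rand( scalar @$servers ) ) ];
    return $current_server{$$};
}

my @servers = ( 'es1.foo.com:9200', 'es2.foo.com:9200' );

my $first  = current_server_for_pid( \@servers );
my $second = current_server_for_pid( \@servers );    # same PID, same answer

print "Current server for PID $$: $first\n";
```

A forked child has a different PID, so its first lookup misses the cache and a new server is chosen.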
ua()
$ua = $e->ua
Returns the current LWP::UserAgent instance for the current PID. If there is none, then it creates a new instance, with any options specified in "ua_options()"
Keep-alive is used by default (via LWP::ConnCache).
ua_options()
$ua_options = $e->ua_options({....})
Gets/sets the current list of options to be used when creating a new LWP::UserAgent instance. You may, for instance, want to set timeout.
This is best set when creating a new instance of ElasticSearch with "new()".
JSON()
$json_xs = $e->JSON
Returns the current JSON::XS object which is used to encode and decode all JSON when communicating with ElasticSearch.
If you need to change the JSON settings you can do (eg):
$e->JSON->utf8
It is probably better not to fiddle with this! ElasticSearch expects all data to be provided as Perl strings (not as UTF8 encoded byte strings) and returns all data from ElasticSearch as Perl strings.
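If your data arrives as UTF8 encoded bytes (eg read from a file or a socket), it needs to be decoded into a Perl string first. A minimal sketch using the core Encode module, which is not part of this module's API:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The word "cafe" with an accented e, as UTF-8 encoded bytes.
my $bytes = "caf\xc3\xa9";

# Decode to a Perl (character) string before handing the value on.
my $string = decode( 'UTF-8', $bytes );

# Encoding again gives back the original bytes.
my $round_trip = encode( 'UTF-8', $string );

printf "bytes: %d, characters: %d\n", length($bytes), length($string);
```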
camel_case()
$bool = $e->camel_case($bool)
Gets/sets the camel_case flag. If true, then all JSON keys returned by ElasticSearch are in camelCase, instead of with_underscores. This flag does not apply to the source document being indexed or fetched.
Defaults to false.
request()
    $result = $e->request({
        method => 'GET|PUT|POST|DELETE',
        cmd    => url,          # eg '/twitter/tweet/123'
        data   => $hash_ref     # converted to JSON document
    })
The request() method is used to communicate with the ElasticSearch "current_server()". If any request fails with a Can't connect error, then request() tries to refresh the server list, and repeats the request.
Any other error will throw an exception.
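The retry behaviour can be sketched as follows. request_with_retry() and its two callbacks are hypothetical names used for illustration, and the sketch assumes a single retry, which the description above does not specify.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code.
#   $send    : code ref that performs the request and dies on failure
#   $refresh : code ref that refreshes the server list
sub request_with_retry {
    my ( $send, $refresh ) = @_;

    my $result;
    return $result if eval { $result = $send->(); 1 };

    # Only connection failures trigger a refresh and a second attempt.
    my $error = $@;
    die $error unless $error =~ /Can't connect/;

    $refresh->();
    return $send->();
}

my ( $attempts, $refreshes ) = ( 0, 0 );

my $result = request_with_retry(
    sub {
        $attempts++;
        die "Can't connect to es1.foo.com:9200\n" if $attempts == 1;
        return { ok => 1 };
    },
    sub { $refreshes++ },
);

print "attempts: $attempts, refreshes: $refreshes\n";
```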
throw()
$e->throw('ErrorType','ErrorMsg', {vars})
Throws an exception of class ref $e . '::Error::' . $error_type, eg:
$e->throw('Param', 'Missing required param', { params => $params})
... will throw an error of class ElasticSearch::Error::Param.
Any vars passed in will be available as $error->{-vars}.
If "debug()" is true, then $error->{-stacktrace} will contain a stacktrace.
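To show how the error class name is put together and how such an exception can be caught, here is a self-contained sketch. My::Client is a hypothetical stand-in for the real class, and the -text key is an assumption; only -vars and -stacktrace are documented above.

```perl
use strict;
use warnings;

package My::Client;    # hypothetical stand-in for the real client class

sub new { return bless {}, shift }

# Build the error class name from the object's class and the error type,
# then die with a blessed hash carrying the message and the vars.
sub throw {
    my ( $self, $type, $msg, $vars ) = @_;
    my $class = ref($self) . '::Error::' . $type;
    die bless { -text => $msg, -vars => $vars }, $class;
}

package main;

my $client = My::Client->new;

my $ok = eval {
    $client->throw( 'Param', 'Missing required param', { params => ['id'] } );
    1;
};
my $error = $@;

print 'Caught: ', ref($error), "\n" unless $ok;
```

Checking ref($error) is one way of telling the different error types apart in calling code.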
debug()
$e->debug(1|0);
If debug() is true, then exceptions include a stack trace.
trace_calls()
    $e->trace_calls(1);            # log to STDERR
    $e->trace_calls($filename);    # log to $filename
    $e->trace_calls(0 | undef);    # disable logging
trace_calls() is used for debugging. All requests to the cluster are logged either to STDERR or the specified file, in a form that can be rerun with curl.
The cluster response will also be logged, and commented out.
Example: $e->nodes() is logged as:
    curl -XGET http://127.0.0.1:9200/_cluster/nodes
    # {
    #    "clusterName" : "elasticsearch",
    #    "nodes" : {
    #       "getafix-24719" : {
    #          "http_address" : "inet[/127.0.0.2:9200]",
    #          "data_node" : true,
    #          "transport_address" : "inet[getafix.traveljury.com/127.0.
    # >         0.2:9300]",
    #          "name" : "Sangre"
    #       },
    #       "getafix-17782" : {
    #          "http_address" : "inet[/127.0.0.2:9201]",
    #          "data_node" : true,
    #          "transport_address" : "inet[getafix.traveljury.com/127.0.
    # >         0.2:9301]",
    #          "name" : "Williams, Eric"
    #       }
    #    }
    # }
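The log format shown above (a curl command followed by the commented-out response) can be reproduced with a few lines of Perl. format_trace() is a hypothetical helper written for this illustration, not the module's own logging code.

```perl
use strict;
use warnings;

# Hypothetical helper: format one request and its response in the style
# shown above, with every line of the response commented out.
sub format_trace {
    my ( $method, $url, $response ) = @_;
    ( my $commented = $response ) =~ s/^/# /mg;
    return "curl -X$method $url\n$commented\n";
}

print format_trace(
    'GET',
    'http://127.0.0.1:9200/_cluster/nodes',
    qq({\n   "clusterName" : "elasticsearch"\n}),
);
```

Because the response lines start with #, the whole log can be pasted into a shell and only the curl commands are run.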
Clinton Gormley, <drtech at cpan.org>
If one of the fields that you are trying to index has the same name as the type, then you need to change the format as follows:
Instead of:
$e->set(index=>'twitter', type=>'tweet', data=> { tweet => 'My tweet', date => '2010-01-01' } );
you should include the type name in the data:
$e->set(index=>'twitter', type=>'tweet', data=> { tweet=> { tweet => 'My tweet', date => '2010-01-01' }} );
The _source key that is returned from a "get()" contains the original JSON string that was used to index the document initially. ElasticSearch parses JSON more leniently than JSON::XS, so if invalid JSON is used to index the document (eg unquoted keys) then $e->get(....) will fail with a JSON exception.
Any documents indexed via this module will not be susceptible to this problem.
scroll() is broken in version 0.6.0 of ElasticSearch.
See http://github.com/elasticsearch/elasticsearch/issues/issue/136
This is an alpha module, so there will be bugs, and the API is likely to change in the future, as the API of ElasticSearch itself changes.
If you have any suggestions for improvements, or find any bugs, please report them to http://github.com/clintongormley/ElasticSearch.pm/issues. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
Hopefully I'll be adding an ElasticSearch::QueryBuilder (similar to SQL::Abstract) which will make it easier to generate valid queries for ElasticSearch.
You can find documentation for this module with the perldoc command.
perldoc ElasticSearch
You can also look for information at:
GitHub
http://github.com/clintongormley/ElasticSearch.pm
RT: CPAN's request tracker
http://rt.cpan.org/NoAuth/Bugs.html?Dist=ElasticSearch
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/ElasticSearch
CPAN Ratings
http://cpanratings.perl.org/d/ElasticSearch
Search CPAN
http://search.cpan.org/dist/ElasticSearch/
Thanks to Shay Banon, the ElasticSearch author, for producing an amazingly easy to use search engine.
Copyright 2010 Clinton Gormley.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.