NAME

Elastic::Manual::Searching - Which search method to use with a View, and how to use the results.

VERSION

version 0.28

DESCRIPTION

Once you have configured your view correctly, you need to call a method on it in order to produce search results.

The three main methods are search(), scroll(), and scan(). All three methods return an iterator, but each method has a different purpose. The correct method should be chosen to match the situation.

This document discusses how to use the returned iterator, and when to use which search method.

RESULTS ITERATOR

Iterator basics

All three search methods return an iterator based on Elastic::Model::Role::Results and Elastic::Model::Role::Iterator, which works pretty much like any iterator, e.g.:

    $it->first;         # first element
    $it->next;          # next element
    $it->prev;          # previous element
    $it->last;          # last element
    $it->current;       # current element
    $it->shift;         # return first element and remove it from $it

    $it->all;           # all elements
    $it->slice(0,10);   # elements 0..9

    $it->peek_next;     # return next element but don't move the cursor
    $it->peek_prev;     # return previous element but don't move the cursor

    $it->has_next;      # 1 / 0
    $it->has_prev;      # 1 / 0
    $it->is_first;      # 1 / 0
    $it->is_last;       # 1 / 0
    $it->even;          # 1 / 0
    $it->odd;           # 1 / 0
    $it->parity;        # even / odd

    $it->size;          # number of elements in $it
    $it->total;         # total number of matching docs
    $it->facets;        # any facets that were requested

What elements can the iterator return?

What's different about these iterators is the elements that they return. There are three options: a Result object, the object itself, or the raw data.

Depending on what you are doing, you may want any one of these three.

Choosing an element type

From any results iterator, you can return any of the three element types:

    $it->next_result;       # Result object
    $it->next_object;       # Object itself
    $it->next_element;      # Raw data

But that is verbose. By default, first(), next() etc all return Result objects, but you can change that:

    $it = $view->search;

    $it->next;              # Result object

    $it->as_objects;
    $it->next;              # Object itself

    $it->as_elements;
    $it->next;              # Raw data

    $it->as_results;
    $it->next;              # Result object

So the typical usage, if you want a list of objects back, would be:

    my $results = $view->search->as_objects;

    while ( my $object = $results->next ) {
        do_something_with($object);
    }

WHICH SEARCH METHOD SHOULD I USE WHEN?

Overview of differences

In summary:

Why do I need to choose a method?

When you create an index in Elasticsearch, it is created (by default) with 5 primary shards. Each of your docs is stored in one of those shards. It is these primary shards that allow you to scale your index size. But with the flexible scaling comes complexity.

The query process

Let's consider what happens when you run a query like: "Give me the 10 most relevant docs that match 'foo bar'".

So then, if you ask for page 10,000 (i.e. results 100,001 - 100,010), each shard has to return 100,010 docs, and the requesting node has to sort through and discard 500,040 of them!

That approach doesn't scale. More than likely the requesting node will just run out of memory and be killed. There is a good reason why search engines don't return more than 100 pages of results.
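The numbers above are easy to check with a few lines of Perl. This is only a sketch: deep_paging_cost() is a hypothetical helper, not part of Elastic::Model, and it assumes that every shard holds at least from + size matching docs.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Elastic::Model.
# $shards: number of primary shards
# $from:   number of docs to skip
# $size:   number of docs per page
sub deep_paging_cost {
    my ( $shards, $from, $size ) = @_;
    my $per_shard = $from + $size;           # each shard returns its own top N
    my $gathered  = $shards * $per_shard;    # docs held by the requesting node
    my $discarded = $gathered - $size;       # everything except the final page
    return ( $per_shard, $gathered, $discarded );
}

my ( $per_shard, $gathered, $discarded ) = deep_paging_cost( 5, 100_000, 10 );
printf "per shard: %d, gathered: %d, discarded: %d\n",
    $per_shard, $gathered, $discarded;
```

With 5 shards, skipping 100,000 docs and asking for 10, this prints 100010, 500050 and 500040. The cost grows with the offset, not with the number of docs you actually want.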

Why should I scroll? Why can't I just ask for page 2?

More than likely, your data is being updated constantly. In between your requests for page 1 and page 2, your data may have changed order, or a doc might have been added or deleted. So you could end up missing results or seeing duplicates.

Scrolling gives you consistent results. It is like paging, in that it returns size docs on each request, but Elasticsearch keeps the original data around until your scroll times out.
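A toy simulation makes the difference visible. Nothing here talks to Elasticsearch: an in-memory list stands in for the index, a plain copy of it stands in for the data that a scroll keeps around, and page() is a hypothetical helper.

```perl
use strict;
use warnings;

# Toy simulation only: an in-memory list stands in for the index.
my @index = map { "doc$_" } 1 .. 6;    # already in sort order

# Return one page of $size docs starting at offset $from.
sub page {
    my ( $docs, $from, $size ) = @_;
    my $last = $from + $size - 1;
    $last = $#$docs if $last > $#$docs;
    return $from > $last ? () : @{$docs}[ $from .. $last ];
}

my @snapshot = @index;                 # stands in for what a scroll keeps

my @page1 = page( \@index, 0, 3 );     # doc1 doc2 doc3
unshift @index, 'doc0';                # a new doc sorts to the top

my @live_page2   = page( \@index,    3, 3 );    # doc3 shows up again
my @scroll_page2 = page( \@snapshot, 3, 3 );    # continues where page 1 ended

print "page 1:            @page1\n";
print "page 2 (live):     @live_page2\n";
print "page 2 (snapshot): @scroll_page2\n";
```

Paging against the live list repeats doc3, while paging against the snapshot carries on with doc4.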

Of course, this comes at a cost: extra disk space. That means that you shouldn't make your scroll timeouts longer than they need to be. The default is 1 minute, but you may be able to reduce that considerably depending on your use case.

Sometimes consistency won't matter: it may be perfectly reasonable to show duplicates in keyword searches, but less reasonable to have duplicate or missing items in a list.

Why can't I just pull all the data in one request?

Nobody has more than 10,000 blog posts, so why not just request all the posts in a single search() and specify size => 10_000?

The answer is: memory usage.

Each shard needs to return 10,000 docs. With the default 5 primary shards, the node handling the request has to make space for 50,000 docs, then sort through them to find the top 10,000. That may be fine as a one-off request, but when you have thousands of those happening concurrently, you're going to run out of memory pretty quickly.

But I need to retrieve all 10 billion docs!

OK, now we're in a different league. You can retrieve all the docs in your index, as long as you don't need them to be sorted: use scan(). Scanning works as follows (we'll assume that size is 10, but in practice you can probably make it a lot bigger):

But I really need sorting!

Do you? Do you really? Why? No user needs to page through all 5 million of your matching results. Google only returns 1,000 results, for good reason.

OK, OK, so there may be situations where you need to retrieve large numbers of sorted results. The trick here is to break them up into chunks. For instance, you could request all docs created in October, then November etc. How you do it really depends on your requirements.
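One way to produce month-sized chunks is to generate the date ranges up front. The sketch below is plain Perl and makes no Elastic::Model calls; month_ranges() is a hypothetical helper and "created" is an assumed field name. Each range would then be applied as a filter on your view before running a sorted search.

```perl
use strict;
use warnings;

# Hypothetical helper: build half-open [from, to) date ranges, one per month,
# starting at $year-$month and covering $count months.
sub month_ranges {
    my ( $year, $month, $count ) = @_;
    my @ranges;
    for ( 1 .. $count ) {
        my ( $next_year, $next_month ) =
            $month == 12 ? ( $year + 1, 1 ) : ( $year, $month + 1 );
        push @ranges, {
            from => sprintf( '%04d-%02d-01', $year,      $month ),
            to   => sprintf( '%04d-%02d-01', $next_year, $next_month ),
        };
        ( $year, $month ) = ( $next_year, $next_month );
    }
    return @ranges;
}

for my $range ( month_ranges( 2013, 10, 3 ) ) {
    # "created" is an assumed field name; each range would become one
    # filtered, sorted request.
    print "$range->{from} <= created < $range->{to}\n";
}
```

Half-open ranges mean that a doc created exactly at midnight on the first of a month lands in one chunk only.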

DIFFERENCES BETWEEN THE METHODS

search()

    $results = $view->search;

search() retrieves the best matching results up to a maximum of size and returns them all in an Elastic::Model::Results object.

The size attribute of Elastic::Model::Results contains the number of results that are stored in the iterator. The total attribute contains the total number of matching docs in Elasticsearch.

scroll()

    $results = $view->scroll('1m');

scroll() takes a timeout parameter, which defaults to 1m (one minute). It retrieves size results and wraps them in an Elastic::Model::Results::Scrolled object.

As you iterate through the results, you will eventually call next() for a doc which isn't available in the buffer. The iterator will then request the next tranche of results from Elasticsearch. It is important to make sure that the timeout is longer than the time between requests, otherwise the iterator will throw an error and you will need to start scrolling again.
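The buffering behaviour can be pictured with a toy iterator. ToyScroll below is a hypothetical stand-in, not Elastic::Model's implementation: an array plays the part of Elasticsearch, the first tranche is fetched lazily on the first next() call, and the scroll timeout is not modelled at all.

```perl
use strict;
use warnings;

# Toy stand-in for a scrolled iterator; NOT Elastic::Model's implementation.
package ToyScroll;

sub new {
    my ( $class, %args ) = @_;
    return bless {
        source   => [ @{ $args{docs} } ],    # docs still "in Elasticsearch"
        size     => $args{size},             # docs fetched per request
        buffer   => [],                      # docs already pulled
        pos      => 0,                       # index of the next doc to return
        requests => 0,                       # number of fetches made so far
    }, $class;
}

# Pull the next tranche of up to `size` docs into the buffer.
sub _fetch {
    my ($self) = @_;
    $self->{requests}++;
    push @{ $self->{buffer} }, splice @{ $self->{source} }, 0, $self->{size};
    return;
}

# Return the next doc, fetching a new tranche when the buffer runs out.
sub next {
    my ($self) = @_;
    $self->_fetch
        if $self->{pos} >= @{ $self->{buffer} } && @{ $self->{source} };
    return if $self->{pos} >= @{ $self->{buffer} };
    return $self->{buffer}[ $self->{pos}++ ];
}

package main;

my $it = ToyScroll->new( docs => [ 1 .. 7 ], size => 3 );
my @seen;
while ( defined( my $doc = $it->next ) ) {
    push @seen, $doc;
}
print "seen: @seen, requests: $it->{requests}\n";
```

Seven docs with a size of 3 take three requests. In a real scroll, the gap between any two of those requests has to be shorter than the timeout.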

The size attribute of Elastic::Model::Results::Scrolled contains the number of docs in Elasticsearch that match the query and are available to pull (i.e. initially, it is the same as the total attribute).

scan()

    $results = $view->scan('1m');

scan() is pretty similar to scroll(). It takes a timeout parameter, which defaults to 1m (one minute). However, it retrieves a maximum of number_of_primary_shards * size results in a single request and wraps them in an Elastic::Model::Results::Scrolled object. So you may want to consider reducing the size parameter when scanning.
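To pick a size for scanning, it can help to work backwards from the batch you want per request. Both helpers below are hypothetical and only do the arithmetic; they are not part of Elastic::Model.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Elastic::Model.

# Upper bound on the docs returned by one scan request.
sub scan_batch_max {
    my ( $shards, $size ) = @_;
    return $shards * $size;
}

# Largest size that keeps one scan request at or below $target docs.
sub size_for_target {
    my ( $shards, $target ) = @_;
    my $size = int( $target / $shards );
    return $size < 1 ? 1 : $size;
}

my $batch = scan_batch_max( 5, 100 );     # size 100 on 5 shards
my $size  = size_for_target( 5, 100 );    # aim for about 100 docs per request
print "batch: $batch, size: $size\n";
```

With 5 primary shards, a size of 100 can return up to 500 docs per request, so a size of 20 is what gives you batches of about 100.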

When scrolling, there is a good chance that you want to load all of the results into memory. However, when scanning through billions of docs, you don't want to do that. Instead of using next() you should use shift():

    while ( my $result = $results->shift ) {
        do_something_with($result);
    }

This means, obviously, that prev() won't work - there is no previous doc. You've thrown it away.

When using shift(), the size attribute of Elastic::Model::Results::Scrolled starts out the same as the total attribute, but it will decrement by one for each shift() call.
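The difference between next() and shift() can be sketched with a toy in-memory iterator. ToyIterator is hypothetical and is not how Elastic::Model implements its iterators; in particular, it holds every doc in memory from the start, whereas a real scan pulls them in batches.

```perl
use strict;
use warnings;

# Toy in-memory iterator; NOT Elastic::Model's implementation.
package ToyIterator;

sub new {
    my ( $class, @docs ) = @_;
    return bless { docs => [@docs], pos => 0 }, $class;
}

# Number of docs still held by the iterator.
sub size {
    my ($self) = @_;
    return scalar @{ $self->{docs} };
}

# Advance the cursor; the doc stays in memory.
sub next {
    my ($self) = @_;
    return if $self->{pos} >= $self->size;
    return $self->{docs}[ $self->{pos}++ ];
}

# Remove and return the first doc, so that its memory can be freed.
sub shift {
    my ($self) = @_;
    return CORE::shift @{ $self->{docs} };
}

package main;

my $kept = ToyIterator->new( 1 .. 5 );
1 while defined $kept->next;           # walk to the end with next()
print "after next():    size = ", $kept->size, "\n";

my $drained = ToyIterator->new( 1 .. 5 );
$drained->shift for 1 .. 2;            # throw away the first two docs
print "after 2 shift(): size = ", $drained->size, "\n";
```

After walking the whole list with next(), all 5 docs are still held. After two shift() calls, only 3 remain.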

SEE ALSO

AUTHOR

Clinton Gormley <drtech@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2014 by Clinton Gormley.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
