NAME

KinoSearch::Docs::Tutorial::Analysis - How to choose and use Analyzers.

DEPRECATED

The KinoSearch code base has been assimilated by the Apache Lucy project. The "KinoSearch" namespace has been deprecated, but development continues under our new name at our new home: http://lucy.apache.org/

DESCRIPTION

Try swapping out the PolyAnalyzer in our Schema for a Tokenizer:

    my $tokenizer = KinoSearch::Analysis::Tokenizer->new;
    my $type = KinoSearch::Plan::FullTextType->new(
        analyzer => $tokenizer,
    );

Search for senate, Senate, and Senator before and after making the change and re-indexing.

Under PolyAnalyzer, the results are identical for all three searches, but under Tokenizer, searches are case-sensitive, and the result sets for Senate and Senator are distinct.

PolyAnalyzer

PolyAnalyzer processes text more aggressively than Tokenizer. In addition to tokenizing, it converts all text to lower case so that searches are case-insensitive, and it applies a "stemming" algorithm to reduce related words to a common stem (senat, in this case).

PolyAnalyzer is actually multiple Analyzers wrapped up in a single package. In this case, it's three-in-one, since specifying a PolyAnalyzer with language => 'en' is equivalent to this snippet:

    my $case_folder  = KinoSearch::Analysis::CaseFolder->new;
    my $tokenizer    = KinoSearch::Analysis::Tokenizer->new;
    my $stemmer      = KinoSearch::Analysis::Stemmer->new( language => 'en' );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer, $stemmer ], 
    );
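To see why the three searches collapse into one under this chain, here is a rough sketch in plain Perl. It does not use KinoSearch at all: case_fold(), tokenize(), toy_stem() and analyze() are hypothetical stand-ins invented for this illustration, and toy_stem() only strips a couple of suffixes, whereas the real Stemmer uses a full stemming algorithm.

```perl
use strict;
use warnings;

# Toy stand-ins for the three stages. These are NOT KinoSearch's algorithms;
# toy_stem() handles only the suffixes needed for this example.
sub case_fold { my ($text)  = @_; return lc $text; }
sub tokenize  { my ($text)  = @_; return $text =~ /(\w+)/g; }
sub toy_stem  { my ($token) = @_; $token =~ s/(?:ors?|es?)$//; return $token; }

# Apply the stages in the same order as the analyzers array above.
sub analyze {
    my ($text) = @_;
    return map { toy_stem($_) } tokenize( case_fold($text) );
}

for my $query (qw( senate Senate Senator )) {
    my @tokenizer_only = tokenize($query);
    my @full_chain     = analyze($query);
    printf "%-8s tokenizer only: %-8s full chain: %s\n",
        $query, "@tokenizer_only", "@full_chain";
}
```

With only the tokenizing step, the three queries stay distinct terms; after the full chain, all three come out as senat, so they match the same index entries.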

You can add or subtract Analyzers from there if you like. Try adding a fourth Analyzer, a Stopalizer for suppressing "stopwords" like "the", "if", and "maybe".

    my $stopalizer = KinoSearch::Analysis::Stopalizer->new( 
        language => 'en',
    );
    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer, $stopalizer, $stemmer ], 
    );
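As a rough picture of what the extra stage does, the following plain-Perl sketch filters a token list against a stoplist. It is only an illustration under stated assumptions: the three-word stoplist is made up from the words named above, remove_stopwords() is a hypothetical helper, and the real Stopalizer works from its own per-language list.

```perl
use strict;
use warnings;

# Hypothetical three-word stoplist for illustration only.
my %is_stopword = map { $_ => 1 } qw( the if maybe );

# Drop every token that appears in the stoplist.
sub remove_stopwords { return grep { !$is_stopword{$_} } @_; }

# Case-fold and tokenize first, then filter.
my @tokens   = map { lc } 'Maybe the Senate will vote' =~ /(\w+)/g;
my @filtered = remove_stopwords(@tokens);
print "@filtered\n";    # prints "senate will vote"
```

In this toy version the stoplist is lower case, so filtering only works after case folding; "Maybe" would slip through otherwise. That is one plausible reason to place the stopword stage after the CaseFolder and Tokenizer, as in the snippet above.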

Also, try removing the Stemmer.

    my $polyanalyzer = KinoSearch::Analysis::PolyAnalyzer->new(
        analyzers => [ $case_folder, $tokenizer ], 
    );

The original choice of a stock English PolyAnalyzer probably still yields the best results for this document collection, but you get the idea: sometimes you want a different Analyzer.

When the best Analyzer is no Analyzer

Sometimes you don't want an Analyzer at all. That was true for our "url" field because we didn't need it to be searchable, but it's also true for certain types of searchable fields. For instance, "category" fields are often set up to match exactly or not at all, as are fields like "last_name" (because you may not want to conflate results for "Humphrey" and "Humphries").

To specify that there should be no analysis performed at all, use StringType:

    my $type = KinoSearch::Plan::StringType->new;
    $schema->spec_field( name => 'category', type => $type );

Highlighting up next

In our next tutorial chapter, KinoSearch::Docs::Tutorial::Highlighter, we'll add highlighted excerpts from the "content" field to our search results.

COPYRIGHT AND LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
