NAME

KinoSearch::Docs::IRTheory - Crash course in information retrieval.

DEPRECATED

The KinoSearch code base has been assimilated by the Apache Lucy project. The "KinoSearch" namespace has been deprecated, but development continues under our new name at our new home: http://lucy.apache.org/

ABSTRACT

Just enough Information Retrieval theory to find your way around KinoSearch.

Terminology

KinoSearch uses some terminology from the field of information retrieval which may be unfamiliar to many users. "Document" and "term" mean pretty much what you'd expect them to, but others such as "posting" and "inverted index" need a formal introduction:

Since KinoSearch is a practical implementation of IR theory, it loads these abstract, distilled definitions down with useful traits. For instance, a "posting" in its most rarefied form is simply a term-document pairing; in KinoSearch, the class KinoSearch::Index::Posting::MatchPosting fills this role. However, by associating additional information with a posting, such as the number of times the term occurs in the document, we can turn it into a ScorePosting, which makes it possible to rank documents by relevance rather than just list the documents which happen to match, in no particular order.
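To make the difference between a bare term-document pairing and a posting that carries a term count concrete, here is a minimal sketch in Python. It is not KinoSearch code and does not use its API: the corpus `DOCS` and the helpers `build_match_index` and `build_score_index` are hypothetical names invented for illustration, and tokenizing by whitespace is an assumption made to keep the example short.

```python
from collections import defaultdict

# Hypothetical toy corpus (doc id -> text); not part of KinoSearch.
DOCS = {
    1: "skate park near the park",
    2: "the park bench",
    3: "skate shop",
}

def build_match_index(docs):
    """Inverted index of bare postings: term -> sorted doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():  # assumption: whitespace tokenization
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def build_score_index(docs):
    """Inverted index of richer postings: term -> {doc id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)

match_index = build_match_index(DOCS)
score_index = build_score_index(DOCS)

# Bare postings can only say which documents match.
print(match_index["park"])

# Postings with a term count let us order the matches.
postings = score_index["park"]
print(sorted(postings, key=postings.get, reverse=True))
```

With the first index, documents 1 and 2 both match "park" and nothing distinguishes them. With the second, document 1 can be listed first because the term occurs in it twice.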

TF/IDF ranking algorithm

KinoSearch uses a variant of the well-established "Term Frequency / Inverse Document Frequency" weighting scheme. A thorough treatment of TF/IDF is too ambitious for our present purposes, but in a nutshell, it means that...

A web search for "tf idf" will turn up many excellent explanations of the algorithm.
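For readers who prefer code to prose, the following Python sketch shows one common textbook form of TF/IDF. It is an assumption-laden illustration, not the variant KinoSearch implements: the formula (term frequency divided by document length, multiplied by `log(N / df) + 1`), the corpus `DOCS`, and the function `tf_idf_scores` are all hypothetical choices made for this example.

```python
import math

# Hypothetical toy corpus (doc id -> text); not part of KinoSearch.
DOCS = {
    1: "skate park",
    2: "park bench in the park",
    3: "city park map",
    4: "dog park",
}

def tf_idf_scores(query, docs):
    """Score every document against the query with a textbook TF/IDF formula."""
    tokenized = {doc_id: text.split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for doc_id, tokens in tokenized.items():
        score = 0.0
        for term in query.split():
            # Term frequency, normalized by document length.
            tf = tokens.count(term) / len(tokens)
            # Document frequency: how many documents contain the term.
            df = sum(1 for other in tokenized.values() if term in other)
            if df:
                # Rare terms get a larger inverse document frequency.
                idf = math.log(n_docs / df) + 1.0
                score += tf * idf
        scores[doc_id] = score
    return scores

scores = tf_idf_scores("skate park", DOCS)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy corpus "park" appears in every document while "skate" appears in only one, so the single document containing "skate" ranks first. Among the documents that match only "park", the two-word document outscores the three-word one because its single occurrence makes up a larger share of the text.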

COPYRIGHT AND LICENSE

Copyright 2007-2011 Marvin Humphrey

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
