NAME

Parse::GutenbergRoget - parse Project Gutenberg's Roget's Thesaurus

VERSION

version 0.022

  $Id$

SYNOPSIS

 use Parse::GutenbergRoget;

 my %section = parse_roget("./roget15a.txt");

 print $section{1}{subsections}[0]{groups}[0]{entries}[0]{text}; # existence

DESCRIPTION

A Roget's Thesaurus is more than the simple synonym/antonym finder included in many dictionary sets. It organizes words into semantically related categories, so that words with related meanings can be found in proximity to one another, with the level of proximity indicating the level of similarity.

Project Gutenberg has produced an etext of the 1911 edition of Roget's Thesaurus and, in 1991, began to revise it. While it's not the best Roget-style thesaurus available, it's the best public domain electronic thesaurus data source I've found.

This module parses the file's contents into a Perl data structure, which can then be stored in systems for searching and browsing it. This module does not implement those systems.

The code is not complete: not everything that can be parsed is being parsed yet. It's also important to realize that not everything is going to be parseable. The file contains too many typos and broken rules which, due to the lousy nature of the rules, create ambiguity. For a description of these rules see "RULES" below.

FUNCTIONS

parse_roget($filename)

This function, exported by default, will attempt to open, read, and parse the named file as a Project Gutenberg Roget's Thesaurus. It has only been tested with roget15a.txt, which is not included in the distribution, because it's too big.

It returns a hash with the following structure:

 %section = (
   ...
   '100a' => {
     major => 100, # major and minor form section identity
     minor => 'a',
     name  => 'Fraction',
     comments    => [ 'Less than one' ],
     subsections => [
       {
         type   => 'N', # these entries are nouns
         groups => [
           { entries => [
             { text => 'fraction' },
             { text => 'fractional part' }
           ] },
           { entries => [ { text => 'part &c. 51' } ] }
         ]
       },
       {
         type   => 'Adj',
         groups => [ { entries => [ ... ] } ]
       }
     ]
   }
   ...
 );
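As a rough sketch of how this structure can be walked, the example below flattens one section into "type: text" lines. The %section hash here is built by hand to mirror the sample above rather than produced by parse_roget, and section_lines is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Hand-built sample mirroring the documented structure (not real parser output).
my %section = (
  '100a' => {
    major => 100,
    minor => 'a',
    name  => 'Fraction',
    comments    => [ 'Less than one' ],
    subsections => [
      {
        type   => 'N',
        groups => [
          { entries => [ { text => 'fraction' }, { text => 'fractional part' } ] },
          { entries => [ { text => 'part &c. 51' } ] },
        ],
      },
    ],
  },
);

# Hypothetical helper: flatten one section into "type: text" lines by
# descending subsections -> groups -> entries.
sub section_lines {
  my ($section) = @_;
  my @lines;
  for my $subsection (@{ $section->{subsections} }) {
    for my $group (@{ $subsection->{groups} }) {
      for my $entry (@{ $group->{entries} }) {
        push @lines, "$subsection->{type}: $entry->{text}";
      }
    }
  }
  return @lines;
}

# Print each section identifier and name, followed by its entries.
for my $id (sort keys %section) {
  print "$id $section{$id}{name}\n";
  print "  $_\n" for section_lines($section{$id});
}
```

Group boundaries are dropped in this flattening. Since proximity within a group is what signals similarity, code that cares about it would keep the groups as separate lists.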

This structure isn't pretty or perfect, and is subject to change. All of its elements are shown here, with one exception, which is also the next likely subject for change: flags. Entries may have flags, in addition to text, which note things like "French" or "archaic". Entries (or possibly groups) will also gain cross reference attributes, replacing the ugly "&c. XX" text. I'd also like to deal with references to other subsections, which come in the form "&c. Adj." There isn't any reason for these to be needed, I think.
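Until cross references are parsed by the module itself, a caller could pull numeric "&c. XX" references out of entry text on its own. The following is only a sketch under assumptions: split_xref and the xref key are hypothetical names, the sample strings are made up apart from 'part &c. 51', and the pattern assumes a reference is a number with an optional letter (as in section '100a') at the end of the text. References of the "&c. Adj." form are left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: split trailing "&c. NN" off an entry's text.
# Returns a hashref with text, plus xref when a numeric reference is found.
sub split_xref {
  my ($text) = @_;
  if ($text =~ /\A(.*?)\s*&c\.\s*(\d+[a-z]?)\s*\z/) {
    return { text => $1, xref => $2 };
  }
  return { text => $text };   # no numeric reference: leave text as-is
}

# Made-up inputs for illustration; only 'part &c. 51' comes from the sample.
for my $raw ('fraction', 'part &c. 51', 'fractional &c. Adj.') {
  my $entry = split_xref($raw);
  printf "%-20s xref=%s\n", $entry->{text}, $entry->{xref} // '(none)';
}
```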

parse_sections($filename)

This function is used internally by parse_roget to read the named file, returning the above structure, parsed only to the section level.

bloom_sections(\%sections)

Given a reference to the section hash, this subroutine expands the sections into subsections, groups, and entries.

THE FILE

TODO

Well, a good first step would be a TODO section.

I'll write some tests that will only run if you put a roget15a.txt file in the right place. I'll also try the tests with previous revisions of the file.

I'm also tempted to produce newer revisions on my own, after I contact the address listed in the file. The changes would just be to eliminate anomalies that prevent parsing. Distraction by shiny objects may prevent this goal.

The flags and cross reference bits above will be implemented.

The need for Text::CSV_XS may be eliminated.

Entries with internal quoting (especially common in phrases) will no longer become UNPARSED.

I'll try to eliminate more UNKNOWN subsection types.

AUTHOR

Ricardo Signes, <rjbs@cpan.org>

BUGS

Please report any bugs or feature requests to bug-parse-gutenbergroget@rt.cpan.org, or through the web interface at http://rt.cpan.org. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

COPYRIGHT

Copyright 2004 Ricardo Signes, All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
