Module Version: 0.05

NAME

Lingua::ZH::CCDICT - An interface to the CCDICT Chinese dictionary

SYNOPSIS

  use Lingua::ZH::CCDICT;

  my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory' );

DESCRIPTION

This module provides a Perl interface to the CCDICT dictionary created by Thomas Chin. This dictionary is indexed by Unicode character number (traditional character), and contains information about these characters.

CCDICT is released under a Creative Commons Attribution License (version 2.5).

The dictionary contains the following information, though not all information is available for every character.

In addition, the dictionary contains English definitions (often multiple definitions), and romanizations for the character in different languages and systems. The romanizations available include Pinjim for Hakka, Jyutping for Cantonese and (Hanyu) Pinyin for Mandarin.

DICTIONARY BUGS

The CCDICT dictionary is distributed by Thomas Chin in a simple but non-standard ASCII-only text format. I've tried to work around errors and ambiguities in the dictionary data, although there are probably still oddities lurking. Please send bug reports to me so I can figure out whether the error is in my code or in the dictionary itself.

STORAGE

This module is capable of parsing the CCDICT format file, and can also store the data in other formats (just Berkeley DB for now).

Each storage system is implemented via a module in the Lingua::ZH::CCDICT::Storage::* class hierarchy. All of these modules are subclasses of the Lingua::ZH::CCDICT class, and implement its methods for searching the dictionary.

Some storage classes may also offer additional methods.

Storage Subclasses

The following storage subclasses are available: Lingua::ZH::CCDICT::Storage::InMemory and Lingua::ZH::CCDICT::Storage::BerkeleyDB.

USAGE

This module allows you to look up information in the dictionary based on a number of keys. These include the Unicode character (as a character, not its number), stroke count, radical number, and any of the various romanization systems.

METHODS

This class provides the following methods.

Lingua::ZH::CCDICT->new(...)

This method always takes at least one parameter, "storage". This indicates what storage subclass to use. The current options are "InMemory" and "BerkeleyDB".

Any other parameters given will be passed to the appropriate subclass's new() method.

$dict->parse_source_file($filename)

If you don't specify a file, then it will use the data file distributed with this module. This is probably what you want, unless you have a local copy of the dictionary that you want to work with. Note that the dictionary format has changed a fair bit between versions, so this probably won't work with much older or newer versions of the CCDICT data.

This method does the real work of creating a dictionary. Note that if you are not using the InMemory storage subclass, you only need to parse the source file once, and then you can reuse the stored data.

MATCH METHODS

When doing a lookup based on the romanization of a character, the tone is indicated with a number at the end of the syllable, rather than with a Unicode character that combines the Latin letter with a diacritic.

In addition, lookups based on a Pinyin romanization should use the u-with-umlaut character (character 252 in Unicode) rather than two "u" characters.

The return value for any lookup will be an object in a Lingua::ZH::CCDICT::ResultSet subclass.

Result sets always return matches in ascending Unicode character order.
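The romanization conventions above can be sketched as follows. This is a minimal illustration, not a reference usage: the syllables "su4" and "nü3" are example keys chosen here, and the lowercase spelling is an assumption about how the dictionary data is keyed. The eval/require guard is only there so that the snippet exits cleanly where the module is not installed.

```perl
use strict;
use warnings;

# The tone is a trailing digit; u-with-umlaut is character 252 ("\x{FC}").
# Both syllables are illustrative keys, not values taken from this page.
my @keys = ( 'su4', "n\x{FC}3" );

# Guard the load so the sketch still runs where the module is absent.
if ( eval { require Lingua::ZH::CCDICT; 1 } ) {
    my $dict    = Lingua::ZH::CCDICT->new( storage => 'InMemory' );
    my $results = $dict->match_pinyin(@keys);

    # Lookups are documented to return a ResultSet subclass.
    print ref($results), "\n";
}
else {
    print "Lingua::ZH::CCDICT is not installed; skipping lookup\n";
}
```

Writing the umlaut as "\x{FC}" keeps the source file ASCII-only, which sidesteps any question about the encoding of the script itself.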

$ccdict->match_unicode(@chars)

This method matches on one or more Unicode characters. Unicode characters should be given as Perl characters (e.g. chr(0x7D20)), not as numbers.

The dictionary index uses traditional Chinese characters. Simplified character lookups will not work (but you could use Encode::HanConvert to convert simplified to traditional first).
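A character lookup might look like the sketch below, assuming the module and its bundled data file are installed. It uses only the calls documented on this page (new() and match_unicode()), and prints the class of the returned object rather than relying on any result set methods, which are not described here.

```perl
use strict;
use warnings;

# Build the lookup key as a Perl character, not as the number 0x7D20.
my $char = chr(0x7D20);

# Guard the load so the sketch still runs where the module is absent.
if ( eval { require Lingua::ZH::CCDICT; 1 } ) {
    my $dict    = Lingua::ZH::CCDICT->new( storage => 'InMemory' );
    my $results = $dict->match_unicode($char);

    # Lookups are documented to return a ResultSet subclass.
    print ref($results), "\n";
}
else {
    print "Lingua::ZH::CCDICT is not installed; skipping lookup\n";
}
```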

$ccdict->match_radical(@numbers)

Given a set of numbers, this method returns those characters containing the specified radical(s).

$ccdict->match_index(@numbers)

Given a set of numbers, this method returns those characters containing the specified index(es).

$ccdict->match_stroke_count(@numbers)

Given a set of numbers, this method returns those characters containing the specified number(s) of strokes.
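The numeric lookups all take a list of numbers, as in this hedged sketch. The values are illustrative only (radical 120 and stroke counts of 9 and 10), and how several numbers in one call are combined is not stated on this page, so the snippet makes no claim about which characters come back.

```perl
use strict;
use warnings;

# Illustrative values: Kangxi radical 120 and stroke counts of 9 and 10.
my @radicals      = (120);
my @stroke_counts = ( 9, 10 );

# Guard the load so the sketch still runs where the module is absent.
if ( eval { require Lingua::ZH::CCDICT; 1 } ) {
    my $dict = Lingua::ZH::CCDICT->new( storage => 'InMemory' );

    # Each call takes a list of numbers and returns its own result set.
    my $by_radical = $dict->match_radical(@radicals);
    my $by_strokes = $dict->match_stroke_count(@stroke_counts);

    print ref($_), "\n" for $by_radical, $by_strokes;
}
else {
    print "Lingua::ZH::CCDICT is not installed; skipping lookups\n";
}
```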

$ccdict->match_cangjie(@codes)

Given a set of Cangjie codes, this method returns the character(s) for those code(s).

$ccdict->match_four_corner(@codes)

Given a set of Four Corner codes, this method returns the character(s) for those code(s).

$ccdict->match_pinjim(@romanizations)

$ccdict->match_jyutping(@romanizations)

$ccdict->match_pinyin(@romanizations)

$ccdict->all_characters()

Returns a result set containing all of the characters in the dictionary.

$ccdict->entry_count()

Returns the number of entries in the dictionary.

ENVIRONMENT VARIABLES

There are several environment variables you can set to change this module's behavior.

AUTHOR

David Rolsky <autarch@urth.org>

COPYRIGHT

Copyright (c) 2002-2007 David Rolsky. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

CCDICT is copyright (c) 1995-2006 Thomas Chin.

SEE ALSO

Lingua::ZH::CEDICT - for converting between Chinese and English.

Encode::HanConvert - for converting between simplified and traditional characters in various character sets.

http://www.chinalanguage.com/dictionaries/CCDICT/ - the home of the CCDICT dictionary.
