NAME

Data::Kanji::Tomoe - parse the data files of the Tomoe project

SYNOPSIS

    my $tomoe = Data::Kanji::Tomoe->new (
        tomoe_data_file => '/path/to/data/file',
        character_callback => \& user_callback,
    );
    $tomoe->parse ();

VERSION

This documents Data::Kanji::Tomoe version 0.05 corresponding to git commit 00597f0d29d6219cfd038cacac393c4c64b4fbe2 released on Wed Jan 11 09:17:37 2017 +0900.

DESCRIPTION

This Perl module parses the kanji or hanzi data files supplied with the Tomoe "handwriting recognition engine".

The data itself is not supplied with this module.

The parsing is based on XML::Parser. It breaks the Tomoe data into individual characters, and calls a subroutine supplied by the user with the data for each character.

METHODS

new

    use Data::Kanji::Tomoe;
    my $obj = Data::Kanji::Tomoe->new (
        tomoe_data_file => '/path/to/data/file',
        character_callback => \& user_callback,
    );

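Create the object. The arguments are a list of key/value pairs. The key tomoe_data_file, giving the name of the data file to be parsed, must be supplied. The key character_callback gives a reference to the subroutine to be called for each character.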

parse

    $tomoe->parse ();

Parse the XML in the Tomoe data file.

As each <character>...</character> element is parsed from the file, the subroutine specified by character_callback is called in the form

    &{$callback} ($obj, $character);

where $character is a hash reference with the following keys and values.

utf8

Value: the character itself.

strokes

Value: an array reference containing the strokes of the character. Each element of the array is a reference to an array of the points of one stroke. Each point is itself a reference to a two-element array holding its x and y values. So, for example, if the original Tomoe data consists of

    <strokes>
        <stroke>
            <point x="1" y="2"/>
            <point x="3" y="4"/>
        </stroke>
        <stroke>
            <point x="5" y="6"/>
        </stroke>
    </strokes>

then $character->{strokes} contains something like

    [[[1, 2], [3, 4]], [[5, 6]]]
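
The nested structure can be walked with ordinary array dereferencing. The following is a minimal sketch, assuming the layout shown above; stroke_summary is a hypothetical helper and is not part of Data::Kanji::Tomoe, and the character is built by hand so that the example runs without a Tomoe data file.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Data::Kanji::Tomoe: return the
# number of strokes and the total number of points of one character.
sub stroke_summary
{
    my ($character) = @_;
    my $strokes = $character->{strokes};
    my $n_points = 0;
    for my $stroke (@$strokes) {
        # Each stroke is a reference to an array of [x, y] points.
        $n_points += scalar @$stroke;
    }
    return (scalar @$strokes, $n_points);
}

# The same structure as the example above, built by hand rather than
# read from a Tomoe data file.
my $character = {
    strokes => [[[1, 2], [3, 4]], [[5, 6]]],
};
my ($n_strokes, $n_points) = stroke_summary ($character);
print "$n_strokes strokes, $n_points points\n";
```

For the data above this prints "2 strokes, 3 points".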

Any data which the user wishes to pass to the callback can be stored in the object itself:

    use Data::Kanji::Tomoe;
    my $obj = Data::Kanji::Tomoe->new (
        tomoe_data_file => '/path/to/data/file',
        character_callback => \& user_callback,
        data_I_wish_to_send => {some => 'data'},
    );
    
    $obj->parse ();
    
    sub user_callback
    {
        my ($obj, $c) = @_;
        my $data = $obj->{data_I_wish_to_send};
    }
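
The same mechanism can be used to collect results. The following is a sketch only: the key stroke_counts and the subroutine count_strokes are hypothetical names, and a plain hash reference and hand-made characters stand in for the Data::Kanji::Tomoe object and for the data which parse would supply, so that the example runs without the module or a data file.

```perl
use strict;
use warnings;

# Callback using the calling convention described above. It records
# the number of strokes of each character in a hash stored in the
# object under the hypothetical key "stroke_counts".
sub count_strokes
{
    my ($obj, $c) = @_;
    $obj->{stroke_counts}{$c->{utf8}} = scalar @{$c->{strokes}};
}

# Stand-in for the object (an assumption made for illustration). With
# the real module, stroke_counts => {} would be passed to "new".
my $obj = {stroke_counts => {}};

# Stand-ins for the characters which "parse" would pass to the callback.
my @characters = (
    {utf8 => 'A', strokes => [[[1, 2], [3, 4]], [[5, 6]]]},
    {utf8 => 'B', strokes => [[[0, 0], [9, 9]]]},
);
my $callback = \& count_strokes;
for my $c (@characters) {
    &{$callback} ($obj, $c);
}

# After parsing, the collected results are available from the object.
for my $char (sort keys %{$obj->{stroke_counts}}) {
    printf "%s: %d\n", $char, $obj->{stroke_counts}{$char};
}
```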

SEE ALSO

Tomoe

The Tomoe data is located at https://sourceforge.net/projects/tomoe/. As of version 0.05 of Data::Kanji::Tomoe, the most recent update of the software, version 0.6.0, dates from 29 June 2007.

Other sources of kanji shape data

The Tomoe data for the Japanese kanji contains many errors. Users who do not have a specific reason to use the Tomoe data are advised to use another source.

The following projects contain alternative data.

KanjiVG

A better set of data for most purposes is the data of the KanjiVG project. This is XML data using SVG (Scalable Vector Graphics) for kanji line data.

Kanji Database / CJKVI-IDS

Another project containing visual kanji data is http://kanji-database.sourceforge.net/, which contains "Ideographic Description Sequences" (IDS) rather than line data. See also https://github.com/cjkvi/cjkvi-ids for the GitHub version of this project. CJKVI-IDS uses a simple text format. This is related to GlyphWiki.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2012-2017 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.