Lingua::EO::Orthography - An orthography/substitute converter for Esperanto characters
This document describes Lingua::EO::Orthography version 0.04.
Lingua::EO::Orthography (This document)
Lingua::EO::Orthography::EO
Lingua::EO::Orthography::JA
    use utf8;
    use Lingua::EO::Orthography;

    my ($converter, $original, $converted);

    # orthographize ...
    $converter = Lingua::EO::Orthography->new;
    $original  = q(C^i-momente, la songha h'orajxo ^sprucigas aplauwdon.);
    $converted = $converter->convert($original);

    # substitute ... (X-system)
    $converter->sources([qw(orthography)]); # (accepts multiple notations)
    $converter->target('postfix_x');
    # same as above:
    # $converter = Lingua::EO::Orthography->new(
    #     sources => [qw(orthography)],
    #     target  => 'postfix_x',
    # );
    $original  = q(Ĉi-momente, la sonĝa ĥoraĵo ŝprucigas aplaŭdon);
    $converted = $converter->convert($original);
Six letters of the Esperanto alphabet do not exist in ASCII. These letters, which carry supersigns (eo: supersignoj), have often been spelled in substitute notations (eo: surogataj skribosistemoj) for historical reasons, namely the eras of typography and the typewriter. Today it is not unusual to spell them in the orthography (eo: ortografio), thanks to the spread of Unicode (eo: Unikodo). However, there are still many environments where typing them on a keyboard is difficult, and people may need to handle old documents written in a substitute notation.
This object-oriented module converts text between these notations.
This module is at the beta release stage, and the API may change. Your feedback is welcome.
The following notation names are usable in new(), add_sources(), and so on.
I plan to expand the API in the future so that you can add notations other than these.
orthography
Ĉ ĉ Ĝ ĝ Ĥ ĥ Ĵ ĵ Ŝ ŝ Ŭ ŭ (\x{108} \x{109} \x{11C} \x{11D} \x{124} \x{125} \x{134} \x{135} \x{15C} \x{15D} \x{16C} \x{16D})
This is the orthography of the Esperanto alphabet. The converter handles the letters with supersigns, all of which exist in Unicode. The character encoding is UTF-8.
Unless you have a particular reason not to, you should use the orthography today, because Unicode is now sufficiently widespread. Perl 5.8.1 or later also handles it correctly.
I recommend that you work with UTF-8 flagged strings throughout your program and convert strings only at the boundaries, that is, on input from and output to the outside (on demand). This lets functions such as length() work correctly when the utf8 pragma is on. It is the same principle as that of Encode and the Perl IO layers.
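The decode-on-input, encode-on-output principle can be sketched with the core Encode module. This is a minimal illustration that does not use Lingua::EO::Orthography at all; the byte string below merely stands in for data read from a file or socket.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 bytes as they would arrive from outside the program
# (these seven bytes spell the six-letter word "ĉapelo").
my $bytes = "\xC4\x89apelo";

# Decode once at the input boundary to get a character string.
my $chars = decode('UTF-8', $bytes);

# length() counts bytes before decoding and characters after it.
my $byte_count = length $bytes;
my $char_count = length $chars;

# Encode once at the output boundary.
my $output = encode('UTF-8', $chars);

print "$byte_count bytes, $char_count characters\n";
```

Strings passed to convert() would be the decoded form (`$chars` here), not the raw bytes.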
zamenhof
Ch ch Gh gh Hh hh Jh jh Sh sh U u
It is a substitute notation that places h as a postfix, except for u, which takes no postfix.
It was suggested by Dr. Zamenhof, the father of Esperanto, in Fundamento de Esperanto, and people call it the Zamenhof system (eo: Zamenhofa sistemo). For this reason, it has also been called the second orthography, but it is not used very much today.
It has a problem: in several words, such as 'flughaveno' (en: 'airport'), a string that spans a boundary between roots ('flug/haven/o') looks like a substituted string. This module does not avoid this problem at present.
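The ambiguity can be reproduced with a naive substitution. The following is only a sketch of the problem, not this module's implementation; the lookup table and the function name `naive_from_h` are hypothetical and cover lowercase letters only.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical lookup table for the 'zamenhof' notation (lowercase only,
# to keep the sketch short); this is not the module's internal table.
my %h_system = (
    ch => 'ĉ', gh => 'ĝ', hh => 'ĥ', jh => 'ĵ', sh => 'ŝ',
);

# Replace every digraph without looking at word structure.
sub naive_from_h {
    my ($text) = @_;
    $text =~ s/([cghjs]h)/$h_system{$1}/g;
    return $text;
}

my $intended  = naive_from_h('chapelo');     # the digraph really is a substitute
my $ambiguous = naive_from_h('flughaveno');  # "gh" here spans flug/haven/o

binmode STDOUT, ':encoding(UTF-8)';
print "$intended $ambiguous\n";
```

The second word comes out as 'fluĝaveno', which is why a lexicon of exceptions would be needed to treat such words correctly.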
capital_zamenhof
CH ch GH gh HH hh JH jh SH sh U u
It is a variant of 'zamenhof' notation.
It places a capital H as a postfix of a capital letter.
postfix_h
Ch ch Gh gh Hh hh Jh jh Sh sh Uw uw
It is an extended notation of 'zamenhof' notation.
It places w as a postfix of u.
People called it H-system (eo: H-sistemo).
postfix_capital_h
CH ch GH gh HH hh JH jh SH sh UW uw
It is a variant of 'postfix_h' notation.
It places a capital H or W as a postfix of a capital letter.
postfix_x
Cx cx Gx gx Hx hx Jx jx Sx sx Ux ux
It is a substitute notation, which places x as a postfix.
People called it X-system (eo: X-sistemo, iksa sistemo).
People widely use it as a substitute notation because x does not exist in the Esperanto alphabet and is used only to write non-Esperanto words in their original spelling.
postfix_capital_x
CX cx GX gx HX hx JX jx SX sx UX ux
It is a variant of 'postfix_x' notation.
It places a capital X as a postfix of a capital letter.
postfix_caret
C^ c^ G^ g^ H^ h^ J^ j^ S^ s^ U^ u^
It is a substitute notation, which places a caret ^ as a postfix.
People called it caret system (eo: ĉapelita sistemo).
People often use it as a substitute notation because the caret has the same shape as the circumflex.
At present, this module does not support the variant that writes u~ instead of u^.
postfix_apostrophe
C' c' G' g' H' h' J' j' S' s' U' u'
It is a substitute notation, which places an apostrophe ' as a postfix.
prefix_caret
^C ^c ^G ^g ^H ^h ^J ^j ^S ^s ^U ^u
It is a substitute notation, which places a caret ^ as a prefix.
Lingua::EO::Supersignoj is available on CPAN. It provides functionality corresponding to this module's.
The following table compares them:
    Viewpoints                  ::Supersignoj    ::Orthography                Note
    --------------------------  ---------------  ---------------------------  ----
    Version                     0.02             0.04
    Can convert @lines          Yes              No                           *1
    Have accessors              Yes              Yes, and it has utilities    *2
    Can customize notation      Only 'u'         No (under consideration)     *3
    Can treat 'flughaveno'      No               No (under consideration)     *4
    API language                eo: Esperanto    en: English
    Can convert as N:1          No               Yes                          *5
    Speed                       Satisfied        About 400% faster            *6
    Immediate dependencies      1 (0 in core)    6 (2 in core)                *7
    Whole dependencies          1 (0 in core)    15 (8 in core)               *7
    Test case number            3                93                           *8
    License                     Unknown          Perl (Artistic or GNU GPL)
    Last modified on            Mar. 2003        Mar. 2010
To convert @lines with Lingua::EO::Orthography:
@converted_lines = map { $converter->convert($_) } @original_lines;
Lingua::EO::Orthography has utility methods, namely all_sources(), add_sources() and remove_sources().
I plan to design the API of this feature as follows:
    $converter = Lingua::EO::Orthography->new(
        notations => {
            postfix_asterisk => [qw(C* c* G* g* H* h* J* j* S* s* U* u*)],
        },
    );
    $notations_ref = $converter->notations;
    @notations     = $converter->all_notations;
    @notations     = $converter->notations({
        postfix_underscore => [qw(C_ c_ G_ g_ H_ h_ J_ j_ S_ s_ U_ u_)],
    });
    $converter->add_notations(
        postfix_diacritics => [qw(C^ c^ G^ g^ H^ h^ J^ j^ S^ s^ U~ u~)],
    );

    $converter = Lingua::EO::Orthography->new(
        ignore_words => [qw(
            bushaltejo
            flughaveno
            Kinghaio
            ...
        )],
    );
    $ignore_words_ref = $converter->ignore_words;
    @ignore_words     = $converter->all_ignore_words;
    @ignore_words     = $converter->ignore_words([qw(kuracherbo)]);
    $converter->add_ignore_words([qw(
        longhara
        navighalto
        ...
    )]);
In my experience, a practical application often needs to accept multiple notations.
I included an example in the distribution. Lingua::EO::Orthography can convert a string into the orthography in a single pass, as in examples/converter.pl. The counterpart for Lingua::EO::Supersignoj is examples/correspondent.pl; in that case, you must convert the string repeatedly, once for each source notation.
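The idea of N:1 conversion in one pass can be sketched as a single alternation built from a merged table. This is an illustration under assumptions, not the module's code: the table `%to_orthography` is hypothetical and covers only two letters in three notations.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical merged table covering three notations (only "ĉ" and "ŝ",
# for brevity); the module's real tables are not shown in this document.
my %to_orthography = (
    'cx' => 'ĉ', 'c^' => 'ĉ', '^c' => 'ĉ',
    'sx' => 'ŝ', 's^' => 'ŝ', '^s' => 'ŝ',
);

# One alternation over all keys lets a single pass handle every notation.
my $alternation = join '|', map { quotemeta($_) } sort keys %to_orthography;
my $pattern     = qr/($alternation)/;

sub to_orthography {
    my ($text) = @_;
    $text =~ s/$pattern/$to_orthography{$1}/g;
    return $text;
}

# The input mixes postfix_x, postfix_caret and prefix_caret.
my $converted = to_orthography('cxu s^i ^sercas?');

binmode STDOUT, ':encoding(UTF-8)';
print "$converted\n";
```

A 1:1 converter would need one pass per source notation to reach the same result.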
Lingua::EO::Orthography converts strings about 400% faster than Lingua::EO::Supersignoj.
The difference comes from caching, with Memoize, the regular expression pattern and the character conversion table used to replace strings. Furthermore, Lingua::EO::Orthography can convert characters from multiple notations at once.
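A rough sketch of this kind of caching with the core Memoize module follows. The function names `build_converter` and `from_x` and the counter `$builds` are hypothetical and exist only for this illustration; the module's own internals may be organized differently.

```perl
use strict;
use warnings;
use utf8;
use Memoize;

my $builds = 0;    # counts how often the table is really built

# Hypothetical builder: returns a conversion table and a matching pattern
# for the X-system (lowercase only, for brevity).
sub build_converter {
    my ($notation) = @_;
    $builds++;
    my %table = (cx => 'ĉ', gx => 'ĝ', hx => 'ĥ', jx => 'ĵ', sx => 'ŝ', ux => 'ŭ');
    my $alternation = join '|', map { quotemeta($_) } sort keys %table;
    return { table => \%table, pattern => qr/($alternation)/ };
}

# After this call, repeated requests with the same argument reuse the
# first result instead of rebuilding it.
memoize('build_converter');

sub from_x {
    my ($text) = @_;
    my $converter = build_converter('postfix_x');
    $text =~ s/$converter->{pattern}/$converter->{table}{$1}/g;
    return $text;
}

my @converted = map { from_x($_) } qw(cxapelo sxipo auxto);
print "table built $builds time(s) for ", scalar @converted, " conversions\n";
```

The saving grows with the number of convert calls, since the pattern and table are built only on the first one.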
See examples/benchmark.pl in this distribution.
The dependency counts come from http://deps.cpantesters.org/.
These numbers exclude modules needed only for building and testing.
All dependencies of Lingua::EO::Orthography are well regarded, and I agree with those recommendations.
Nevertheless, I am considering reducing the dependencies. I have already given up making this module depend on namespace::clean, namespace::autoclean, and so on.
This number excludes the author's tests.
new
$converter = Lingua::EO::Orthography->new(%init_arg);
Returns a Lingua::EO::Orthography object, which is a converter.
Accepts a hash of conversion settings. You can specify sources and/or target as keys of the hash.
sources => \@source_notations
Accepts an array reference or :all as source notations.
:all is equivalent to zamenhof, capital_zamenhof, postfix_h, postfix_capital_h, postfix_x, postfix_capital_x, postfix_caret, postfix_apostrophe and prefix_caret.
If you omit it, the converter behaves as if you had specified :all.
The converter throws an exception if you specify a value other than :all or an array reference, if the array is empty, or if it contains an unknown notation or undef.
target => $target_notation
Accepts a string as target notation.
If you omit it, the converter behaves as if you had specified orthography.
If you specify an unknown notation or undef, the converter throws an exception.
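The validation rules above can be exercised with eval. This is a small sketch that assumes the module is installed; the helper `accepts` and the notation name 'no_such_notation' are made up for the example, and the exact exception messages are not documented here, so only success or failure is checked.

```perl
use strict;
use warnings;
use Lingua::EO::Orthography;

# Try to build a converter and report whether the constructor accepted
# the arguments. 'accepts' is a hypothetical helper for this sketch.
sub accepts {
    my (%init_arg) = @_;
    return eval { Lingua::EO::Orthography->new(%init_arg); 1 } ? 1 : 0;
}

# Defaults: sources => ':all', target => 'orthography'.
my $defaults_ok = accepts();

# Explicit, known notations.
my $explicit_ok = accepts(
    sources => [qw(postfix_x postfix_h)],
    target  => 'orthography',
);

# An empty source list and an unknown target are documented to throw.
my $empty_sources = accepts(sources => []);
my $bad_target    = accepts(target => 'no_such_notation');

print "defaults=$defaults_ok explicit=$explicit_ok ",
      "empty_sources=$empty_sources bad_target=$bad_target\n";
```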
$source_notations_ref = $converter->sources;
Returns source notations as an array reference. If you want to get it as a list, you can use all_sources().
$source_notations_ref = $converter->sources(\@notations);
Accepts an array reference as source notations. You can use the same notations as in the new() constructor.
The return value is the same as when no argument is passed.
$target_notation = $converter->target;
Returns target notation as a scalar.
$target_notation = $converter->target($notation);
Accepts a string as target notation. You can use the same notations as in the new() constructor.
convert
$converted_string = $converter->convert($original_string);
Accepts a string, converts it, and returns the converted string. The argument is not modified by this method; that is, the method has no side effect on it. The conversion is based on the notations specified in the new() constructor or through the sources() and target() accessors.
Strings are case-sensitive. That is, in the 'postfix_x' notation the converter does not regard cX as a substitute notation and does not convert it.
Argument strings should have the UTF8 flag on. The returned string also has the flag on.
A URL or an e-mail address may contain a string that can be confused with a substitute notation. If you do not want to convert it, split() the sentence into words and run convert() on each word. This lets you exclude strings that include :// or @ from the conversion. See RFC 2396 and 3986 for URIs, and RFC 5321 and 5322 for e-mail addresses. A concrete example is in examples/ignore_addresses.pl in the distribution.
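The word-by-word approach might look like the sketch below. It is not a copy of examples/ignore_addresses.pl; the function `convert_except_addresses` and the stand-in converter `$from_x` are hypothetical, chosen so that the sketch runs without the module installed.

```perl
use strict;
use warnings;
use utf8;

# Apply $convert to each whitespace-separated word, but leave words that
# contain "://" or "@" untouched. $convert is any code reference that maps
# a string to a string.
sub convert_except_addresses {
    my ($convert, $text) = @_;
    return join q{}, map {
        (/\S/ && !m{://|\@}) ? $convert->($_) : $_
    } split /(\s+)/, $text;    # the capture keeps the original whitespace
}

# Stand-in converter for the sketch (X-system, lowercase only). With the
# module you would pass sub { $converter->convert($_[0]) } instead.
my %x_system = (cx => 'ĉ', gx => 'ĝ', hx => 'ĥ', jx => 'ĵ', sx => 'ŝ', ux => 'ŭ');
my $from_x   = sub { (my $s = shift) =~ s/([cghjsu]x)/$x_system{$1}/g; $s };

my $result = convert_except_addresses(
    $from_x,
    'Vidu http://example.com/sxux kaj skribu al sxi@example.org pri cxio.',
);

binmode STDOUT, ':encoding(UTF-8)';
print "$result\n";
```

Only the last word is converted; the URL and the address keep their "sx" and "ux" sequences.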
all_sources
@all_source_notations = $converter->all_sources;
Returns source notations as a list. If you want to get it as an array reference, you can use sources().
add_sources
$source_notations_ref = $converter->add_sources(@adding_notations);
Adds the notations passed as a list to the source notations. You can use the same notations as in the new() constructor.
Returns source notations as an array reference.
remove_sources
$source_notations_ref = $converter->remove_sources(@removing_notations);
Removes the notations passed as a list from the source notations. You can use the same notations as in the new() constructor.
Returns the remaining source notations as an array reference.
At least one notation must remain after the removal. If you remove all notations, the converter throws an exception.
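The utility methods can be combined as in the following sketch, which assumes the module is installed and relies only on the behavior described above. The order of the returned notations is not documented here, so the example checks membership rather than position.

```perl
use strict;
use warnings;
use Lingua::EO::Orthography;

my $converter = Lingua::EO::Orthography->new(sources => [qw(postfix_x)]);

# Extend the accepted source notations, then narrow them again.
$converter->add_sources(qw(postfix_h postfix_caret));
my @after_add = $converter->all_sources;

$converter->remove_sources(qw(postfix_h));
my @after_remove = $converter->all_sources;

# Removing every remaining notation is documented to throw, so guard it.
my $survived = eval { $converter->remove_sources(@after_remove); 1 };
my $error    = $survived ? undef : $@;

print "after add:    @after_add\n";
print "after remove: @after_remove\n";
print "removing everything failed as documented\n" if !$survived;
```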
L. L. Zamenhof, Fundamento de Esperanto, 1905
http://en.wikipedia.org/wiki/Esperanto_orthography
Lingua::EO::Supersignoj
http://freshmeat.net/projects/eoconv/
None reported.
No bugs have been reported.
Please report any found bugs, feature requests, and ideas for improvements to <bug-lingua-eo-orthography at rt dot cpan dot org>, or through the web interface at http://rt.cpan.org/Public/Bug/Report.html?Queue=Lingua-EO-Orthography. I will be notified, and then you'll automatically be notified of progress on your bugs/requests as I make changes.
<bug-lingua-eo-orthography at rt dot cpan dot org>
When reporting bugs, if possible, please add as small a sample as you can make of the code that produces the bug. And of course, suggestions and patches are welcome.
You can find documentation for this module with the perldoc command.
% perldoc Lingua::EO::Orthography
The Esperanto edition of documentation is also available.
% perldoc Lingua::EO::Orthography::EO
You can also find the Japanese edition of documentation for this module with the perldocjp command from Pod::PerldocJp.
% perldocjp Lingua::EO::Orthography::JA
You can also look for information at:
http://rt.cpan.org/Public/Dist/Display.html?Name=Lingua-EO-Orthography
http://annocpan.org/dist/Lingua-EO-Orthography
http://search.cpan.org/dist/Lingua-EO-Orthography
http://cpanratings.perl.org/dist/Lingua-EO-Orthography
This module is maintained using git. You can get the latest version from git://github.com/gardejo/p5-lingua-eo-orthography.git.
I use Devel::Cover to test the code coverage of my tests. Below is the Devel::Cover summary report on this distribution's test suite.
    ---------------------------- ------ ------ ------ ------ ------ ------ ------
    File                           stmt   bran   cond    sub    pod   time  total
    ---------------------------- ------ ------ ------ ------ ------ ------ ------
    .../Lingua/EO/Orthography.pm  100.0  100.0  100.0  100.0  100.0  100.0  100.0
    Total                         100.0  100.0  100.0  100.0  100.0  100.0  100.0
    ---------------------------- ------ ------ ------ ------ ------ ------ ------
More tests
Fewer dependencies
To provide an API to add user's notation
To correctly treat words such as flughaveno (flug/haven/o) in 'postfix_h' notation with user's lexicon
To correctly treat words such as ankaŭ in 'zamenhof' notation with user's lexicon
To release a Moose friendly class such as Lingua::EO::Orthography::Moosified
<moriya at cpan dot org>, http://gardejo.org/
<moriya at cpan dot org>
Juerd Waalboer wrote Lingua::EO::Supersignoj, which this module refers to.
Copyright (c) 2010 MORIYA Masaki, alias Gardejo
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlgpl and perlartistic.
The full text of the license can be found in the LICENSE file included with this distribution.
To install Lingua::EO::Orthography, copy and paste the appropriate command into your terminal.
cpanm
cpanm Lingua::EO::Orthography
CPAN shell
    perl -MCPAN -e shell
    install Lingua::EO::Orthography
For more information on module installation, please visit the detailed CPAN module installation guide.