#!/usr/bin/env perl
#
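# Romanizes the CJK strings found in the sample paragraphs below and
# wraps the text with Unicode::LineBreak.
# Output from a sample run is appended after __END__.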

use utf8;
use strict;
use 5.10.1;
use autodie;
# Warnings are made fatal at run time by entrapment() instead;
# FATAL warnings here would garble the compiler's messages.
use warnings; # qw[ FATAL all   ];
use open        qw[ :utf8 :std  ];
use charnames   qw[ :full       ];

use Carp        qw[ carp croak cluck confess ];
use Encode      qw[ decode      ];

use constant SHOW_UNIHAN => 1;

# UAX#24 et alios
use Unicode::UCD qw{
    charinfo charblock charscript
    charblocks charscripts charinrange
    general_categories bidi_types
    compexcl casefold casespec namedseq
};

use Unicode::Normalize qw[ NFC NFD ];   # UAX#15
use Unicode::Unihan;                    # UAX#38
use Unicode::GCString;                  # UAX#29
use Unicode::LineBreak qw(:all);        # UAX#14-C2

use Lingua::KO::Hangul::Util    qw[ :all ];

use Lingua::JA::Romanize::Japanese;
use Lingua::ZH::Romanize::Pinyin;
use Lingua::KO::Romanize::Hangul;

sub utf8::is_utf8;  # this one is built-in!

sub apply_tones($$);
sub banner;
sub banner_paragraph($$);
sub char_inform(_);
sub entrapment();
sub hangul(_);
sub pinyin(_);
sub romaji(_);
sub said($$);
sub tabbed_sizing;
sub utf8(_);
sub wrap_line($);
sub wrap_paragraph($);

entrapment();

my $vietnamese = <<'DONE WITH VIETNAMESE';
    For example, in Vietnamese both tree leaves and the sky are xanh
    (to distinguish, one may use xanh lá cây "leaf grue" for green and
    xanh dương "ocean grue" for blue).
DONE WITH VIETNAMESE

my $thai = <<'DONE WITH THAI';
    In the Thai language, เขียว (khiaw) means green except when referring
    to the sky or the sea, when it means blue; เขียวชอุ่ม (khiaw cha-um), เขี
    ยวขจี (khiaw khachi), and เขียวแปร๊ด (khiaw praed) have all meant either intense
    blue or garish green, although the latter is becoming more usual as
    the language 'learns' to distinguish blue and green.
DONE WITH THAI

my $chinese = <<'DONE WITH CHINESE';

$path = "婴儿服饰";

漢字 kanji
東京 Tōkyō
京都 Kyōto

春曉  孟浩然	Ceon1 Hiu2  Maang6 Hou6jin4

春眠不覺曉,	Ceon1 min4 bat1 gok3 hiu2,

處處聞啼鳥。	cyu3 cyu3 man4 tai4 niu5.

夜來風雨聲,	Je6 loi4 fung1 jyu5 sing1,

花落知多少?	faa1 lok6 zi1 do1 siu2?


    Due to its status as a prestige dialect, it often called "Standard Cantonese" (simplified Chinese: 标准粤语; traditional Chinese: 標準粵語; Jyutping: biu1zeon2 jyut6jyu5; Guangdong Romanization:Biu1 zên2 yud6 yu5).

    Chinese has a word 青 (qīng) that can refer to both, though it also has
    separate words for blue (蓝 / 藍, lán) and green (绿 / 綠, lǜ).

    The modern Chinese language has the blue-green distinction (蓝/ 藍
    lán for blue and 绿 / 緑 lǜ for green); however, another word which
    predates the modern vernacular, qīng (Chinese: 青), is also used. It
    can refer to either blue or green, or even (though much less
    frequently) to black, as in xuánqīng (Chinese: 玄青 where Chinese: 玄
    refers to black). For example, the Flag of the Republic of China is
    today still referred to as qīng tiān, bái rì, mǎn dì hóng ("Blue
    Sky, White Sun, Whole Ground Red" — Chinese: 青天,白日,满地红);
    whereas qīng cài (Chinese: 青菜) is the Chinese word for "green
    vegetable". Qīng 青 was the traditional designation of both blue and
    green for much of the history of the Chinese language, while 蓝 lán
    and 绿 lǜ were introduced relatively more recently, as a part of the
    adoption of modern Vernacular Chinese as the social norm, replacing
    Classical Chinese.

    The Chinese version is simply, 妈妈给我们的钱,我已经买了糖了. Māma gěi wǒmen de qián, wǒ yǐjīng mǎile táng le. This is translated somewhat directly as, "The money Mom gave us, I already bought candy," lacking a preface as in English.

    Another major difference between the syntax of Chinese and languages like English lies in the stacking order of modifying clauses. 昨天发脾气的外交警察取消了沒有交钱的那些人的入境证. Zuótiān fāpíqì de wàijiāo jǐngchá qǔxiāole méiyǒu jiāoqián de nàxiē rén de rùjìngzhèng. Using the Chinese order in English, that sentence would be [blah blah blah].

    Tones http://en.wikipedia.org/wiki/Cantonese_phonology#Tones

    Guangzhou Cantonese has seven tones, but Hong Kong Cantonese has six with
    the high-falling tone merging with the high tone. Although it is often said
    to have nine or eleven, the additional tones added in the counts are
    entering tones. In Chinese, the number of possible tones depends on the
    syllable type. There are six contour tones in syllables that end in a vowel
    or nasal consonant. (Some of these have more than one realization, but such
    differences are seldom used to distinguish words.) In syllables that end in
    a stop consonant, the number of tones is reduced to three; in Chinese
    descriptions, these "entering tones" are treated separately, so that
    Cantonese is traditionally said to have nine tones. However, phonetically
    these are a conflation of tone and syllable type; the number of phonemic
    tones is six in Hong Kong and seven in Guangzhou.

    Syllable type       Open syllables                                                                     Stopped syllables

                                            d.=dark, l.=light, U.=upper, L.=lower
	Tone name	d.level	      d.rising	   d.depart.	 l.level	l.rising     l.depart.	   U. d.enter. L. d.enter.  l.enter.
			(陰平)	      (陰上)	   (陰去)	 (陽平)		(陽上)	     (陽去)	   (上陰入)    (下陰入)	    (陽入)
    Description		high level    medium	   medium	 low falling	low	     low	   high		medium	    low
			high falling  rising	   level	 v.l. level	rising	     level	   level	level	    level
    Yale or Jyutping
	 tone number	1	      2		    3		  4		5	      6		    7 (or 1)	8 (or 3)    9 (or 6)
      Example		詩	      史	    試		  時		市	      是	    識		錫	    食
      Tone letter	siː˥, siː˥˧   siː˧˥	    siː˧	  siː˨˩, siː˩	siː˩˧	      siː˨	    sɪk˥	sɛːk˧	    sɪk˨
      IPA diacritic	síː, sîː      sǐː	    sīː		  si̖ː, sı̏ː	  si̗ː		 sìː	       sɪ́k	    sɛ̄ːk	 sɪ̀k
      Yale diacritic	sī, sì	      sí	    si		  sīh, sìh	síh	      sih	    sīk		sek	    sihk


For purposes of meters in Chinese poetry, the first and fourth tones are the "level tones" (平聲), while the rest are the "oblique tones" (仄聲).

The first tone can be either high level or high falling usually without affecting the meaning of the words being spoken. Most speakers are in general not consciously aware of when they use and when to use high level and high falling. In Hong Kong, most speakers have merged the high and high falling tones. In Guangzhou, the high falling tone is disappearing as well, but is still prevalent among certain words (saam-high falling means the number three, where saam-high means shirt).

The numbers "394052786" when pronounced in Cantonese, will give the nine tones in order (Romanisation (Yale) saam1, gau2, sei3, ling4, ng5, yi6, chat7, baat8, luk9), thus giving a good mnemonic for remembering the nine tones.

DONE WITH CHINESE

my $korean = <<'DONE WITH KOREAN';

漢字 kanji
東京 Tōkyō
京都 Kyōto

    The Korean word 푸르다 (pureuda) can mean either green or blue.

    The native Korean word 푸르다 (Revised Romanization: pureu-da adj.) may
    mean either blue or green, or bluish green. This word 푸르다 is used as
    in 푸른 하늘 (pureun haneul, blue sky) for blue or as in 푸른 숲 (pureun
    sup, green forest) for green. Distinct words for blue and green are
    also used; 파란 (paran adj.), 파란색/파랑 (paransaek/parang n.) for blue,
    초록 (chorok adj./n.), 초록색 (choroksaek n. or for short, 녹색 noksaek n.)
    for green. However, in the case of a traffic light, paran is used
    for the green light meaning go, even though the word is typically
    used to mean blue. Cheong 청 is also used for both blue and green. It
    is a loan from Chinese (靑, pinyin: qing) and is used in the proper
    name Cheong Wa Dae (청와대 or Hanja: 靑瓦臺), the Blue House, which is the
    executive office and official residence of the President of the
    Republic of Korea.
DONE WITH KOREAN

my $japanese = <<'DONE WITH JAPANESE';

    文字化け mojibake


呼ぶ

    Tokyo (東京 Tōkyō, "Eastern Capital"), officially Tokyo Metropolis (東京都 Tōkyō-to),is one of the 47 prefectures of Japan. It is located on the eastern side of the main island Honshū and includes the Izu Islands and Ogasawara Islands. Tokyo Metropolis was formed in 1943 from the merger of the former Tokyo Prefecture (東京府 Tōkyō-fu) and the city of Tokyo (東京市 Tōkyō-shi).

    Kyoto (京都 Kyōto) (Japanese pronunciation: [kʲoːto]) is a city in the central part of the island of Honshū, Japan. 

    The Japanese language is written with a combination of three scripts: Chinese characters called kanji (漢字), and two syllabic scripts made up of modified Chinese characters, hiragana (ひらがな or 平仮名) and katakana (カタカナ or 片仮名). The Latin alphabet, rōmaji (ローマ字), is also often used in modern Japanese, especially for company names and logos, advertising, and when entering Japanese text into a computer.

The main distinction in Japanese accents is
between Tokyo-type 	(東京式 	Tōkyō-shiki?)
and Kyoto-Osaka-type 	(京阪式 	Keihan-shiki?),
though Kyūshū-type dialects form a third, smaller group.

    In Japanese, the word for blue (青 ao) is often used for colors that
    English speakers would refer to as green, such as the color of a
    traffic signal meaning "go".

    The Japanese word ao (青, n., aoi (青い, adj.)), exactly the same
    kanji character as the Chinese qīng above, can refer to either blue
    or green depending on the situation. Modern Japanese has also
    adopted the Chinese word for green (緑 midori), although this was
    not always so. Ancient Japanese did not have this distinction: the
    word midori only came into use in the Heian period, and at that time
    (and for a long time thereafter) midori was still considered a shade
    of ao. Educational materials distinguishing green and blue only came
    into use after World War II, during the Occupation[citation needed]:
    thus, even though most Japanese consider them to be green, the word
    ao is still used to describe certain vegetables, apples and
    vegetation. Ao is also the name for the color of a traffic light,
    which is bluer than in English-speaking countries. However, most
    other objects—a green car, a green sweater, and so forth—will
    generally be called midori. Japanese people also sometimes use the
    word guriin (グリーン), based on the English word "green", for colors.
    The language also has several other words meaning specific shades of
    green and blue.

DONE WITH JAPANESE

my @langs = qw{
    Mandarin Cantonese
    JapaneseKun JapaneseOn
    Korean HanyuPinlu
    Vietnamese
};


banner_paragraph(Chinese => $chinese);

if (SHOW_UNIHAN) {
    char_inform for $chinese =~ /[^\0-\x7f]/g;
    print "\n";
}

for my $cjk_string ($chinese =~ /\p{Han}+/g) {
    wrap_line sprintf "CJK %s is %s in Mandarin.",  $cjk_string, said(Mandarin => $cjk_string);
    wrap_line sprintf "CJK %s is %s in Cantonese.", $cjk_string, said(Cantonese => $cjk_string);
    wrap_line sprintf "CJK %s is %s in Pinyin.\n\n",    $cjk_string, pinyin($cjk_string);
}


banner_paragraph(Korean => $korean);

if (SHOW_UNIHAN) {
    char_inform for $korean =~ /[^\0-\x7f]/g;
    print "\n";
}

for my $cjk_string ($korean =~ /\p{Han}+/g) {
    wrap_line sprintf "CJK %s is %s in Korean.", $cjk_string, said(Korean => $cjk_string);
}


for my $hangul_string ($korean =~ /\p{Hangul}+/g) {
    wrap_line sprintf "Korean %s is %s.", $hangul_string, hangul($hangul_string);
    next;   # everything below is disabled; remove this line to try it

    ##################
    # internal romanization from the Hangul syllable names

    my @namelist = ();
    for my $char (split //, $hangul_string) {
        for(getHangulName(ord $char)) {
            next unless length;
            s/^HANGUL SYLLABLE //;
            push @namelist, $_;
        }
    }
    printf "Korean(int) %s is %s.\n", $hangul_string, lc join("-", @namelist);
}

banner_paragraph(Japanese => $japanese);

if (SHOW_UNIHAN) {
    char_inform for $japanese =~ /[^\0-\x7f]/g;
    print "\n";
}

for my $cjk_string ($japanese =~ /\p{Han}+/g) {
    printf "CJK %s is %s in Japanese.\n", $cjk_string, said(JapaneseOn => $cjk_string);
}
print "\n";

for my $kanji_string ($japanese =~ /[\p{Han}\p{Kana}\p{InKatakana}\p{InHiragana}]+/g) {
    printf "Japanese %s is %s.\n", $kanji_string, romaji($kanji_string);
}


exit;

sub entrapment() {
    $SIG{__DIE__} = sub {
        croak "\n$0: death trap caught exception: @_" unless $^S;
    };
    $SIG{__WARN__} = sub {
        # dunno why, but carp and cluck do the same thing here
        my $kvetch = "\n$0: warn trap caught warning: @_";
        warn $kvetch;
        die "fatal(ized) warning";
    };
}

sub banner {
    say "\n====@_====\n";
}


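# Sizing callback for Unicode::LineBreak: advance past leading tabs
# using 8-column tab stops, then measure the remainder with strsize.
sub tabbed_sizing {
    my ($self, $cols, $pre, $spc, $str) = @_;
    my $spcstr = $spc.$str;
    while ($spcstr =~ s/^( *)(\t+)//) {
        $cols += length($1);
        $cols += length($2) * 8 - $cols % 8;
    }
    $cols += $self->strsize(0, '', '', $spcstr);
    return $cols;
}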

sub banner_paragraph($$) {
    my ($name, $text) = @_;
    banner(uc $name);
    wrap_paragraph($text);
}


UNITCHECK {

### Public configuration attributes of Unicode::LineBreak, kept for
### reference only: this variable is never read.
state $LB_default_config = {
    BreakIndent => 'YES',
    CharactersMax => 998,
    ColumnsMin => 0,
    ColumnsMax => 76,
    ComplexBreaking => 'YES',
    Context => 'NONEASTASIAN',
    Format => "SIMPLE",
    HangulAsAL => 'NO',
    LegacyCM => 'YES',
    Newline => "\n",
    SizingMethod => 'UAX11',
    TailorEA => [],
    TailorLB => [],
    UrgentBreaking => undef,
    UserBreaking => [],
};

state $formatter = Unicode::LineBreak->new(
# makes for fewer linebreaks on this dataset:
    Context => "NONEASTASIAN",      # EASTASIAN, NONEASTASIAN
    ColumnsMax => 72,
    Format => "SIMPLE",             # SIMPLE, NEWLINE, TRIM
    HangulAsAL => "YES",
    SizingMethod    => \&tabbed_sizing,  # for tab handling
    TailorLB => [
        ord("\t") => LB_SP,
        LEFT_QUOTES()  => LB_OP,
        RIGHT_QUOTES() => LB_CL,
    ],
);

sub wrap_line($) {
    my($text) = @_;
    $formatter->config(Newline => ("\n" . " " x 4));
    say $formatter->break($text);
}

sub wrap_paragraph($) {
    my ($text) = @_;
    $formatter->config(Newline => "\n");

    for (split /\R{2,}/, $text) {
        s/(?:(?![\N{NO-BREAK SPACE}\t])\p{White_Space})+/ /g;
        s/^\s+//;
        s/\s+$//;
        say $formatter->break($_), "\n";
    }

}

} # end UNITCHECK

UNITCHECK {

state $uh = Unicode::Unihan->new;

sub char_inform(_) {

    state $seen = { };

    my $string = shift;
    for my $char ( split //, $string ) {
        # next if $seen->{$char}++;
        my $ci = charinfo(ord $char);
        my $name   = $ci->{name};
        my $script = $ci->{script};
        my $cat = $ci->{category};

        my $gcs = Unicode::GCString->new($char);
        my $columns = $gcs->columns();
        #next unless $columns == 2;

        printf " %s%s U+%04X %2s", $char, " " x (2 - $columns), ord($char), $cat;
        printf " %-6s %s\n", $script, $name;

        for my $lang (@langs) {
            my @data = $uh->$lang($char);
            next unless @data && $data[0];
            # dumb thing doesn't have the utf8 flag on
            printf "  %-12s %s\n", $lang, join(", ", map { utf8 } @data);
        }
    }
}


sub said($$) {
    my ($lang, $string) = @_;
    my @retlist = ();
    for my $char ( split //, $string ) {
        my @data = $uh->$lang($char);
        next unless @data && $data[0];
        my $best = lc utf8($data[0]);
        if ($best =~ /\d/) {
            $best = apply_tones($lang, $best);
        }
        for ($best) {
            s/\h.*//;   # keep only the first whitespace-separated reading
        }
        push @retlist, $best;
    }
    return join(" ", @retlist);
}

}  # end UNITCHECK


sub apply_tones($$) {
    my ($lang, $string) = @_;

    return $string unless $string =~ / \d \b /x;

    state $mandarin_tones = {
    # don't use COMBINING TONE MARKs because they don't evaporate when NFC'd
        1 => "\N{COMBINING MACRON}",            # 1 is macron 青 qīng qing1
        2 => "\N{COMBINING ACUTE ACCENT}",      # 2 is acute  藍 lán  lan2
        3 => "\N{COMBINING CARON}",             # 3 is caron  满 mǎn  man3
        4 => "\N{COMBINING GRAVE ACCENT}",      # 4 is grave  綠 lǜ   lü4
        5 => "",   # tone 5 doesn't transliterate
    };

    state $cantonese_supers = {
        1 => "\N{SUPERSCRIPT ONE}",
        2 => "\N{SUPERSCRIPT TWO}",
        3 => "\N{SUPERSCRIPT THREE}",
        4 => "\N{SUPERSCRIPT FOUR}",
        5 => "\N{SUPERSCRIPT FIVE}",
        6 => "\N{SUPERSCRIPT SIX}",
        7 => "\N{SUPERSCRIPT SEVEN}",
        8 => "\N{SUPERSCRIPT EIGHT}",
        9 => "\N{SUPERSCRIPT NINE}",
    };

    state $cantonese_tones = {
        1 => "\N{MODIFIER LETTER EXTRA-HIGH TONE BAR}",                                  # ˥
        2 => "\N{MODIFIER LETTER MID TONE BAR}\N{MODIFIER LETTER EXTRA-HIGH TONE BAR}",  # ˧˥
        3 => "\N{MODIFIER LETTER MID TONE BAR}",                                         # ˧
        4 => "\N{MODIFIER LETTER LOW TONE BAR}\N{MODIFIER LETTER EXTRA-LOW TONE BAR}",   # ˨˩
        5 => "\N{MODIFIER LETTER EXTRA-LOW TONE BAR}\N{MODIFIER LETTER MID TONE BAR}",   # ˩˧
        6 => "\N{MODIFIER LETTER LOW TONE BAR}",                                         # ˨
        7 => "\N{MODIFIER LETTER EXTRA-HIGH TONE BAR}",                                  # ˥
        8 => "\N{MODIFIER LETTER MID TONE BAR}",                                         # ˧
        9 => "\N{MODIFIER LETTER LOW TONE BAR}",                                         # ˨
    };

    my $tones = undef;

### Something is broken with given() here
###    given ($lang) {
###        when ("Mandarin")  { $tones = $mandarin_tones  }
###        when ("Cantonese") { $tones = $cantonese_tones }
###        default            { die "unexpected language" }
###    }

    if ($lang eq "Cantonese") {
        my ($tones, $supers) = ($string, $string);
        $tones  =~ s/(\d)\b/$cantonese_tones->{$1}/g;
        $supers =~ s/(\d)\b/$cantonese_supers->{$1}/g;
        return "$supers/$tones";
    }

    if ($lang ne "Mandarin") {
        die "unknown tone language $lang";
    }

    state $digits = join("", sort keys %$mandarin_tones);
    $string = NFD($string);
    $string =~ s{
                    (?<VOWEL>       (?=[aeiou]) \X      )
                \K  (?<CODA>        [:\w] *             )
                    (?<TONE>        (?=[$digits]) \d \b )
    }{
        $mandarin_tones->{ $+{TONE} }
        . $+{CODA}
    }gexo;

    return NFC($string);
}

sub utf8(_) {
    my $str = shift();
    return utf8::is_utf8($str)
        ? $str
        : decode("UTF-8", $str);
}

sub romaji(_) {
    my $kanji = shift();
    state $conv = Lingua::JA::Romanize::Japanese->new;
    my $roman = $conv->chars($kanji);
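    for ($roman) {
        s/ //g;     # join the per-character readings into one word
        # replace a leftover prolonged sound mark with a copy of the
        # Latin letter that precedes it
        s/(\p{Latin})\K\N{KATAKANA-HIRAGANA PROLONGED SOUND MARK}/$1/g;
    }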
    return $roman;
}

sub hangul(_) {
    my $hangul = shift();
    state $conv = Lingua::KO::Romanize::Hangul->new;
    my $roman = $conv->chars($hangul);
    return $roman;
}

sub pinyin(_) {
    my $chinese = shift();
    state $conv = Lingua::ZH::Romanize::Pinyin->new;
    my $roman = $conv->chars($chinese);
    # drop any alternative readings given after a slash, keeping the first
    $roman =~ s{ / \w+ \b }{}gx;
    return apply_tones("Mandarin", $roman);
}

__END__

====CHINESE====

春曉 孟浩然	Ceon1 Hiu2 Maang6 Hou6jin4

春眠不覺曉,	Ceon1 min4 bat1 gok3 hiu2,

處處聞啼鳥。	cyu3 cyu3 man4 tai4 niu5.

夜來風雨聲,	Je6 loi4 fung1 jyu5 sing1,

花落知多少?	faa1 lok6 zi1 do1 siu2?

Due to its status as a prestige dialect, it often called "Standard 
Cantonese" (simplified Chinese: 标准粤语; traditional Chinese: 標準粵語; 
Jyutping: biu1zeon2 jyut6jyu5; Guangdong Romanization:Biu1 zên2 yud6 
yu5).

Chinese has a word 青 (qīng) that can refer to both, though it also has 
separate words for blue (蓝 / 藍, lán) and green (绿 / 綠, lǜ).

The modern Chinese language has the blue-green distinction (蓝/ 藍 lán 
for blue and 绿 / 緑 lǜ for green); however, another word which predates 
the modern vernacular, qīng (Chinese: 青), is also used. It can refer to 
either blue or green, or even (though much less frequently) to black, as 
in xuánqīng (Chinese: 玄青 where Chinese: 玄 refers to black). For 
example, the Flag of the Republic of China is today still referred to as 
qīng tiān, bái rì, mǎn dì hóng ("Blue Sky, White Sun, Whole Ground Red" 
— Chinese: 青天,白日,满地红); whereas qīng cài (Chinese: 青菜) is the 
Chinese word for "green vegetable". Qīng 青 was the traditional 
designation of both blue and green for much of the history of the 
Chinese language, while 蓝 lán and 绿 lǜ were introduced relatively more 
recently, as a part of the adoption of modern Vernacular Chinese as the 
social norm, replacing Classical Chinese.

The Chinese version is simply, 妈妈给我们的钱,我已经买了糖了. Māma gěi 
wǒmen de qián, wǒ yǐjīng mǎile táng le. This is translated somewhat 
directly as, "The money Mom gave us, I already bought candy," lacking a 
preface as in English.

Another major difference between the syntax of Chinese and languages 
like English lies in the stacking order of modifying clauses. 昨天发脾气
的外交警察取消了沒有交钱的那些人的入境证. Zuótiān fāpíqì de wàijiāo 
jǐngchá qǔxiāole méiyǒu jiāoqián de nàxiē rén de rùjìngzhèng. Using the 
Chinese order in English, that sentence would be [blah blah blah].

Tones http://en.wikipedia.org/wiki/Cantonese_phonology#Tones

Guangzhou Cantonese has seven tones, but Hong Kong Cantonese has six 
with the high-falling tone merging with the high tone. Although it is 
often said to have nine or eleven, the additional tones added in the 
counts are entering tones. In Chinese, the number of possible tones 
depends on the syllable type. There are six contour tones in syllables 
that end in a vowel or nasal consonant. (Some of these have more than 
one realization, but such differences are seldom used to distinguish 
words.) In syllables that end in a stop consonant, the number of tones 
is reduced to three; in Chinese descriptions, these "entering tones" are 
treated separately, so that Cantonese is traditionally said to have nine 
tones. However, phonetically these are a conflation of tone and syllable 
type; the number of phonemic tones is six in Hong Kong and seven in 
Guangzhou.

Syllable type Open syllables Stopped syllables

d.=dark, l.=light, U.=upper, L.=lower 	Tone name	d.level	 
d.rising	 d.depart.	 l.level	l.rising l.depart.	 
U. d.enter. L. d.enter. l.enter. 			(陰平)	 (陰上)	 
(陰去)	 (陽平)		(陽上)	 (陽去)	 (上陰入) (下陰入)	 (陽入) 
Description		high level medium	 medium	 low falling	
low	 low	 high		medium	 low 			high 
falling rising	 level	 v.l. level	rising	 level	 level	level	 
level Yale or Jyutping 	 tone number	1	 2		 3		 
4		5	 6		 7 (or 1)	8 (or 3) 9 (or 
6) Example		詩	 史	 試		 時		
市	 是	 識		錫	 食 Tone letter	siː˥, siː˥˧ 
siː˧˥	 siː˧	 siː˨˩, siː˩	siː˩˧	 siː˨	 sɪk˥	sɛːk˧	 sɪk˨ 
IPA diacritic	síː, sîː sǐː	 sīː		 si̖ː, sı̏ː	 si̗ː		 
sìː	 sɪ́k	 sɛ̄ːk	 sɪ̀k Yale diacritic	sī, sì	 sí	 si		 
sīh, sìh	síh	 sih	 sīk		sek	 sihk

For purposes of meters in Chinese poetry, the first and fourth tones are 
the "level tones" (平聲), while the rest are the "oblique tones" (仄聲).

The first tone can be either high level or high falling usually without 
affecting the meaning of the words being spoken. Most speakers are in 
general not consciously aware of when they use and when to use high 
level and high falling. In Hong Kong, most speakers have merged the high 
and high falling tones. In Guangzhou, the high falling tone is 
disappearing as well, but is still prevalent among certain words (saam-
high falling means the number three, where saam-high means shirt).

The numbers "394052786" when pronounced in Cantonese, will give the nine 
tones in order (Romanisation (Yale) saam1, gau2, sei3, ling4, ng5, yi6, 
chat7, baat8, luk9), thus giving a good mnemonic for remembering the 
nine tones.

CJK 春曉 is chūn xǐao in Mandarin.
CJK 春曉 is ceon¹/ceon˥ hiu²/hiu˧˥ in Cantonese.
CJK 春曉 is chūn xǐao in Pinyin.


CJK 孟浩然 is mèng hào rán in Mandarin.
CJK 孟浩然 is maang⁶/maang˨ hou⁵ jin⁴/jin˨˩ in Cantonese.
CJK 孟浩然 is mèng hào rán in Pinyin.


CJK 春眠不覺曉 is chūn mían bù júe xǐao in Mandarin.
CJK 春眠不覺曉 is ceon¹/ceon˥ min⁴/min˨˩ bat¹ gaau³ hiu²/hiu˧˥ in 
    Cantonese.
CJK 春眠不覺曉 is chūn mían bú jìao xǐao in Pinyin.


CJK 處處聞啼鳥 is chù chù wén tí nǐao in Mandarin.
CJK 處處聞啼鳥 is cyu² cyu² man⁴ tai⁴/tai˨˩ niu⁵/niu˩˧ in Cantonese.
CJK 處處聞啼鳥 is chǔ chǔ wén tí nǐao in Pinyin.


CJK 夜來風雨聲 is yè lái fēng yǔ shēng in Mandarin.
CJK 夜來風雨聲 is je⁶/je˨ lai⁴ fung¹ jyu⁵ seng¹ in Cantonese.
CJK 夜來風雨聲 is yè lái fēng yǔ shēng in Pinyin.


CJK 花落知多少 is hūa lùo zhī dūo shǎo in Mandarin.
CJK 花落知多少 is faa¹/faa˥ laai⁶ zi¹ do¹/do˥ siu² in Cantonese.
CJK 花落知多少 is hūa là zhī dūo shǎo in Pinyin.


CJK 标准粤语 is bīao zhǔn yùe yǔ in Mandarin.
CJK 标准粤语 is biu¹/biu˥ zeon²/zeon˧˥ jyut⁶/jyut˨ jyu⁵/jyu˩˧ in 
    Cantonese.
CJK 标准粤语 is biao zhǔn yue yu in Pinyin.


CJK 標準粵語 is bīao zhǔn yùe yǔ in Mandarin.
CJK 標準粵語 is biu¹/biu˥ zeon²/zeon˧˥ jyut⁶/jyut˨ jyu⁵ in Cantonese.
CJK 標準粵語 is bīao zhǔn yùe yǔ in Pinyin.


CJK 青 is qīng in Mandarin.
CJK 青 is ceng¹ in Cantonese.
CJK 青 is qīng in Pinyin.


CJK 蓝 is lán in Mandarin.
CJK 蓝 is laam⁴/laam˨˩ in Cantonese.
CJK 蓝 is la in Pinyin.


CJK 藍 is lán in Mandarin.
CJK 藍 is laam⁴/laam˨˩ in Cantonese.
CJK 藍 is lán in Pinyin.


CJK 绿 is lǜ in Mandarin.
CJK 绿 is luk⁶/luk˨ in Cantonese.
CJK 绿 is lu in Pinyin.


CJK 綠 is lǜ in Mandarin.
CJK 綠 is luk⁶/luk˨ in Cantonese.
CJK 綠 is lù: in Pinyin.


CJK 蓝 is lán in Mandarin.
CJK 蓝 is laam⁴/laam˨˩ in Cantonese.
CJK 蓝 is la in Pinyin.


CJK 藍 is lán in Mandarin.
CJK 藍 is laam⁴/laam˨˩ in Cantonese.
CJK 藍 is lán in Pinyin.


CJK 绿 is lǜ in Mandarin.
CJK 绿 is luk⁶/luk˨ in Cantonese.
CJK 绿 is lu in Pinyin.


CJK 緑 is lǜ in Mandarin.
CJK 緑 is  in Cantonese.
CJK 緑 is 緑 in Pinyin.


CJK 青 is qīng in Mandarin.
CJK 青 is ceng¹ in Cantonese.
CJK 青 is qīng in Pinyin.


CJK 玄青 is xúan qīng in Mandarin.
CJK 玄青 is jyun⁴/jyun˨˩ ceng¹ in Cantonese.
CJK 玄青 is xúan qīng in Pinyin.


CJK 玄 is xúan in Mandarin.
CJK 玄 is jyun⁴/jyun˨˩ in Cantonese.
CJK 玄 is xúan in Pinyin.


CJK 青天 is qīng tīan in Mandarin.
CJK 青天 is ceng¹ tin¹/tin˥ in Cantonese.
CJK 青天 is qīng tīan in Pinyin.


CJK 白日 is bái rì in Mandarin.
CJK 白日 is baak⁶/baak˨ jat⁶/jat˨ in Cantonese.
CJK 白日 is bái rì in Pinyin.


CJK 满地红 is mǎn dì hóng in Mandarin.
CJK 满地红 is mun⁵/mun˩˧ dei⁶ hung⁴/hung˨˩ in Cantonese.
CJK 满地红 is man de gong in Pinyin.


CJK 青菜 is qīng cài in Mandarin.
CJK 青菜 is ceng¹ coi³/coi˧ in Cantonese.
CJK 青菜 is qīng cài in Pinyin.


CJK 青 is qīng in Mandarin.
CJK 青 is ceng¹ in Cantonese.
CJK 青 is qīng in Pinyin.


CJK 蓝 is lán in Mandarin.
CJK 蓝 is laam⁴/laam˨˩ in Cantonese.
CJK 蓝 is la in Pinyin.


CJK 绿 is lǜ in Mandarin.
CJK 绿 is luk⁶/luk˨ in Cantonese.
CJK 绿 is lu in Pinyin.


CJK 妈妈给我们的钱 is mā mā gěi wǒ men de qían in Mandarin.
CJK 妈妈给我们的钱 is maa¹/maa˥ maa¹/maa˥ kap¹/kap˥ ngo⁵/ngo˩˧ mun⁴/
    mun˨˩ di¹ cin² in Cantonese.
CJK 妈妈给我们的钱 is ma ma gei wǒ men de qian in Pinyin.


CJK 我已经买了糖了 is wǒ yǐ jīng mǎi le táng le in Mandarin.
CJK 我已经买了糖了 is ngo⁵/ngo˩˧ ji⁵/ji˩˧ ging¹/ging˥ maai⁵/maai˩˧ liu⁵/
    liu˩˧ tong² liu⁵/liu˩˧ in Cantonese.
CJK 我已经买了糖了 is wǒ yǐ jing mai le táng le in Pinyin.


CJK 昨天发脾气的外交警察取消了沒有交钱的那些人的入境证 is zúo tīan fā pí 
    qì de wài jīao jǐng chá qǔ xīao le méi yǒu jīao qían de nà xīe rén de rù 
    jìng zhèng in Mandarin.
CJK 昨天发脾气的外交警察取消了沒有交钱的那些人的入境证 is zok³ tin¹/tin˥ 
    faat³/faat˧ pei⁴/pei˨˩ hei³/hei˧ di¹ ngoi⁶ gaau¹/gaau˥ ging²/ging˧˥ 
    caat³/caat˧ ceoi²/ceoi˧˥ siu¹/siu˥ liu⁵/liu˩˧ mut⁶/mut˨ jau⁵ gaau¹/gaau˥ 
    cin² di¹ aa⁶ se¹/se˥ jan⁴/jan˨˩ di¹ jap⁶/jap˨ ging²/ging˧˥ zing³/zing˧ 
    in Cantonese.
CJK 昨天发脾气的外交警察取消了沒有交钱的那些人的入境证 is zúo tīan fa pí 
    qì de wài jīao jǐng chá qǔ xīao le méi yǒu jīao qian de nǎ xīe rén de rù 
    jìng zheng in Pinyin.


CJK 陰平 is yīn píng in Mandarin.
CJK 陰平 is jam¹/jam˥ peng⁴ in Cantonese.
CJK 陰平 is yīn píng in Pinyin.


CJK 陰上 is yīn shàng in Mandarin.
CJK 陰上 is jam¹/jam˥ soeng⁵ in Cantonese.
CJK 陰上 is yīn shǎng in Pinyin.


CJK 陰去 is yīn qù in Mandarin.
CJK 陰去 is jam¹/jam˥ heoi² in Cantonese.
CJK 陰去 is yīn qù in Pinyin.


CJK 陽平 is yáng píng in Mandarin.
CJK 陽平 is joeng⁴/joeng˨˩ peng⁴ in Cantonese.
CJK 陽平 is yáng píng in Pinyin.


CJK 陽上 is yáng shàng in Mandarin.
CJK 陽上 is joeng⁴/joeng˨˩ soeng⁵ in Cantonese.
CJK 陽上 is yáng shǎng in Pinyin.


CJK 陽去 is yáng qù in Mandarin.
CJK 陽去 is joeng⁴/joeng˨˩ heoi² in Cantonese.
CJK 陽去 is yáng qù in Pinyin.


CJK 上陰入 is shàng yīn rù in Mandarin.
CJK 上陰入 is soeng⁵ jam¹/jam˥ jap⁶/jap˨ in Cantonese.
CJK 上陰入 is shǎng yīn rù in Pinyin.


CJK 下陰入 is xìa yīn rù in Mandarin.
CJK 下陰入 is haa⁵ jam¹/jam˥ jap⁶/jap˨ in Cantonese.
CJK 下陰入 is xìa yīn rù in Pinyin.


CJK 陽入 is yáng rù in Mandarin.
CJK 陽入 is joeng⁴/joeng˨˩ jap⁶/jap˨ in Cantonese.
CJK 陽入 is yáng rù in Pinyin.


CJK 詩 is shī in Mandarin.
CJK 詩 is si¹/si˥ in Cantonese.
CJK 詩 is shī in Pinyin.


CJK 史 is shǐ in Mandarin.
CJK 史 is si²/si˧˥ in Cantonese.
CJK 史 is shǐ in Pinyin.


CJK 試 is shì in Mandarin.
CJK 試 is si³ in Cantonese.
CJK 試 is shì in Pinyin.


CJK 時 is shí in Mandarin.
CJK 時 is si⁴/si˨˩ in Cantonese.
CJK 時 is shí in Pinyin.


CJK 市 is shì in Mandarin.
CJK 市 is si⁵/si˩˧ in Cantonese.
CJK 市 is shì in Pinyin.


CJK 是 is shì in Mandarin.
CJK 是 is si⁶/si˨ in Cantonese.
CJK 是 is shì in Pinyin.


CJK 識 is shi in Mandarin.
CJK 識 is sik¹ in Cantonese.
CJK 識 is shì in Pinyin.


CJK 錫 is xí in Mandarin.
CJK 錫 is sek³ in Cantonese.
CJK 錫 is xí in Pinyin.


CJK 食 is shí in Mandarin.
CJK 食 is ji⁶ in Cantonese.
CJK 食 is shí in Pinyin.


CJK 平聲 is píng shēng in Mandarin.
CJK 平聲 is peng⁴ seng¹ in Cantonese.
CJK 平聲 is píng shēng in Pinyin.


CJK 仄聲 is zè shēng in Mandarin.
CJK 仄聲 is zak¹/zak˥ seng¹ in Cantonese.
CJK 仄聲 is zè shēng in Pinyin.



====KOREAN====

The Korean word 푸르다 (pureuda) can mean either green or blue.

The native Korean word 푸르다 (Revised Romanization: pureu-da adj.) may 
mean either blue or green, or bluish green. This word 푸르다 is used as 
in 푸른 하늘 (pureun haneul, blue sky) for blue or as in 푸른 숲 (pureun 
sup, green forest) for green. Distinct words for blue and green are also 
used; 파란 (paran adj.), 파란색/파랑 (paransaek/parang n.) for blue, 
초록 (chorok adj./n.), 초록색 (choroksaek n. or for short, 녹색 noksaek 
n.) for green. However, in the case of a traffic light, paran is used 
for the green light meaning go, even though the word is typically used 
to mean blue. Cheong 청 is also used for both blue and green. It is a 
loan from Chinese (靑, pinyin: qing) and is used in the proper name 
Cheong Wa Dae (청와대 or Hanja: 靑瓦臺), the Blue House, which is the 
executive office and official residence of the President of the Republic 
of Korea.

CJK 靑 is cheng in Korean.
CJK 靑瓦臺 is cheng wa tay in Korean.
Korean 푸르다 is pu reu da.
Korean 푸르다 is pu reu da.
Korean 푸르다 is pu reu da.
Korean 푸른 is pu reun.
Korean 하늘 is ha neul.
Korean 푸른 is pu reun.
Korean 숲 is sup.
Korean 파란 is pa ran.
Korean 파란색 is pa ran saek.
Korean 파랑 is pa rang.
Korean 초록 is cho rok.
Korean 초록색 is cho rok saek.
Korean 녹색 is nok saek.
Korean 청 is cheong.
Korean 청와대 is cheong wa dae.

====JAPANESE====



The Japanese language is written with a combination of three scripts: 
Chinese characters called kanji (漢字), and two syllabic scripts made up 
of modified Chinese characters, hiragana (ひらがな or 平仮名) and 
katakana (カタカナ or 片仮名). The Latin alphabet, rōmaji (ローマ字), is 
also often used in modern Japanese, especially for company names and 
logos, advertising, and when entering Japanese text into a computer.

The main distinction in Japanese accents is between Tokyo-type 	(東京式 	
Tōkyō-shiki?) and Kyoto-Osaka-type 	(京阪式 	Keihan-shiki?), 
though Kyūshū-type dialects form a third, smaller group.

In Japanese, the word for blue (青 ao) is often used for colors that 
English speakers would refer to as green, such as the color of a traffic 
signal meaning "go".

The Japanese word ao (青, n., aoi (青い, adj.)), exactly the same kanji 
character as the Chinese qīng above, can refer to either blue or green 
depending on the situation. Modern Japanese has also adopted the Chinese 
word for green (緑 midori), although this was not always so. Ancient 
Japanese did not have this distinction: the word midori only came into 
use in the Heian period, and at that time (and for a long time 
thereafter) midori was still considered a shade of ao. Educational 
materials distinguishing green and blue only came into use after World 
War II, during the Occupation[citation needed]: thus, even though most 
Japanese consider them to be green, the word ao is still used to 
describe certain vegetables, apples and vegetation. Ao is also the name 
for the color of a traffic light, which is bluer than in English-
speaking countries. However, most other objects—a green car, a green 
sweater, and so forth—will generally be called midori. Japanese people 
also sometimes use the word guriin (グリーン), based on the English word 
"green", for colors. The language also has several other words meaning 
specific shades of green and blue.

CJK 漢字 is kara aza in Japanese.
CJK 平仮名 is taira kari na in Japanese.
CJK 片仮名 is kata kari na in Japanese.
CJK 字 is aza in Japanese.
CJK 東京式 is higashi miyako nori in Japanese.
CJK 京阪式 is miyako saka nori in Japanese.
CJK 青 is ao in Japanese.
CJK 青 is ao in Japanese.
CJK 青 is ao in Japanese.
CJK 緑 is midori in Japanese.

Japanese 漢字 is kanji.
Japanese ひらがな is hiragana.
Japanese 平仮名 is hiragana/hirakana.
Japanese カタカナ is katakana.
Japanese 片仮名 is katakana.
Japanese ローマ字 is roomaaza/azana/ji.
Japanese 東京式 is toukyoushiki.
Japanese 京阪式 is keihanshiki.
Japanese 青 is ao/sei/shou.
Japanese 青 is ao/sei/shou.
Japanese 青い is aoi.
Japanese 緑 is midori/roku/ryoku.
Japanese グリーン is guriin.