NAME

Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS

  use Lingua::JA::NormalizeText;
  use utf8;

  my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
  my $normalizer = Lingua::JA::NormalizeText->new(@options);

  print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
  # -> 鳥がトンドルです♥

  sub dearinsu_to_desu
  {
      my $text = shift;
      $text =~ s/でありんす/です/g;

      return $text;
  }

  # or

  use Lingua::JA::NormalizeText qw/nfkc decode_entities/;
  use utf8;

  my $text = '㈱㋰㋫㋫&hearts;';
  print decode_entities( nfkc($text) );
  # -> (株)ムフフ♥

DESCRIPTION

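Lingua::JA::NormalizeText normalizes Japanese text by applying an ordered list of normalization functions, such as Unicode normalization, HTML entity decoding, and full-width/half-width conversion.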

METHODS

new(@options)

Creates a new Lingua::JA::NormalizeText instance.

The following options are available.

  OPTION                 SAMPLE INPUT        OUTPUT FOR SAMPLE INPUT
  ---------------------  ------------------  -----------------------
  lc                     DdD                 ddd
  uc                     DdD                 DDD
  nfkc                   ㌦                  ドル (length: 2)
  nfkd                   ㌦                  ドル (length: 3)
  nfc
  nfd
  decode_entities        &hearts;            ♥
  strip_html             <em>あ</em>         あ
  alnum_z2h              ＡＢＣ１２３        ABC123
  alnum_h2z              ABC123              ＡＢＣ１２３
  space_z2h
  space_h2z
  katakana_z2h           ハァハァ            ﾊｧﾊｧ
  katakana_h2z           ｽｰﾊｰｽｰﾊｰ            スーハースーハー
  katakana2hiragana      パンツ              ぱんつ
  hiragana2katakana      ぱんつ              パンツ
  unify_3dots            はぁ。。。          はぁ…
  wave2tilde             〜                  ～
  tilde2wave             ～                  〜
  wavetilde2long         〜, ～              ー
  wave2long              〜                  ー
  tilde2long             ～                  ー
  fullminus2long         −                   ー
  dashes2long            —                   ー
  drawing_lines2long     ─                   ー
  unify_long_repeats     ヴァーーー          ヴァー
  nl2space               (new line)          (space)
  unify_long_spaces      (space)(space)      (space)
  remove_head_space      (space)あ(space)あ  あ(space)あ
  remove_tail_space      ああ(space)(space)  ああ
  old2new_kana           ゐヰゑヱ            いイえエ
  old2new_kanji          亞逸鬭              亜逸闘
  tab2space              (tab)(tab)          (space)(space)
  remove_controls        あ\x{0000}あ        ああ
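
The length annotations on the nfkc and nfkd rows can be reproduced with the core Unicode::Normalize module. The following is a minimal sketch that assumes the nfkc and nfkd options correspond to the standard Unicode NFKC and NFKD forms; it does not call Lingua::JA::NormalizeText itself.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC NFKD);

binmode STDOUT, ':encoding(UTF-8)';

my $text = '㌦';    # a single compatibility character

# NFKC decomposes the character, then recomposes: ド + ル
my $nfkc = NFKC($text);

# NFKD leaves the result decomposed: ト + combining voiced sound mark + ル
my $nfkd = NFKD($text);

printf "NFKC: %s (length: %d)\n", $nfkc, length $nfkc;
printf "NFKD: %s (length: %d)\n", $nfkd, length $nfkd;
```

Both results usually render the same on screen, which is why the table notes the lengths: the NFKD form keeps the voiced sound mark as a separate combining character.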

Options are applied in the order in which they appear in @options: the first element is applied first, and the last element is applied last.
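
Because the order is significant, the same set of steps can give different results when rearranged. The sketch below illustrates this with plain code references and the core Unicode::Normalize module. The helpers apply_in_order and doru_to_sign are hypothetical names made up for this illustration, and the code is not taken from the module's implementation.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw(NFKC);

binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: apply each code reference to the text in list order.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

# Hypothetical custom step: replace the katakana word ドル with a dollar sign.
sub doru_to_sign {
    my $text = shift;
    $text =~ s/ドル/\$/g;
    return $text;
}

# Stand-in for an nfkc step (assumes standard Unicode NFKC).
my $nfkc = sub { return NFKC($_[0]) };

# NFKC first: ㌦ becomes ドル, which the custom step can then match.
my $first = apply_in_order('㌦', $nfkc, \&doru_to_sign);

# Custom step first: it sees only ㌦, finds nothing to replace,
# and NFKC then expands the character to ドル.
my $second = apply_in_order('㌦', \&doru_to_sign, $nfkc);

print "$first\n";     # $
print "$second\n";    # ドル
```

With the module itself, the equivalent choice is the position of each option name or code reference in the list passed to new.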

You can also add your own functions by passing code references. (See the dearinsu_to_desu function in the SYNOPSIS section.)

remove_controls

Note that this option does not remove the following characters:

  CHARACTER TABULATION (tab)
  LINE FEED (LF)
  CARRIAGE RETURN (CR)
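
The behavior described above can be approximated with a single substitution. This is only a sketch: it assumes that "controls" means the Unicode Control (Cc) general category, which this documentation does not state, and strip_controls_sketch is a hypothetical name rather than a function of the module.

```perl
use strict;
use warnings;

# Hypothetical stand-in for remove_controls: delete characters in the
# Unicode Control (Cc) category, but keep tab, line feed, and carriage return.
sub strip_controls_sketch {
    my $text = shift;
    # The negative lookahead skips the three characters that must survive.
    $text =~ s/(?![\t\n\r])\p{Cc}//g;
    return $text;
}

# NUL and ESC are removed; tab, CR, and LF are kept.
my $cleaned = strip_controls_sketch("a\x{0000}b\tc\x{001B}\r\n");
# $cleaned is now "ab\tc\r\n"
```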

normalize($text)

Normalizes $text and returns the normalized text.

AUTHOR

pawa <pawapawa@cpan.org>

SEE ALSO

新旧字体表 (table of old and new character forms): http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.