NAME
    Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす♥');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

    # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      print old2new_kanji('惡の華');
      # -> 悪の華

DESCRIPTION
    Lingua::JA::NormalizeText normalizes Japanese text by applying a
    configurable sequence of normalization options (see METHODS).

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                 SAMPLE INPUT           OUTPUT FOR SAMPLE INPUT
      ---------------------  ---------------------  -----------------------
      lc                     DdD                    ddd
      uc                     DdD                    DDD
      nfkc                   ㌦                     ドル (length: 2)
      nfkd                   ㌦                     ドル (length: 3)
      nfc
      nfd
      decode_entities        &hearts;               ♥
      strip_html             <em>あ</em>            あ
      alnum_z2h              ＡＢＣ１２３           ABC123
      alnum_h2z              ABC123                 ＡＢＣ１２３
      space_z2h
      space_h2z
      katakana_z2h           ハァハァ               ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ               スーハースーハー
      katakana2hiragana      パンツ                 ぱんつ
      hiragana2katakana      ぱんつ                 パンツ
      wave2tilde             〜, 〰                 ～
      tilde2wave             ～                     〜
      wavetilde2long         〜, 〰, ～             ー
      wave2long              〜, 〰                 ー
      tilde2long             ～                     ー
      fullminus2long         -                     ー
      dashes2long            —                      ー
      drawing_lines2long     ─                      ー
      unify_long_repeats     ヴァーーー             ヴァー
      nl2space               (LF)(CR)(CRLF)         (space)(space)(space)
      unify_nl               (LF)(CR)(CRLF)         \n\n\n
      unify_long_spaces      あ(space)(space)あ     あ(space)あ
      unify_whitespaces      \x{00A0}               (space)
      trim                   (space)あ(space)あ(space)  あ(space)あ
      ltrim                  (space)あ(space)       あ(space)
      rtrim                  ああ(space)(space)     ああ
      old2new_kana           ゐヰゑヱヸヹ           いイえエイ゙エ゙
      old2new_kanji          亞逸鬭                 亜逸闘
      tab2space              (tab)(tab)             (space)(space)
      remove_controls        あ\x{0000}あ           ああ
      remove_spaces          (space)あ(space)あ(space)  ああ
      dakuon_normalize       さ\x{3099}             ざ
      handakuon_normalize    は\x{309A}             ぱ
      all_dakuon_normalize   さ\x{3099}は\x{309A}   ざぱ
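
    The different lengths shown for nfkc and nfkd in the table can be
    reproduced with the core Unicode::Normalize module (listed under SEE
    ALSO). This is a minimal sketch that calls NFKC and NFKD directly; that
    the nfkc and nfkd options delegate to these functions is an assumption,
    not something this document states.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Normalize qw/NFKC NFKD/;

binmode STDOUT, ':encoding(UTF-8)';

my $square_doru = '㌦';    # U+3326 SQUARE DORU

# NFKC: compatibility decomposition, then recomposition (ド + ル)
my $nfkc = NFKC($square_doru);

# NFKD: compatibility decomposition only (ト + U+3099 + ル)
my $nfkd = NFKD($square_doru);

printf "nfkc: %s (length: %d)\n", $nfkc, length $nfkc;    # length: 2
printf "nfkd: %s (length: %d)\n", $nfkd, length $nfkd;    # length: 3
```

    The extra character under NFKD is the combining voiced sound mark
    (U+3099), which stays separate from its base kana.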

    Options are applied in the order in which they appear in @options: the
    first element is applied first, and the last element is applied last.

    You can also pass references to your own functions. (See the
    dearinsu_to_desu function in the SYNOPSIS section.)
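
    Because the steps run first to last, the same set of steps can give
    different results in a different order. The sketch below illustrates
    this in plain Perl; apply_in_order is a hypothetical helper written for
    this example, not part of the module and not its implementation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::JA::NormalizeText):
# applies each code reference to the text, first to last.
sub apply_in_order {
    my ($text, @steps) = @_;
    $text = $_->($text) for @steps;
    return $text;
}

my $lower   = sub { return lc $_[0] };
my $replace = sub { my $t = $_[0]; $t =~ s/ABC/matched/g; return $t };

# Lowercasing runs first, so /ABC/ no longer matches.
my $lower_first = apply_in_order('ABC', $lower, $replace);

# The replacement runs first, then the result is lowercased.
my $replace_first = apply_in_order('ABC', $replace, $lower);

print "$lower_first\n";      # abc
print "$replace_first\n";    # matched
```

    With the module itself, the same consideration applies to the order of
    the elements passed to new().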

  normalize($text)
    Normalizes $text by applying the configured options and returns the
    normalized text.

OPTIONS
  dashes2long
    Note that this option does not convert hyphens into the long vowel mark
    (ー, U+30FC).

  drawing_lines2long
    This option converts drawing lines that look similar to the long vowel
    mark (ー, U+30FC).

  unify_long_spaces
    Note that this option unifies only SPACE (U+0020) and IDEOGRAPHIC
    SPACE (U+3000).

  remove_controls
    Note that this option does not remove the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
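
    The exception list above can be pictured with a single tr///
    statement. This is a minimal sketch: strip_controls_sketch is a
    hypothetical stand-in, and the set of characters treated as "control"
    here (C0, DEL, and C1) is an assumption rather than the module's
    documented definition.

```perl
use strict;
use warnings;

# Hypothetical stand-in for remove_controls. The deleted range is an
# assumption: U+0000-U+0008, U+000B-U+000C, U+000E-U+001F, U+007F-U+009F.
# TAB (U+0009), LF (U+000A), and CR (U+000D) are deliberately skipped.
sub strip_controls_sketch {
    my $text = shift;
    $text =~ tr/\x{00}-\x{08}\x{0B}\x{0C}\x{0E}-\x{1F}\x{7F}-\x{9F}//d;
    return $text;
}

my $cleaned = strip_controls_sketch("a\x{00}b\tc\r\n");
print $cleaned eq "ab\tc\r\n" ? "ok\n" : "not ok\n";
```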

  remove_spaces
    Note that this option removes only SPACE (U+0020) and IDEOGRAPHIC
    SPACE (U+3000).
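
    As an illustration of that restriction, the sketch below deletes only
    those two code points and leaves other whitespace alone.
    remove_spaces_sketch is a hypothetical name used for this example, not
    the module's code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for remove_spaces: deletes SPACE (U+0020) and
# IDEOGRAPHIC SPACE (U+3000) only.
sub remove_spaces_sketch {
    my $text = shift;
    $text =~ tr/\x{0020}\x{3000}//d;
    return $text;
}

# TAB and NO-BREAK SPACE (U+00A0) are left untouched.
my $stripped = remove_spaces_sketch("\x{3000}a b\tc\x{00A0}");
print $stripped eq "ab\tc\x{00A0}" ? "ok\n" : "not ok\n";
```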

  unify_whitespaces
    This option converts the following characters into SPACE (U+0020):

      LINE TABULATION
      FORM FEED
      NEXT LINE
      NO-BREAK SPACE
      OGHAM SPACE MARK
      MONGOLIAN VOWEL SEPARATOR
      EN QUAD
      EM QUAD
      EN SPACE
      EM SPACE
      THREE-PER-EM SPACE
      FOUR-PER-EM SPACE
      SIX-PER-EM SPACE
      FIGURE SPACE
      PUNCTUATION SPACE
      THIN SPACE
      HAIR SPACE
      LINE SEPARATOR
      PARAGRAPH SEPARATOR
      NARROW NO-BREAK SPACE
      MEDIUM MATHEMATICAL SPACE

    Note that this option does not convert the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
      IDEOGRAPHIC SPACE
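
    The two lists above map onto Unicode code points as follows. The
    sketch below transliterates exactly the listed characters to SPACE;
    unify_whitespaces_sketch is a hypothetical name for this example, and
    the module's own implementation may differ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for unify_whitespaces, built from the list above:
#   U+000B LINE TABULATION, U+000C FORM FEED, U+0085 NEXT LINE,
#   U+00A0 NO-BREAK SPACE, U+1680 OGHAM SPACE MARK,
#   U+180E MONGOLIAN VOWEL SEPARATOR, U+2000-U+200A (EN QUAD .. HAIR SPACE),
#   U+2028 LINE SEPARATOR, U+2029 PARAGRAPH SEPARATOR,
#   U+202F NARROW NO-BREAK SPACE, U+205F MEDIUM MATHEMATICAL SPACE
sub unify_whitespaces_sketch {
    my $text = shift;
    $text =~ tr/\x{000B}\x{000C}\x{0085}\x{00A0}\x{1680}\x{180E}\x{2000}-\x{200A}\x{2028}\x{2029}\x{202F}\x{205F}/\x{0020}/;
    return $text;
}

# Converted: NO-BREAK SPACE and EM SPACE become SPACE.
my $converted = unify_whitespaces_sketch("a\x{00A0}b\x{2003}c");

# Not converted: TAB, LF, CR, and IDEOGRAPHIC SPACE stay as they are.
my $untouched = unify_whitespaces_sketch("\t\n\r\x{3000}");

print $converted eq 'a b c'            ? "ok\n" : "not ok\n";
print $untouched eq "\t\n\r\x{3000}"   ? "ok\n" : "not ok\n";
```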

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    新旧字体表 (table of old and new kanji forms):
    <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.