Module Version: 0.81

NAME

Unicode::Property::XS - Unicode properties implemented by lookup table in C code.

SYNOPSIS

  use Unicode::Property::XS qw(:all); # 'ucs_' is the default prefix

  my @property_letters;
  foreach my $ord (0x0000..0x37FF) { 
      push @property_letters, ucs_L($ord);    # /\p{L}/ 
  };
  my @property_list = ucs_EaFullwidth1(0x0000..0x37FF);

  foreach my $ord (0x0000..0x3FFFF) {
      next if !ucs_Legal($ord);
      die "Internal error!" if ucs_M($ord) != ((chr($ord) =~ /\p{M}/) ? 1 : 0);
  }

  my @myChars = qw( a b c d e f g 1 2 3 );
  my @property_list2 = ucs_L( map { ord } @myChars );

  __END__

  #################################
    
  BEGIN { $Unicode::Property::XS::Prefix = 'Is'; }
  use Unicode::Property::XS;
  
  my @property_letters;
  foreach my $ord (0x0000..0x37FF) { 
      push @property_letters, IsL($ord);    # /\p{L}/ 
  };
   
  __END__

   #################################

   use Unicode::Property::XS qw( Legal :EastAsianWidth );
   use Unicode::EastAsianWidth;
   BEGIN { $Unicode::EastAsianWidth::EastAsian = 0; };

   foreach my $ord (0x0000..0xEFFFF) {
       next if !ucs_Legal($ord) ; 
       my $lookup_value = ucs_EaFullwidth0($ord);    # /\p{InFullwidth}/
       my $re_value = (chr($ord) =~ /\p{InFullwidth}/) ? 1 : 0;
       die "Error in Unicode::Property::XS!\n" if $lookup_value != $re_value;
   };

   __END__

DESCRIPTION

Unicode properties in Perl regular expressions are handy, but matching them can be slow when a given word is tested only a few times. So I wrote an XS module that looks properties up in a precomputed table. It implements the properties listed in the "Unicode Character Properties" section of perlunicode and the properties provided by Unicode::EastAsianWidth.

The run-time dynamic library takes about 1.2MB and includes every property class listed below. Please tell me if you have suggestions for splitting the module or saving space.

All the functions except ucs_Legal() work the same way. Each takes the numeric value of a character and returns 1 if the character is in that property class and 0 if it is not. It also returns 0 if the encoding value is illegal (which should not happen if the input value comes from ord($ucs_char)). It returns 15 if the value is in plane 15 and 16 if it is in plane 16; both are user-defined planes.

ucs_Legal() returns 1 if perl will not complain about chr($ucs_ord), and 0 otherwise.
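
The return-value convention described above can be decoded on the caller's side. The following is only a minimal sketch: plane_of() and describe_result() are hypothetical helpers, not part of this module, and the values passed to them stand in for what a ucs_*() property function would return.

```perl
use strict;
use warnings;

# Hypothetical helper: the Unicode plane (0..16) of a code point.
sub plane_of {
    my ($ord) = @_;
    return $ord >> 16;
}

# Hypothetical helper: turn the value returned by a ucs_*() property
# function (other than ucs_Legal) into a readable label.
sub describe_result {
    my ($value) = @_;
    return 'in class'                 if $value == 1;
    return 'user-defined plane 15'    if $value == 15;
    return 'user-defined plane 16'    if $value == 16;
    return 'not in class or illegal';    # the documented 0 case
}

print describe_result(1), "\n";
printf "U+%04X is in plane %d\n", 0xF0000, plane_of(0xF0000);
```

Because 15 and 16 are both true values, a caller that only wants a yes/no answer should compare the result with 1 rather than test it for truth.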

The following functions can be exported to the caller's scope: ucs_Legal(), plus the property functions listed below.

Functions for general properties: ucs_L(), ucs_LC(), ucs_Lu(), ucs_Ll(), ucs_Lt(), ucs_Lm(), ucs_Lo(), ucs_M(), ucs_Mn(), ucs_Mc(), ucs_Me(), ucs_N(), ucs_Nd(), ucs_Nl(), ucs_No(), ucs_P(), ucs_Pc(), ucs_Pd(), ucs_Ps(), ucs_Pe(), ucs_Pi(), ucs_Pf(), ucs_Po(), ucs_S(), ucs_Sm(), ucs_Sc(), ucs_Sk(), ucs_So(), ucs_Z(), ucs_Zs(), ucs_Zl(), ucs_Zp(), ucs_C(), ucs_Cc(), ucs_Cf(), ucs_Cs(), ucs_Co(), ucs_Cn().

Functions for bidirectional properties: ucs_BidiL(), ucs_BidiLRE(), ucs_BidiLRO(), ucs_BidiR(), ucs_BidiAL(), ucs_BidiRLE(), ucs_BidiRLO(), ucs_BidiPDF(), ucs_BidiEN(), ucs_BidiES(), ucs_BidiET(), ucs_BidiAN(), ucs_BidiCS(), ucs_BidiNSM(), ucs_BidiBN(), ucs_BidiB(), ucs_BidiS(), ucs_BidiWS(), ucs_BidiON().

Functions for scripts (the properties PhagsPa and Phoenician are not included, since they are not implemented in /\p{ }/ form): ucs_Arabic(), ucs_Armenian(), ucs_Balinese(), ucs_Bengali(), ucs_Bopomofo(), ucs_Braille(), ucs_Buginese(), ucs_Buhid(), ucs_CanadianAboriginal(), ucs_Cherokee(), ucs_Coptic(), ucs_Cuneiform(), ucs_Cypriot(), ucs_Cyrillic(), ucs_Deseret(), ucs_Devanagari(), ucs_Ethiopic(), ucs_Georgian(), ucs_Glagolitic(), ucs_Gothic(), ucs_Greek(), ucs_Gujarati(), ucs_Gurmukhi(), ucs_Han(), ucs_Hangul(), ucs_Hanunoo(), ucs_Hebrew(), ucs_Hiragana(), ucs_Inherited(), ucs_Kannada(), ucs_Katakana(), ucs_Kharoshthi(), ucs_Khmer(), ucs_Lao(), ucs_Latin(), ucs_Limbu(), ucs_LinearB(), ucs_Malayalam(), ucs_Mongolian(), ucs_Myanmar(), ucs_NewTaiLue(), ucs_Nko(), ucs_Ogham(), ucs_OldItalic(), ucs_OldPersian(), ucs_Oriya(), ucs_Osmanya(), ucs_PhagsPa(), ucs_Phoenician(), ucs_Runic(), ucs_Shavian(), ucs_Sinhala(), ucs_SylotiNagri(), ucs_Syriac(), ucs_Tagalog(), ucs_Tagbanwa(), ucs_TaiLe(), ucs_Tamil(), ucs_Telugu(), ucs_Thaana(), ucs_Thai(), ucs_Tibetan(), ucs_Tifinagh(), ucs_Ugaritic(), ucs_Yi().

Functions for extended properties: ucs_ASCIIHexDigit(), ucs_BidiControl(), ucs_Dash(), ucs_Deprecated(), ucs_Diacritic(), ucs_Extender(), ucs_HexDigit(), ucs_Hyphen(), ucs_Ideographic(), ucs_IDSBinaryOperator(), ucs_IDSTrinaryOperator(), ucs_JoinControl(), ucs_LogicalOrderException(), ucs_NoncharacterCodePoint(), ucs_OtherAlphabetic(), ucs_OtherDefaultIgnorableCodePoint(), ucs_OtherGraphemeExtend(), ucs_OtherIDStart(), ucs_OtherIDContinue(), ucs_OtherLowercase(), ucs_OtherMath(), ucs_OtherUppercase(), ucs_PatternSyntax(), ucs_PatternWhiteSpace(), ucs_QuotationMark(), ucs_Radical(), ucs_SoftDotted(), ucs_STerm(), ucs_TerminalPunctuation(), ucs_UnifiedIdeograph(), ucs_VariationSelector(), ucs_WhiteSpace().

Functions for derived properties: ucs_Alphabetic(), ucs_Lowercase(), ucs_Uppercase(), ucs_Math(), ucs_IDStart(), ucs_IDContinue(), ucs_Any(), ucs_Assigned(), ucs_Unassigned(), ucs_ASCII(), ucs_Common().

Functions for EastAsianWidth: ucs_EaF(), ucs_EaH(), ucs_EaA(), ucs_EaNa(), ucs_EaW(), ucs_EaN(), ucs_EaFullwidth0(), ucs_EaHalfwidth0(), ucs_EaFullwidth1(), ucs_EaHalfwidth1().

InFullwidth and InHalfwidth classify the InEastAsianAmbiguous category differently depending on $Unicode::EastAsianWidth::EastAsian. ucs_EaFullwidth0() and ucs_EaHalfwidth0() represent the InFullwidth and InHalfwidth classes with $Unicode::EastAsianWidth::EastAsian = 0; ucs_EaFullwidth1() and ucs_EaHalfwidth1() represent them with $Unicode::EastAsianWidth::EastAsian = 1. The actual run-time value of $Unicode::EastAsianWidth::EastAsian has no effect on these functions, since the lookup tables are precomputed.
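
To illustrate the difference between the two variants, here is a small sketch that does not use this module at all. It assumes, as I understand Unicode::EastAsianWidth, that ambiguous characters count as fullwidth when the flag is 1 and as halfwidth when it is 0. The %width_class table is a tiny hand-written sample, and is_fullwidth() is a hypothetical stand-in for ucs_EaFullwidth0() and ucs_EaFullwidth1().

```perl
use strict;
use warnings;

# Tiny hand-written sample of East Asian Width classes (not the real table).
my %width_class = (
    0x0041 => 'Na',    # LATIN CAPITAL LETTER A
    0x00A1 => 'A',     # INVERTED EXCLAMATION MARK (ambiguous)
    0x4E00 => 'W',     # CJK UNIFIED IDEOGRAPH-4E00
    0xFF21 => 'F',     # FULLWIDTH LATIN CAPITAL LETTER A
    0xFF71 => 'H',     # HALFWIDTH KATAKANA LETTER A
);

# Hypothetical stand-in: $east_asian plays the role of
# $Unicode::EastAsianWidth::EastAsian (assumed semantics, see above).
sub is_fullwidth {
    my ($ord, $east_asian) = @_;
    my $class = exists $width_class{$ord} ? $width_class{$ord} : 'N';
    return 1 if $class eq 'F' || $class eq 'W';
    return 1 if $class eq 'A' && $east_asian;
    return 0;
}

for my $ord (sort { $a <=> $b } keys %width_class) {
    printf "U+%04X  EastAsian=0: %d  EastAsian=1: %d\n",
        $ord, is_fullwidth($ord, 0), is_fullwidth($ord, 1);
}
```

Only the ambiguous entry changes between the two columns, which is why the module can ship both answers as separate premade tables.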

In my line-wrapping program, using this module cuts the total running time in half compared to the original regex version, i.e. the /\p{ }/ family. Caching the regex results, by contrast, doesn't help much. The Benchmark module, however, shows only a 20%-50% performance difference.
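
A comparison of this kind can be set up with the core Benchmark module. The sketch below is not the author's benchmark: it uses a plain Perl array as a stand-in for the XS lookup table, and letter_by_regex() and letter_by_table() are hypothetical names, so the timings it prints say nothing about this module's actual speed.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @ords = (0x0000 .. 0x37FF);

# Stand-in for the XS lookup table: evaluate /\p{L}/ once per code point.
my @is_letter = map { chr($_) =~ /\p{L}/ ? 1 : 0 } @ords;

# Regex version: match the property every time it is asked for.
sub letter_by_regex {
    my ($ord) = @_;
    return chr($ord) =~ /\p{L}/ ? 1 : 0;
}

# Table version: a single array lookup; values outside the table give 0.
sub letter_by_table {
    my ($ord) = @_;
    return $is_letter[$ord] || 0;
}

# Run each variant for at least one CPU second and print a comparison.
cmpthese(-1, {
    regex => sub { my $n = 0; $n += letter_by_regex($_) for @ords; $n },
    table => sub { my $n = 0; $n += letter_by_table($_) for @ords; $n },
});
```

To measure the module itself, the table variant would call ucs_L($_) instead of letter_by_table($_).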

EXPORT

SEE ALSO

perlunicode, Unicode::EastAsianWidth, http://www.unicode.org/unicode/reports/tr11/, http://unicode.org/Public/UNIDATA/EastAsianWidth.txt

AUTHOR

Mindos Cheng, <mindos@gmail.com>

COPYRIGHT AND LICENSE

Copyright (C) 2008-2009 by Mindos Cheng

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.9 or, at your option, any later version of Perl 5 you may have available.
