
Lingua::JA::FindDates - scan text to find Japanese dates

Find and replace Japanese dates in a string.
use Lingua::JA::FindDates 'subsjdate';
Given a string, find and substitute all the Japanese dates in it.
my $dates = '昭和41年三月16日';
print subsjdate ($dates);
prints
March 16, 1966
The module also finds and substitutes dates embedded in a longer string:
my $dates = 'blah blah blah 三月16日';
print subsjdate ($dates);
prints
blah blah blah March 16
It can call back a routine each time a date is found:
sub replace_callback
{
    my ($data, $before, $after) = @_;
    print "$before was replaced by $after\n";
}
my $dates = '三月16日';
my $data = 'xyz'; # something to send to replace_callback
subsjdate ($dates, {replace => \&replace_callback, data => $data});
prints
三月16日 was replaced by March 16
Supply your own routine to format the date however you like:
sub my_date
{
    my ($date) = @_;
    # Join the month and the day of the month with a slash.
    return join '/', $date->{month}, $date->{date};
}
my $dates = '三月16日';
print subsjdate ($dates, {make_date => \&my_date});
prints
3/16

This module uses a set of regular expressions to detect Japanese-style dates in a string. Dates include year/month/day-style dates such as 平成20年七月十日 (Heisei nijuunen shichigatsu tooka), but may also include combinations such as years alone, years and months, a month and day without a year, fiscal years, parts of the month like 中旬 (chuujun), and periods between two dates.
This module has been road-tested on hundreds of documents, and it can cope with virtually any kind of common Japanese date. If you find any date which it can't cope with, please report that as a bug.
If you would like to see more examples of how this module works, look at the testing code in t/Lingua-JA-FindDates.t.
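As a rough illustration of the era-year handling described above, the following is a minimal sketch and not the module's own code. The helper name era_year_to_gregorian and the table %era_offset are hypothetical, and the sketch reads ASCII digits only, whereas the module also copes with kanji numbers.

```perl
use strict;
use warnings;
use utf8;

# Offsets chosen so that Gregorian year = offset + era year.
my %era_offset = (
    '明治' => 1867,    # Meiji 1 is 1868
    '大正' => 1911,    # Taisho 1 is 1912
    '昭和' => 1925,    # Showa 1 is 1926
    '平成' => 1988,    # Heisei 1 is 1989
);
my $eras = join '|', keys %era_offset;

# Hypothetical helper, not part of Lingua::JA::FindDates.
# Returns the Gregorian year for the first "era + year" found in $text,
# or nothing if there is no such year.
sub era_year_to_gregorian
{
    my ($text) = @_;
    if ($text =~ /($eras)\s*([0-9]+)\s*年/) {
        return $era_offset{$1} + $2;
    }
    return;
}

print era_year_to_gregorian ('昭和41年'), "\n";        # 1966
print era_year_to_gregorian ('平成20年7月3日'), "\n";  # 2008
```

The table starts at Meiji only to mirror the range this documentation says the module covers.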
This module exports one function, subsjdate, on request.
kanji2number is a very simple kanji number converter. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers with or without the kanji for ten, hundred, and thousand. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number.
This function is not exported.
kanji2number only goes up to thousands, because dates usually only go that far. If you need a comprehensive Japanese number converter, use Lingua::JA::Numbers instead. Also, kanji2number doesn't deal with mixed kanji and Arabic numbers.
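The behaviour described for kanji2number can be sketched as follows. This is an independent illustration under stated assumptions, not the module's implementation: the name kanji_to_number is hypothetical, and treating 〇 as a zero digit is an assumption that the text above does not confirm.

```perl
use strict;
use warnings;
use utf8;

my %digit = (
    '〇' => 0, '一' => 1, '二' => 2, '三' => 3, '四' => 4,
    '五' => 5, '六' => 6, '七' => 7, '八' => 8, '九' => 9,
);
my %unit = ('十' => 10, '百' => 100, '千' => 1000);

# Hypothetical stand-in for kanji2number; returns 0 for unreadable input.
sub kanji_to_number
{
    my ($kanji) = @_;
    # Anything other than kanji digits and units is unreadable.
    return 0 unless $kanji =~ /^[〇一二三四五六七八九十百千]+$/;
    if ($kanji !~ /[十百千]/) {
        # Positional style, such as 二〇〇八: read digit by digit.
        my $number = 0;
        $number = $number * 10 + $digit{$_} for split //, $kanji;
        return $number;
    }
    # Unit style, such as 千九百六十六: a digit multiplies the unit after it,
    # and a unit with no digit before it counts once.
    my ($total, $pending) = (0, 0);
    for my $char (split //, $kanji) {
        if (exists $unit{$char}) {
            $total += ($pending || 1) * $unit{$char};
            $pending = 0;
        }
        else {
            $pending = $digit{$char};
        }
    }
    return $total + $pending;
}

print kanji_to_number ('三十一'), "\n";        # 31
print kanji_to_number ('千九百六十六'), "\n";  # 1966
print kanji_to_number ('二〇〇八'), "\n";      # 2008
```

Like the function it imitates, the sketch stops at thousands and rejects input that mixes kanji with Arabic numerals.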
make_date is the default date-making routine. It is not exported, because the user is expected to substitute his or her own routine.
subsjdate, given a date like 平成20年7月3日(木), passes make_date a hash reference with values (year => 2008, month => 7, date => 3, wday => 4) for the year, month, date and day of the week. make_date returns a string, 'Thursday, July 3, 2008'. If some parts of the date are not given, for example in a date like 7月3日 (3rd July), the hash values for the unknown parts, such as year or weekday, will be undefined.
You can use any other format for the date by supplying a make_date callback to subsjdate.
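Because some fields may be missing, a replacement routine needs to test each one before using it. The following is a minimal sketch of such a routine. The name iso_like_date and its output format are illustrative choices, and only the hash keys come from the description above.

```perl
use strict;
use warnings;

# Hypothetical make_date replacement: prints only the fields that are
# present, zero-padded and joined with hyphens.
sub iso_like_date
{
    my ($date) = @_;
    my @parts;
    push @parts, sprintf ('%04d', $date->{year})  if $date->{year};
    push @parts, sprintf ('%02d', $date->{month}) if $date->{month};
    push @parts, sprintf ('%02d', $date->{date})  if $date->{date};
    return join '-', @parts;
}

# Full date, as for 平成20年7月3日(木).
print iso_like_date ({year => 2008, month => 7, date => 3, wday => 4}), "\n";
# 2008-07-03

# Month and day only, as for 7月3日.
print iso_like_date ({month => 7, date => 3}), "\n";
# 07-03
```

A routine like this would be supplied as {make_date => \&iso_like_date} in the second argument to subsjdate.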
make_date_interval is called when an interval between two dates, such as 平成3年7月2日〜9日, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as the one passed to make_date.
This function is not exported. It is the default used by subsjdate. You can use another function instead by supplying it as the make_date_interval callback to subsjdate.
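A custom interval routine might look like the sketch below. The names my_interval and slash_date are hypothetical. The sketch also assumes that the second hash may lack fields that are written only once in the text (the year and month in 平成3年7月2日〜9日), which the description above does not state, so check what the module actually passes before relying on it.

```perl
use strict;
use warnings;

# Hypothetical helper: join whichever of year, month and date are defined.
sub slash_date
{
    my ($date) = @_;
    return join '/', grep { defined } @{$date}{qw(year month date)};
}

# Hypothetical make_date_interval replacement.
sub my_interval
{
    my ($first, $second) = @_;
    # Assumption: a year or month missing from the second date is the
    # same as in the first date.
    my %end = %$second;
    for my $key (qw(year month)) {
        $end{$key} = $first->{$key} unless defined $end{$key};
    }
    return slash_date ($first) . ' - ' . slash_date (\%end);
}

print my_interval ({year => 1991, month => 7, date => 2}, {date => 9}), "\n";
# 1991/7/2 - 1991/7/9
```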
If you want to see what the module is doing, set
$Lingua::JA::FindDates::verbose = 1;
This makes subsjdate print out each regular expression and reports whether it matched, which looks like this:
Looking for y in ([０-９0-9]{4}|[十六七九五四千百二一八三]?千[十六七九五四千百二一八三]*)\s*年
Found '千九百六十六年': Arg 0: 1966 -> '1966'
subsjdate, given a string (argument 1) containing some text like 平成20年7月3日(木), looks through the string using a set of regular expressions. If it finds a date, it calls make_date to make the equivalent date in English, and then substitutes it into $text:
$text =~ s/平成20年7月3日\(木\)/Thursday, July 3, 2008/g;
Users can supply a different date making function. See below.
The first argument, $text, is a string encoded in Perl's internal encoding.
The optional second argument, the hash reference $callbacks, can take the following items:
If there is a replace value in the callbacks, subsjdate calls it as a subroutine with the data in $callbacks->{data} and the before and after strings.
The data value is any data you want to pass to the replace callback.
The make_date value is a replacement for the make_date function. If you don't need to replace the default (that is, if you want American-style dates), you can leave this out. If, for example, you want dates in the form "Th 2008/7/3", you could write a routine like the following:
sub mymakedate
{
    my ($date) = @_;
    # Index 0 is a placeholder, so that a wday of 1 gives 'Mo'.
    return (qw{Bad Mo Tu We Th Fr Sa Su})[$date->{wday}].' '.
        $date->{year}.'/'.$date->{month}.'/'.$date->{date};
}
Note that a real routine needs to check whether the hash values for year, month, date, and wday are missing (undefined or zero), since subsjdate also matches dates which consist only of a month and day, or only of a year and month.
The make_date_interval value is a replacement for the make_date_interval function. Its arguments are two dates.
The module does not detect that dates like 昭和百年 (Showa 100, an impossible year) are invalid.
The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.
The dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Basically the longer forms of dates are searched for before the shorter ones.
This module only understands Japanese encoded in Perl's internal form (UTF-8).
If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it.
The date 元日 (ganjitsu, another way to write "1st January") is not handled. It is a little difficult, since the characters which make it up could also occur in other contexts, like 元日本軍 (gennihongun), "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do.
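On the encoding requirement noted in the list above: text read as raw bytes from a file or a socket usually has to be decoded into Perl characters before it is handed to subsjdate. The following is a minimal sketch using the core Encode module. It assumes the input bytes are UTF-8, and it reads "Perl's internal form" as meaning decoded character strings.

```perl
use strict;
use warnings;
use Encode 'decode';

# The UTF-8 bytes for 三月16日, as they might arrive from outside the program.
my $bytes = "\xe4\xb8\x89\xe6\x9c\x88" . "16" . "\xe6\x97\xa5";

# Decode the bytes into a string of characters.
my $text = decode ('UTF-8', $bytes);

print length ($bytes), " bytes, ", length ($text), " characters\n";
# 11 bytes, 5 characters
```

String literals written directly in the program source need the utf8 pragma (use utf8;) instead, so that Perl reads them as characters.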

Ben Bullock, benkasminbullock@gmail.com

These other modules might be more suitable for some purposes:
This does the minimal stuff to make a Japanese date. One of those modules which has been made for completeness rather than for usefulness, it doesn't represent Japanese language usages very well, failing to contain Japanese eras, kanji numbers, wide numbers, etc.
This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times. However, the module seems to be broken - it doesn't install on any system I've tried.
Lingua::JA::Numbers has a very full set of kanji/numeral converters. It converts numbers including decimal points and numbers into the billions and trillions.
DateTime::Calendar::Japanese::Era contains a full set of Japanese eras.

Copyright (C) 2008 Ben Kasmin Bullock.
This module is distributed under the same terms as Perl itself, either Perl version 5.10.0 or, at your option, any later version of Perl 5 you may have available.