NAME

Text::GenderFromName - Guess the gender of an American first name.

SYNOPSIS

    use Text::GenderFromName;

    print gender("Jon");    # prints 'm'

See EXAMPLES for additional uses.

DESCRIPTION

This module provides gender(), which takes a name and returns one of three values: 'm' for male, 'f' for female, or undef for unknown.

CHANGES

Version 0.30 is a significant departure from previous versions. By default, version 0.30 uses the U.S. Social Security Administration's "Most Popular Names of the 1980's" list of 1001 male first names and 1013 female first names. See CAVEATS below for details on this list.

Version 0.30 also allows for arbitrary female and male hashed lists to be provided at run-time, and includes several built-ins to provide matches based on exclusivity, weight, metaphones, and both version 0.20 and version 0.10 regexp-style matching. The user can also specify additional match subroutines and change the match order at run-time.

EXPORT

The single exported function is:

gender ($name [, $looseness])

Returns one of three values: 'm' for male, 'f' for female, or undef for unknown. gender() also accepts a "looseness" level: the higher the looseness value, the broader the match. See THE MATCH LIST below for details.

NON-EXPORT

The non-exported matching subs are:

one_only ($name)

Returns 'm' or 'f' if and only if $name is found in only one of the two lists.

either_weight ($name)

Returns 'm' or 'f' if $name is found in either list. If $name is in both lists, it returns the more heavily weighted of the two.

one_only_metaphone ($name)

Uses Text::DoubleMetaphone for comparison. Returns 'm' or 'f' if and only if the metaphone for $name is found in only one of the two lists.

Note that this function builds a copy of the female/male name lists to speed up the metaphone lookup.

either_weight_metaphone ($name)

Uses Text::DoubleMetaphone for comparison. Returns 'm' or 'f' if the metaphone for $name is found in either list. If it is found in both lists, the function sums the weights of all matching metaphones in each list and returns the gender with the larger total.

Note that this function builds a copy of the female/male name lists to speed up the metaphone lookup.

v2_rules ($name)

Uses Jon Orwant's v0.20 rules for matching.

v1_rules ($name)

Uses Jon Orwant's adaptation of Scott Pakin's awk script from v0.10 for matching.
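
As a rough illustration of the weight summing described for either_weight_metaphone, the sketch below hard-codes the phonetic keys and weights that appear in the debugging output under EXAMPLES. It is not the module's implementation: the real functions compute keys with Text::DoubleMetaphone, the tables here are only a partial subset, and the helper names (summed_weight, sketch_weight_metaphone) are hypothetical.

```perl
use strict;
use warnings;

# Phonetic keys copied from the DEBUG output in EXAMPLES, not computed.
# Assumption: a real run would derive these with Text::DoubleMetaphone.
my %key_for = (
    jonny => 'JN', jenna => 'JN', joanna => 'JN', jenny => 'JN',
    john  => 'JN', juan  => 'JN', johnny => 'JN',
    dondi => 'TNT', dante => 'TNT',
);

# Partial toy lists; weights are also copied from the DEBUG output.
my %females = (jenna => 0.193945, joanna => 0.118652, jenny => 0.104875);
my %males   = (john  => 1.629871, juan   => 0.309234, johnny => 0.127193,
               dante => 0.020568);

# Sum the weights of every listed name that shares the given phonetic key.
sub summed_weight {
    my ($key, $names) = @_;
    my $total = 0;
    for my $listed (keys %$names) {
        $total += $names->{$listed} if ($key_for{$listed} // '') eq $key;
    }
    return $total;
}

# Hypothetical helper: return the heavier side, or undef when nothing matches.
# Ties go to 'm' here, which is an arbitrary choice made for this sketch.
sub sketch_weight_metaphone {
    my ($name) = @_;
    my $key = $key_for{ lc $name } or return;
    my $f = summed_weight($key, \%females);
    my $m = summed_weight($key, \%males);
    return if $f == 0 && $m == 0;
    return $f > $m ? 'f' : 'm';
}

print sketch_weight_metaphone('Jonny'), "\n";    # prints 'm'
```

With these partial tables, the female names sharing the key JN sum to about 0.417 and the male names to about 2.066, so "Jonny" resolves to 'm', in line with the debugging output.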

If you wish to use your own hash refs containing names and weights, you should explicitly import:

gender_init ($female_names_ref, $male_names_ref)

Initializes the male and female hashes. The package calls gender_init() internally; when called without arguments, it uses the table provided by the U.S. Social Security Administration. Call this function only if you want to override the supplied lists. See THE FEMALE/MALE HASHES below for details.

THE MATCH LIST

@MATCH_LIST contains the list of subs gender will use to determine the gender of a given name.

By default, @MATCH_LIST holds 6 items, corresponding to the non-exported functions above. Put strictly matching subs first and loosely matching subs last: gender() iterates over the list from index 0 up to the specified looseness value or the number of subs in @MATCH_LIST, whichever comes first.

You may override this like so:

    @Text::GenderFromName::MATCH_LIST = ('main::my_matching_routine');
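
The interaction between list order and looseness can be pictured with a small standalone sketch. It does not use the module: the matchers are toy code refs (the real @MATCH_LIST holds sub names), sketch_gender is a hypothetical name, and both the default looseness of 1 and the exact stopping point are assumptions inferred from the EXAMPLES output rather than taken from the module source.

```perl
use strict;
use warnings;

# Toy matchers, ordered strictest first (all hypothetical).
my @match_list = (
    sub { $_[0] eq 'josephine' ? 'f' : undef },    # strict: exact name
    sub { $_[0] =~ /^mich/     ? 'm' : undef },    # looser: prefix
    sub { $_[0] =~ /a$/        ? 'f' : undef },    # loosest: ending
);

# Try the first $looseness matchers in order and stop at the first hit.
sub sketch_gender {
    my ($name, $looseness) = @_;
    $looseness = 1 unless defined $looseness;      # assumed default
    $name = lc $name;

    # Never run past the end of the list, however large the looseness.
    my $count = $looseness < @match_list ? $looseness : scalar @match_list;
    for my $i (0 .. $count - 1) {
        my $hit = $match_list[$i]->($name);
        return $hit if defined $hit;
    }
    return;    # no matcher hit: unknown
}

print sketch_gender('Michael')    // 'undef', "\n";    # prints 'undef'
print sketch_gender('Michael', 2) // 'undef', "\n";    # prints 'm'
```

Under these assumptions, a strict call stops after the first matcher, while a large looseness such as 9 simply runs every matcher in the list.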

THE FEMALE/MALE HASHES

By default, these hashes are built using data from the U.S. SSA. You may override them by calling gender_init() with your own female and male hash refs, like so:

    use Text::GenderFromName qw( :DEFAULT &gender_init );

    my %females = ('barbly' => 4.1, 'bar' => 2.3, ...);
    my %males   = ('foobly' => 4.5, 'foo' => 1.3, ...);

    &gender_init(\%females, \%males);

The hash keys are lowercase names, and their values are their relative weights. This allows for names that could be male or female, but are more often one or the other.
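
To make the role of the weights concrete, here is a minimal standalone sketch of the two list-based checks described under NON-EXPORT. It is not the module's code: the sub names are hypothetical, and the toy data puts 'foo' in both lists purely to show how a shared name is resolved.

```perl
use strict;
use warnings;

# Toy lists with lowercase keys; 'foo' appears in both on purpose.
my %females = (barbly => 4.1, foo => 2.3);
my %males   = (foobly => 4.5, foo => 1.3);

# Match only when the name is found in exactly one list.
sub sketch_one_only {
    my $name = lc $_[0];
    my $in_f = exists $females{$name};
    my $in_m = exists $males{$name};
    return 'f' if $in_f && !$in_m;
    return 'm' if $in_m && !$in_f;
    return;    # in both lists or in neither: unknown
}

# Match when the name is in either list; the heavier weight wins.
# Ties go to 'm' here, an arbitrary choice made for this sketch.
sub sketch_either_weight {
    my $name = lc $_[0];
    my $f = $females{$name} // 0;
    my $m = $males{$name}   // 0;
    return if $f == 0 && $m == 0;
    return $f > $m ? 'f' : 'm';
}

print sketch_one_only('foo')      // 'undef', "\n";    # prints 'undef'
print sketch_either_weight('Foo') // 'undef', "\n";    # prints 'f'
```

Lowercasing the lookup key mirrors the requirement that hash keys be lowercase names.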

EXAMPLES

Very strict usage:

    use Text::GenderFromName;

    my @names = ('Josephine', 'Michael', 'Dondi', 'Jonny',
                 'Pascal', 'Velvet', 'Eamon', 'FLKMLKSJN');

    for (@names) {
        # Use strict matching
        my $gender = &gender($_) || '';

        if    ($gender eq 'f') { print "$_: Female\n" }
        elsif ($gender eq 'm') { print "$_: Male\n"   }
        else                   { print "$_: UNSURE\n" }
    }

prints:

    Josephine: Female
    Michael: UNSURE
    Dondi: UNSURE
    Jonny: UNSURE
    Pascal: UNSURE
    Velvet: UNSURE
    Eamon: UNSURE
    FLKMLKSJN: UNSURE

Loose matching:

    for (@names) {
        # Use loose matching
        my $gender = &gender($_, 9) || '';
    ...

prints:

    Josephine: Female
    Michael: Male
    Dondi: Male
    Jonny: Male
    Pascal: Male
    Velvet: Female
    Eamon: UNSURE
    FLKMLKSJN: UNSURE

Turn on debugging:

    $Text::GenderFromName::DEBUG = 1;

With loose matching, this prints:

    Matching "josephine":
            one_only...
            ==> HIT (f)

    Matching "michael":
            one_only...
            either_weight...
            F: 0.0271266376105491, M: 3.4091409099979
            ==> HIT (m)

    Matching "dondi":
            one_only...
            either_weight...
            one_only_metaphone...
            M: dondi => dante => TNT: 0.020568
            ==> HIT (m)

    Matching "jonny":
            one_only...
            either_weight...
            one_only_metaphone...
            F: jonny => jenna => JN: 0.193945
            M: jonny => john => JN: 1.629871
            either_weight_metaphone...
            F: jonny => jenna => JN: 0.193945
            F: jonny => joanna => JN: 0.118652
            F: jonny => jenny => JN: 0.104875
            ...
            M: jonny => john => JN: 1.629871
            M: jonny => juan => JN: 0.309234
            M: jonny => johnny => JN: 0.127193
            ...
            ==> HIT (m)

    Matching "pascal":
            one_only...
            either_weight...
            one_only_metaphone...
            either_weight_metaphone...
            v2_rules...
            ==> HIT (m)

    Matching "velvet":
            one_only...
            either_weight...
            one_only_metaphone...
            either_weight_metaphone...
            v2_rules...
            v1_rules...
            ==> HIT (f)

    Matching "eamon":
            one_only...
            either_weight...
            one_only_metaphone...
            either_weight_metaphone...
            v2_rules...
            v1_rules...

    Matching "flkmlksjn":
            one_only...
            either_weight...
            one_only_metaphone...
            either_weight_metaphone...
            v2_rules...
            v1_rules...

    Josephine: Female
    Michael: Male
    Dondi: Male
    Jonny: Male
    Pascal: Male
    Velvet: Female
    Eamon: UNSURE
    FLKMLKSJN: UNSURE

Add your own match sub:

    push @Text::GenderFromName::MATCH_LIST, 'main::eamon_hack';

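    sub eamon_hack {
        my $name = shift;
        # Names are lowercased before matching (see the debug output above)
        return 'm' if $name =~ /^eamon/;
        return;    # no match: explicitly return nothing
    }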

prints:

    ...
    Matching "eamon":
            one_only...
            either_weight...
            one_only_metaphone...
            either_weight_metaphone...
            v2_rules...
            v1_rules...
            main::eamon_hack...
            ==> HIT (m)

    Eamon: Male

Don't use metaphones:

    @Text::GenderFromName::MATCH_LIST =
      grep !/metaphone/, @Text::GenderFromName::MATCH_LIST;

Use your own female/male hash lists:

    use Text::GenderFromName qw( :DEFAULT &gender_init );

    my %females = ('josephine' => 2.1);
    my %males = ('dondi' => 4.5);
    &gender_init(\%females, \%males);

Use female/male hash lists from a database:

    use Text::GenderFromName qw( :DEFAULT &gender_init );

    use Tie::RDBM;
    tie my %females, 'Tie::RDBM', {db       => 'mysql:common',
                                   table    => 'females',
                                   key      => 'name',
                                   value    => 'weight'};
    tie my %males,   'Tie::RDBM', {db       => 'mysql:common',
                                   table    => 'males',
                                   key      => 'name',
                                   value    => 'weight'};
    &gender_init(\%females, \%males);

COMPATIBILITY

To run v0.30 in a (mostly) backward compatible mode, override the MATCH_LIST like so:

    @Text::GenderFromName::MATCH_LIST = ('v2_rules', 'v1_rules');

and set the looseness to any value greater than 1:

    &gender($_, 9);

Note that v0.30 uses significantly different lists than before. If you'd like to use the v0.20 name lists, you may download a previous version of Text::GenderFromName, cut out the hashes, and pass them to &gender_init(). To keep this module small, those lists are not included.

CAVEATS

REGARDING THIS MODULE

Rules are now case-insensitive, which is a departure from earlier versions of this module. Also, Orwant's v0.20 rules no longer fall through, though v0.10's do.

Version 0.30 was a complete overhaul by someone who's never submitted a module to CPAN before. Please consider this fact when using the Text::GenderFromName module in a production environment.

Also note that the matching routines in this module are strongly biased toward American first names. None of the methods included in this module correctly identifies the v0.30 author's gender (m) from his first name (Eamon).

REGARDING THE DEFAULT LIST

From http://www.ssa.gov/OACT/babynames/1999/top1000of80s.html:

"The data comes from a 5% sampling of Social Security card applications with dates of birth from January 1980 through December 1989."

"All names which occurred at least five times in the sample are included in the table below. The total number of males in the sample is 977,255 and the total number of females is 936,349. Criteria to be included in the sample is simply that a Social Security card application was filed, that the year of birth was between 1980 and 1989, and that the birth was on US soil. As always each unique spelling is considered a unique name. It may be appropriate for purposes of ranking popularity of names to combine similar spellings of the same name. This kind of grouping, however, is subjective and time consuming, and is beyond the scope of this document. The 2000 edition of the World Almanac lists the top 10 names of each decade based on this data after combining different spellings of the same name."

"No effort has been made to edit the data and as a result some coding errors are obvious. For example initials like "A" are included in the lists. Another common problem, especially for the earlier decades is females coded as being male. For example Jessica is the ranked 647 among male names. Finally entries like "Unknown" and "Baby" are not removed from the lists."

REGARDING HENRY

m (0.111843889261247)

BUGS

Did I mention this module doesn't match the v0.30 author's name?

AUTHOR

Originally by Jon Orwant <orwant@readable.com>, v0.30 by Eamon Daly <eamon@eamondaly.com>.

This is an adaptation of an August 1991 awk script by Scott Pakin, published in the December 1991 issue of Computer Language Monthly.

Small contributions by Andrew Langmead and John Strickler. Thanks to Bob Baldwin, Matt Bishop, Daniel Klein, and the U.S. SSA for their lists of names.

SEE ALSO

Text::DoubleMetaphone