
Module Version: 1.0

NAME

Lingua::EN::Gender - Guesses author's gender by analyzing text.

SYNOPSIS

  use Lingua::EN::Gender;

  my $text = "These are the days that try men's souls.";

  my $lingua = Lingua::EN::Gender->new($text);
  print $lingua->gender . "\n";  # male

ABSTRACT

  Lingua::EN::Gender guesses an author's gender by analyzing text
  using the Koppel-Argamon algorithm.

DESCRIPTION

This Perl module implements the Koppel-Argamon algorithm for guessing an author's gender. The algorithm was invented by Moshe Koppel (Bar-Ilan University, Israel) and Shlomo Argamon (Illinois Institute of Technology), and is described at:

  http://www.nytimes.com/2003/08/10/magazine/10wwln-test.html

ALGORITHM

Count the number of words in the document.

For each occurrence of the following words or word types, add the points indicated (a negative value subtracts points):

  "the"                    17
  "a"                       6
  "some"                    6
  number                    5
  "it"                      2
  "with"                  -14

  possessives,
    ending in "'s"         -5
    pronouns               -3

  "for"                    -4
  "not"                    -4
  word ending with "n't"    4

If the total score is greater than the total number of words, the author is probably male. Otherwise, the author is probably female.
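
The scoring rule above can be sketched in a few lines of Perl. This is only an illustration, not the module's actual source: the names score_text and guess_gender are hypothetical, the list of possessive pronouns is an assumption, only digit strings are treated as numbers, and hyphenated words are not split.

```perl
use strict;
use warnings;

# Weights for individual words, copied from the table above.
my %WORD_POINTS = (
    the  => 17,  a   => 6,  some => 6,  it => 2,
    with => -14, for => -4, not  => -4,
);

# Assumed list of possessive pronouns; the module's own list may differ.
my %POSSESSIVE_PRONOUN = map { $_ => 1 }
    qw(my your his her its our their mine yours hers ours theirs);

# Hypothetical helper: returns (score, word_count) for a string.
sub score_text {
    my ($text) = @_;
    my ($score, $words) = (0, 0);
    for my $token (split ' ', lc $text) {
        # Trim leading and trailing punctuation, keeping apostrophes.
        $token =~ s/^[^a-z0-9']+|[^a-z0-9']+$//g;
        next unless length $token;
        $words++;
        if    (exists $WORD_POINTS{$token})          { $score += $WORD_POINTS{$token} }
        elsif ($token =~ /^\d+$/)                    { $score += 5 }  # digits only (assumption)
        elsif ($POSSESSIVE_PRONOUN{$token})          { $score -= 3 }
        elsif ($token =~ /n't$/)                     { $score += 4 }
        elsif ($token =~ /'s$/ && $token ne "it's")  { $score -= 5 }
    }
    return ($score, $words);
}

# Hypothetical helper: applies the "score greater than word count" rule.
sub guess_gender {
    my ($score, $words) = score_text(shift);
    return $score > $words ? 'male' : 'female';
}
```

For the sentence in the SYNOPSIS, this sketch scores "the" (17) and "men's" (-5) for a total of 12 over 8 words, so it guesses male.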

IMPLEMENTATION

The algorithm is fairly straightforward, although there are a few twists and turns. My implementation does the following:

  * Counts hyphenated words as two words.

  * Knows that "it's" is not a possessive pronoun.

  * Recognizes "fourty," a common misspelling of "forty."
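
As a rough illustration of the first two points, the sketch below counts words with a hyphen treated as a word break and tells "'s" possessives apart from the contraction "it's". The helper names count_words and is_possessive are hypothetical and are not part of the module's interface; is_possessive assumes the word has already had surrounding punctuation removed.

```perl
use strict;
use warnings;

# Hypothetical helper: counts words, treating whitespace and hyphens
# as separators, so "well-known" counts as two words.
sub count_words {
    my ($text) = @_;
    # Keep only fragments that contain at least one letter or digit.
    my @words = grep { /[A-Za-z0-9]/ } split /[\s\-]+/, $text;
    return scalar @words;
}

# Hypothetical helper: true for words ending in "'s", except "it's",
# which is a contraction rather than a possessive.
sub is_possessive {
    my ($word) = @_;
    return 0 if lc($word) eq "it's";
    return $word =~ /'s$/i ? 1 : 0;
}
```

With these definitions, count_words("a well-known author") is 4, is_possessive("men's") is true, and is_possessive("It's") is false.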

The biggest complication in my implementation is how it handles numbers. A run of consecutive number words is scored as a single number, even though each word in the run still counts toward the word total. For example:

  one hundred

is counted as one number (with a score of 5) and two words. My implementation does not handle the following situation correctly:

  First one.  Two next.

It counts this as one number (score 5) and four words, even though it should be two numbers (score 10) and four words, because the run of numbers is not broken at the sentence boundary. It wouldn't be that difficult to handle these types of situations, but I was lazy, and I don't think it will make much of a difference. Maybe in the next version.
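
To make the number handling concrete, here is a minimal sketch of run counting, with an optional flag that also ends a run at sentence-final punctuation. It is not the module's code: count_numbers, its $break_at_sentences flag, and the short list of number words are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Assumed, deliberately short list of number words; a real list would be longer.
my %IS_NUMBER = map { $_ => 1 }
    qw(one two three four five six seven eight nine ten
       forty fourty hundred thousand);

# Hypothetical helper: counts runs of consecutive numbers. When
# $break_at_sentences is true, a token ending in ".", "!" or "?"
# also ends the current run.
sub count_numbers {
    my ($text, $break_at_sentences) = @_;
    my ($count, $in_run) = (0, 0);
    for my $token (split ' ', lc $text) {
        my $ends_sentence = $token =~ /[.!?]$/;
        (my $word = $token) =~ s/[^a-z0-9]//g;   # strip punctuation
        if ($IS_NUMBER{$word} || $word =~ /^\d+$/) {
            $count++ unless $in_run;             # only the first number in a run scores
            $in_run = 1;
        }
        else {
            $in_run = 0;
        }
        $in_run = 0 if $break_at_sentences && $ends_sentence;
    }
    return $count;
}
```

Without the flag, "First one.  Two next." yields one number, which matches the behavior described above; with the flag set, it yields two. "one hundred" yields one number either way.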

SEE ALSO

McGrath, Charles. "Sexed Texts." New York Times Magazine, August 10, 2003. http://www.nytimes.com/2003/08/10/magazine/10WWLN.html

Ball, Philip. "Computer program detects author gender." Nature, July 18, 2003. http://www.nature.com/nsu/030714/030714-13.html

I first discovered this work at:

  http://www.bookblog.net/gender/genie.html

AUTHOR

Eugene Eric Kim, <eekim@blueoxen.org>

COPYRIGHT AND LICENSE

Copyright (c) Blue Oxen Associates 2003. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
