Lingua::EN::Gender - Guesses author's gender by analyzing text.
    use Lingua::EN::Gender;

    my $text   = "These are the days that try men's souls.";
    my $lingua = Lingua::EN::Gender->new($text);
    print $lingua->gender . "\n";  # male
Lingua::EN::Gender guesses an author's gender by analyzing text using the Koppel-Argamon algorithm.
This Perl module implements the Koppel-Argamon algorithm for guessing an author's gender. The algorithm was invented by Moshe Koppel (Bar-Ilan University, Israel) and Shlomo Argamon (Illinois Institute of Technology), and is described at:
Count the number of words in the document.
For each appearance of the following words, add the points indicated:
    "the"                          17
    "a"                             6
    "some"                          6
    number                          5
    "it"                            2
    "with"                        -14
    possessives, ending in "'s"    -5
    pronouns                       -3
    "for"                          -4
    "not"                          -4
    word ending with "n't"          4
If the total score is greater than the total number of words, the author is probably male. Otherwise, the author is probably female.
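The steps above can be sketched in a few lines of Perl. This is a minimal illustration, not the module's actual source: the function name `guess_gender`, the short pronoun list, and the digits-only treatment of numbers are all assumptions made for the example. The weights are copied from the table above.

```perl
use strict;
use warnings;

# Per-word weights, copied from the table above.
my %WORD_SCORE = (
    the  => 17,
    a    => 6,
    some => 6,
    it   => 2,
    with => -14,
    for  => -4,
    not  => -4,
);

# Assumption: a short illustrative pronoun list. The table does not say
# which pronouns are scored.
my %PRONOUN = map { $_ => 1 }
    qw(my your his her its our their mine yours hers ours theirs);

# Hypothetical helper: returns (guess, score, word count) for a text.
sub guess_gender {
    my ($text) = @_;

    # Simplified tokenizer: runs of letters, digits, and apostrophes.
    # A hyphen separates tokens, so a hyphenated word counts as two words.
    my @words = map { lc($_) } ($text =~ /[A-Za-z0-9']+/g);
    my $count = scalar @words;

    my $score = 0;
    for my $w (@words) {
        if    (exists $WORD_SCORE{$w})      { $score += $WORD_SCORE{$w} }
        elsif ($w =~ /^\d+$/)               { $score += 5 }   # digits only (assumption)
        elsif ($PRONOUN{$w})                { $score += -3 }
        elsif ($w =~ /n't$/)                { $score += 4 }   # as listed in the table
        elsif ($w =~ /'s$/ && $w ne "it's") { $score += -5 }  # possessive
    }

    my $guess = $score > $count ? 'male' : 'female';
    return ($guess, $score, $count);
}

my ($guess, $score, $count) =
    guess_gender("These are the days that try men's souls.");
print "$guess (score $score, $count words)\n";
```

For the sentence from the synopsis, "the" contributes 17 and "men's" contributes -5, giving a score of 12 against 8 words, so the sketch agrees with the module's answer of male.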
The algorithm is fairly straightforward, although there are a few twists and turns. My implementation does the following:
* Counts hyphenated words as two words.
* Knows that "it's" is not a possessive pronoun.
* Recognizes "fourty" (a common misspelling of "forty") as a number.
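The first two points in the list can be illustrated with two small helpers. This is only a sketch under assumptions: `split_words` and `is_possessive` are hypothetical names, and the regular expressions are a simplification rather than the module's real tokenizer.

```perl
use strict;
use warnings;

# Hypothetical helper: split text into lowercase words. A hyphen is not
# part of a word here, so "well-known" yields two words.
sub split_words {
    my ($text) = @_;
    return map { lc($_) } ($text =~ /[A-Za-z0-9']+/g);
}

# Hypothetical helper: a word is a possessive when it ends in "'s",
# except for the contraction "it's".
sub is_possessive {
    my ($word) = @_;
    return 0 if lc($word) eq "it's";
    return $word =~ /'s\z/i ? 1 : 0;
}

my @words = split_words("It's a well-known author's book");
print scalar(@words), " words\n";
print join(", ", grep { is_possessive($_) } @words), "\n";
```

Treating the hyphen as a separator is what makes the word count grow by one for each hyphenated compound, which matters because the final comparison is between the score and the word count.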
The biggest complication with my implementation is in how it handles numbers. If a number is immediately preceded by another number, the pair is scored as a single number, even though it is counted as two words. For example:
is counted as one number (with a score of 5) and two words. My implementation does not handle the following situation correctly:
First one. Two next.
It would count this as one number (score 5) and four words, even though it should be two numbers (score 10) and four words. It wouldn't be that difficult to handle these types of situations, but I was lazy, and I don't think it will make much of a difference. Maybe in the next version.
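The number handling described above, and one possible way to address the sentence-boundary case, can be sketched as follows. This is an assumption-laden illustration, not the module's code: `count_number_runs`, its `$break_on_sentence` flag, and the tiny number vocabulary are all hypothetical.

```perl
use strict;
use warnings;

# Assumption: a tiny illustrative vocabulary of number words.
my %IS_NUMBER = map { $_ => 1 }
    qw(one two three four five six seven eight nine ten
       twenty thirty forty fourty fifty);

# Hypothetical helper: count runs of consecutive number tokens, where
# each run scores as a single number. When $break_on_sentence is true,
# sentence-ending punctuation also ends the current run.
sub count_number_runs {
    my ($text, $break_on_sentence) = @_;
    my ($runs, $in_run) = (0, 0);

    # Tokens are either words or sentence-ending punctuation marks.
    for my $tok ($text =~ /[A-Za-z0-9']+|[.!?]/g) {
        if ($tok =~ /^[.!?]$/) {
            $in_run = 0 if $break_on_sentence;
            next;
        }
        my $is_num = $tok =~ /^\d+$/ || $IS_NUMBER{ lc $tok };
        $runs++ if $is_num && !$in_run;    # a new run starts here
        $in_run = $is_num ? 1 : 0;
    }
    return $runs;
}

my $text = "First one. Two next.";
print count_number_runs($text, 0), "\n";   # runs when punctuation is ignored
print count_number_runs($text, 1), "\n";   # runs when a sentence end breaks a run
```

With punctuation ignored, "one" and "Two" merge into a single run, which reproduces the behavior described above. Letting sentence-ending punctuation reset the run separates them into two numbers.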
McGrath, Charles. "Sexed Texts." New York Times Magazine, August 10, 2003. http://www.nytimes.com/2003/08/10/magazine/10WWLN.html
Ball, Philip. "Computer program detects author gender." Nature, July 18, 2003. http://www.nature.com/nsu/030714/030714-13.html
I first discovered this work at:
Eugene Eric Kim, <email@example.com>
Copyright (c) Blue Oxen Associates 2003. All rights reserved.
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.