NAME

Lingua::EN::GivenNames::Database::Import - An SQLite database of derivations of English given names

Synopsis

See "Synopsis" in Lingua::EN::GivenNames for a long synopsis.

See also "How do the scripts and modules interact to produce the data?" in Lingua::EN::GivenNames.

Description

Documents the methods used to populate the SQLite database, lingua.en.givennames.sqlite, which ships with this distro.

See "Description" in Lingua::EN::GivenNames for a long description.

Also, it's vital you study "How do the scripts and modules interact to produce the data?" in Lingua::EN::GivenNames. See also scripts/import.sh for the order in which they must be run.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules.html for details.

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.

Constructor and initialization

new(...) returns an object of type Lingua::EN::GivenNames::Database::Import.

This is the class's constructor.

Usage: Lingua::EN::GivenNames::Database::Import -> new().

Methods

This module is a sub-class of Lingua::EN::GivenNames::Database and consequently inherits its methods.

extract_derivations()

Extract the derivations from 1 page of either female or male English given names, and write them to data/derivations.raw.

This file is opened in append mode ('>>') on every method call, so if you wish to start from scratch, you must delete it before scripts/extract.derivations.pl is run. See scripts/import.sh for details.
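
The effect of append mode can be sketched as follows. This is an illustration only, not the module's code: the temp directory stands in for data/, and append_line() and count_lines() are hypothetical helpers.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical stand-in for data/derivations.raw, kept in a temp dir.
my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec -> catfile($dir, 'derivations.raw');

# Append one line per call, as a method opening the file with '>>' would.
sub append_line
{
	my($file_name, $line) = @_;

	open(my $fh, '>>', $file_name) || die "Can't open $file_name: $!";
	print $fh "$line\n";
	close $fh;
}

# Count the lines currently in the file.
sub count_lines
{
	my($file_name) = @_;

	open(my $fh, '<', $file_name) || die "Can't open $file_name: $!";
	my(@line) = <$fh>;
	close $fh;

	return scalar @line;
}

append_line($file, 'first run');
append_line($file, 'second run');

# Output from both runs has accumulated.
my $count_after_two_runs = count_lines($file);

# Starting from scratch means deleting the file before the next run.
unlink $file;
append_line($file, 'fresh run');

my $count_after_reset = count_lines($file);

print "After two runs: $count_after_two_runs. After reset: $count_after_reset.\n";
```

Without the unlink, the third call would simply add a third line to the old data.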

Since the input data/*.htm files contain data in alphabetical order (usually), the output is also in order.

The output file is processed by parse_derivations().

Returns 0 to indicate success.

generate_derivation($item)

Takes a hashref, $item, and constructs a string which is the derivation of the given name whose components are the values of various keys in this hashref.

The string returned depends on which regexp was used to parse the input.

See "FAQ" in Lingua::EN::GivenNames for details.

import_derivations()

Reads the file data/derivations.csv created by sub parse_derivations() by calling read_derivations().

It checks for duplicate records, and then writes all the data to the appropriate database tables.
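
A duplicate check of this kind might look like the sketch below. The records are invented, and treating sex plus name as the identifying key is an assumption made for the example; the module may compare records differently.

```perl
use strict;
use warnings;

# Hypothetical records. The real ones come from data/derivations.csv.
my @record =
(
	{name => 'ANTONY', sex => 'male'},
	{name => 'PRU',    sex => 'female'},
	{name => 'ANTONY', sex => 'male'},
);

# Assumption: a record is identified by its sex and name together.
sub find_duplicates
{
	my($record) = @_;

	my(%seen, @duplicate);

	for my $item (@$record)
	{
		my $key = "$$item{sex}/$$item{name}";

		# The first sighting sets the count to 1. Later ones are duplicates.
		push @duplicate, $key if ($seen{$key}++);
	}

	return [@duplicate];
}

my $duplicate = find_duplicates(\@record);

print "Duplicate: $_\n" for @$duplicate;
```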

Returns 0 to indicate success.

new()

See "Constructor and initialization".

parse_derivations()

Reads the file data/derivations.raw created by sub extract_derivations(), applies a set of regexps to each line, and writes data/derivations.csv.

Mismatches are written to data/mismatches.log, and a 1-line report is written to data/parse.log.
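
The try-each-regexp approach can be sketched like this. The 2 patterns are simplified and hypothetical, not the module's own, and mismatches are collected in memory here rather than written to data/mismatches.log.

```perl
use strict;
use warnings;

# Hypothetical, simplified patterns, tried in order. The module's own
# regexps are not reproduced here.
my @pattern =
(
	[a => qr/^(female|male)\. ([A-Z]+): (\S+) (form|spelling) of (.+)$/],
	[c => qr/^(female|male)\. ([A-Z]+): (.+) name meaning "(.+?)\.?"/],
);

# Return 2 arrayrefs: the lines which matched some pattern, and the rest.
sub parse_lines
{
	my($line) = @_;

	my(@matched, @mismatched);

	LINE: for my $text (@$line)
	{
		for my $entry (@pattern)
		{
			my($name, $regexp) = @$entry;
			my(@field)         = $text =~ $regexp;

			if (@field)
			{
				push @matched, {pattern => $name, field => [@field]};

				next LINE;
			}
		}

		# No pattern matched, so the line is reported, not dropped.
		push @mismatched, $text;
	}

	return (\@matched, \@mismatched);
}

my @input =
(
	'male. ANTONY: Variant spelling of English Anthony, possibly meaning "invaluable."',
	'male. HENGIST: Old English name meaning "stallion."',
	'This line matches no pattern',
);

my($matched, $mismatched) = parse_lines(\@input);

print 'Matched: ', scalar @$matched, '. Mismatched: ', scalar @$mismatched, ".\n";
```

Recording the pattern name with each match matters because the derivation is later built differently for each pattern.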

Clearly, this is where most of the work takes place.

Returns 0 to indicate success.

read_derivations()

This method is called by sub import_derivations(). It reads and validates data/derivations.csv.

Also, this method checks to ensure no data is missing, which would indicate a programming error in the handling of the output from the regexp processing phase.

Returns an arrayref.

validate_derivations($file_name, $derivation)

$file_name is the file currently being processed (data/derivations.csv), and is used for error messages.

$derivation is a hashref keyed by columns in the input file, so unique entries in each column can be checked.

This method is called by sub read_derivations(). It performs a simple reasonableness check on each input line, and also logs, at level notice, all non-ASCII names.
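
Detecting non-ASCII names needs nothing more than a character class. A minimal sketch, assuming the names are already decoded strings; non_ascii_names() is a hypothetical helper, and where the module logs at level notice, this code just returns the names.

```perl
use strict;
use warnings;

# Return an arrayref of the names containing any character outside ASCII.
sub non_ascii_names
{
	my($name) = @_;

	return [grep { /[^\x00-\x7F]/ } @$name];
}

# "ZO\x{CB}" is an invented name with a non-ASCII letter (E with diaeresis).
my $flagged = non_ascii_names(['ANTONY', "ZO\x{CB}", 'PRU']);

print 'Non-ASCII names found: ', scalar @$flagged, "\n";
```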

write_names($table, $derivation, $foreign_key)

$table is the name of the table to write, which is always names.

$derivation is an arrayref of derivations to write.

$foreign_key is a hashref of primary keys returned by "write_table($table, $item)" for each table other than the names table.

Called by sub import_derivations(). It writes the names table.

write_table($table, $item)

$table is the name of the table to write.

$item is an arrayref of values to write.

Called by sub import_derivations(). It writes all tables except the names table.

Returns a hashref of primary key ids for use as foreign keys when the names table is written.
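
The way primary keys become foreign keys can be sketched without a database. Everything here is hypothetical: assign_ids() stands in for a table write, the kind and sex columns are just 2 examples, and the kind_id and sex_id names are invented.

```perl
use strict;
use warnings;

# Stand-in for writing a lookup table: give each distinct value the next
# id, the way an auto-incrementing primary key would.
sub assign_ids
{
	my($item) = @_;

	my(%id);

	my($next) = 0;

	for my $value (@$item)
	{
		$id{$value} = ++$next if (! exists $id{$value});
	}

	return {%id};
}

my @derivation =
(
	{name => 'ALLISTAIR', kind => 'Anglicized', sex => 'male'},
	{name => 'ANTONIA',   kind => 'Feminine',   sex => 'female'},
	{name => 'ANTONY',    kind => 'Variant',    sex => 'male'},
);

# Write the lookup tables first, keeping the ids each one hands back.
my %foreign_key;

for my $column (qw(kind sex) )
{
	$foreign_key{$column} = assign_ids([map { $$_{$column} } @derivation]);
}

# Then build the names rows, replacing each value with its id.
my @names_row;

for my $item (@derivation)
{
	push @names_row,
	{
		name    => $$item{name},
		kind_id => $foreign_key{kind}{$$item{kind} },
		sex_id  => $foreign_key{sex}{$$item{sex} },
	};
}

print "$$_{name}: kind_id $$_{kind_id}, sex_id $$_{sex_id}\n" for @names_row;
```

The lookup tables must be written first, since the names rows cannot be built until every id is known.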

FAQ

See "FAQ" in Lingua::EN::GivenNames.

How is the input scanned?

The regexps in sub parse_derivations() split each line of data/derivations.raw into these fields, when using the regexp called 'a':

o $1 => Sex
o $2 => Name
o $3 => Kind
o $4 => Form
o $5 => Source
o $6 => Original
o $7 => Rating
o $8 => Meaning

These fields are described in "FAQ" in Lingua::EN::GivenNames. Other regexps have similar outputs.
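
To make the mapping from $1 .. $8 concrete, here is a sketch using a hypothetical regexp. It is not the module's pattern 'a'. It was written just to split the 2 sample lines below, and split_line() is an invented helper.

```perl
use strict;
use warnings;

# The 8 field names, in capture order $1 .. $8, lower-cased as hash keys.
my @key = qw(sex name kind form source original rating meaning);

# Hypothetical regexp, not the one used by parse_derivations().
my $regexp = qr/^(female|male)\.\s+([A-Z]+):\s+(\S+)\s+(form|spelling)\s+of\s+(.+)\s+(\S+),\s+((?:possibly\s+)?meaning)\s+"(.+?)\.?"$/;

# Return a hashref of fields, or undef if the line does not match.
sub split_line
{
	my($line)  = @_;
	my(@field) = $line =~ $regexp;

	return if (! @field);

	my(%item);

	# A hash slice pairs each key with the capture in the same position.
	@item{@key} = @field;

	return {%item};
}

my $first  = split_line('male. ALLISTAIR: Anglicized form of Scottish Gaelic Alastair, meaning "defender of mankind."');
my $second = split_line('male. ANTONY: Variant spelling of English Anthony, possibly meaning "invaluable."');

print "$_ => $$first{$_}\n" for sort keys %$first;
```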

Matches using pattern 'a'

1) 'male. ALLISTAIR: Anglicized form of Scottish Gaelic Alastair, meaning "defender of mankind."' becomes the hashref (with keys in alphabetical order, and text from data/derivations.raw):

        {
                form     => 'form',
                kind     => 'Anglicized',
                meaning  => 'defender of mankind',
                name     => 'ALLISTAIR',
                original => 'Alastair',
                rating   => 'meaning',
                sex      => 'male',
                source   => 'Scottish Gaelic',
        }

The derivation is: Anglicized form of Scottish Gaelic Alastair, meaning "defender of mankind".

2) 'male. ANTONY: Variant spelling of English Anthony, possibly meaning "invaluable."' becomes:

        {
                form     => 'spelling',
                kind     => 'Variant',
                meaning  => 'invaluable',
                name     => 'ANTONY',
                original => 'Anthony',
                rating   => 'possibly meaning',
                sex      => 'male',
                source   => 'English',
        }

The derivation is: Variant spelling of English Anthony, possibly meaning "invaluable".

In each case the derivation is built by sub generate_derivation($item) as:

        qq|$$item{kind} $$item{form} of $$item{source} $$item{original}, $$item{rating} "$$item{meaning}"|

Matches using pattern 'b'

3) 'female. ANTONIA: Feminine form of Roman Latin Antonius, possibly meaning "invaluable." In use by the English, Italians and Spanish. Compare with another form of Antonia.' becomes:

        {
                form     => 'form',
                kind     => 'Feminine',
                meaning  => 'invaluable',
                name     => 'ANTONIA',
                original => 'Antonius',
                rating   => 'possibly meaning',
                sex      => 'female',
                source   => 'Roman Latin',
        }

The derivation is: Feminine form of Roman Latin Antonius, possibly meaning "invaluable".

The derivation is built by sub generate_derivation($item) as:

        qq|$$item{kind} $$item{form} of $$item{source} $$item{original}, $$item{rating} "$$item{meaning}"|

Matches using pattern 'c'

4) 'male. HENGIST: Old English name meaning "stallion." In English legend, this is the name of the brother of Horsa, and ruler of Kent. In Arthurian legend, he was killed by Uther Pendragon.' becomes:

        {
                form     => 'name',
                kind     => 'Old English',
                meaning  => 'stallion',
                name     => 'HENGIST',
                original => '-',
                rating   => 'meaning',
                sex      => 'male',
                source   => '-',
        }

The derivation is: Old English name, meaning "stallion".

The derivation is built by sub generate_derivation($item) as:

        qq|$$item{kind} $$item{form}, $$item{rating} "$$item{meaning}"|

Matches using pattern 'd'

5) 'female. PRU: Short form of English Prudence "cautious" and Prunella "little prune."' becomes:

        {
                form     => 'form',
                kind     => 'Short',
                meaning  => '"cautious" and Prunella "little prune"',
                name     => 'PRU',
                original => 'Prudence',
                rating   => 'meaning',
                sex      => 'female',
                source   => 'English',
        }

The derivation is: Short form of English Prudence, meaning "cautious" and Prunella "little prune".

The derivation is built by sub generate_derivation($item) as:

        qq|$$item{kind} $$item{form} of $$item{source} $$item{original}, $$item{rating} $$item{meaning}|
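
The 4 cases can be pulled together in one sketch which reproduces the derivations quoted above. It is an illustration, not the module's generate_derivation(): build_derivation() is an invented name, and passing the pattern name as a separate parameter is an assumption. Quotes are added round the meaning for patterns 'a', 'b' and 'c', since the quoted derivations show them but the hashref values do not. For pattern 'd' the meaning already contains its own quotes.

```perl
use strict;
use warnings;

# Build a derivation string from a pattern name and a hashref of fields.
sub build_derivation
{
	my($pattern, $item) = @_;

	# Pattern 'c' has no source or original, so they are left out.
	return qq|$$item{kind} $$item{form}, $$item{rating} "$$item{meaning}"| if ($pattern eq 'c');

	# Pattern 'd' stores a meaning which is already quoted.
	my($meaning) = $pattern eq 'd' ? $$item{meaning} : qq|"$$item{meaning}"|;

	return qq|$$item{kind} $$item{form} of $$item{source} $$item{original}, $$item{rating} $meaning|;
}

my $hengist = build_derivation
(
	'c',
	{
		form     => 'name',
		kind     => 'Old English',
		meaning  => 'stallion',
		name     => 'HENGIST',
		original => '-',
		rating   => 'meaning',
		sex      => 'male',
		source   => '-',
	}
);

my $pru = build_derivation
(
	'd',
	{
		form     => 'form',
		kind     => 'Short',
		meaning  => '"cautious" and Prunella "little prune"',
		name     => 'PRU',
		original => 'Prudence',
		rating   => 'meaning',
		sex      => 'female',
		source   => 'English',
	}
);

print "$hengist\n$pru\n";
```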

References

See "References" in Lingua::EN::GivenNames.

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=Lingua::EN::GivenNames.

Author

Lingua::EN::GivenNames was written by Ron Savage <ron@savage.net.au> in 2012.

Home page: http://savage.net.au/index.html.

Copyright

Australian copyright (c) 2012 Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html