NAME

Lingua::Diversity::SamplingScheme - storing the parameters of a sampling scheme

VERSION

This documentation refers to Lingua::Diversity::SamplingScheme version 0.02.

SYNOPSIS

    # Lingua::Diversity::SamplingScheme is used by Lingua::Diversity::Variety.
    use Lingua::Diversity::Variety;
    
    # Create a new sampling scheme...
    my $sampling_scheme = Lingua::Diversity::SamplingScheme->new(
        'mode'              => 'segmental',
        'subsample_size'    => 100,
    );
    
    # ... Then apply it to a Lingua::Diversity::Variety object.
    Lingua::Diversity::Variety->new(
        'transform'         => 'type_token_ratio',
        'sampling_scheme'   => $sampling_scheme,
    );

DESCRIPTION

This class serves as storage for a set of parameters defining a sampling scheme (to be used with a Lingua::Diversity::Variety object). Such a scheme is meant to describe the kind of resampling that should be applied as well as the number of subsamples and their size.

CREATOR

The creator (new()) returns a new Lingua::Diversity::SamplingScheme object. It takes one required and two optional named parameters:

subsample_size (required)

The requested number of unit tokens per subsample (a positive integer).

num_subsamples

The number of subsamples to be drawn (a positive integer). Default is 100. Note that this parameter has no effect in segmental mode (see below), since in this case the number of subsamples is the result of the integer division of text length by requested subsample size.

mode

Either 'random' (default) or 'segmental'.

Value 'random' means that (i) the order of unit tokens in the text is preserved within each subsample, and (ii) the probability for a unit token to occur in a given subsample depends only on the requested subsample size (see subsample_size above). For example, from the text 'say you say me', the following subsamples of size 3 (and only these) could be generated, each with the same probability: 'say you say', 'say you me', 'say say me', and 'you say me'.
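The behaviour described for 'random' mode can be illustrated with a short, self-contained sketch. It is not taken from the module's source: random_subsample() is a hypothetical helper written for this example, and it assumes that the requested subsample size does not exceed the number of unit tokens. It draws a uniformly random set of positions and then restores the text order.

```perl
use strict;
use warnings;
use List::Util qw( shuffle );

# Hypothetical helper (not part of Lingua::Diversity): draw one random
# subsample of $subsample_size unit tokens while preserving text order.
# Assumes $subsample_size <= number of unit tokens.
sub random_subsample {
    my ( $units_ref, $subsample_size ) = @_;

    # Shuffle all positions and keep the first $subsample_size of them,
    # so that every set of positions is equally likely.
    my @shuffled  = shuffle( 0 .. $#{$units_ref} );
    my @positions = @shuffled[ 0 .. $subsample_size - 1 ];

    # Sort the chosen positions to restore the order of the text.
    return @{$units_ref}[ sort { $a <=> $b } @positions ];
}

# Draw a few subsamples of size 3 from the example text.
my @text = qw( say you say me );
print join( ' ', random_subsample( \@text, 3 ) ), "\n" for 1 .. 5;
```

Each printed line is one of the four subsamples listed above. Which ones appear varies from run to run.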

Value 'segmental' means that subsamples should be contiguous, non-overlapping sequences of units in the original text. For example, the text 'say you say me' would give rise to exactly two subsamples of size 2: 'say you' and 'say me'. Incomplete subsamples at the end of the text are ignored, so that a subsample size of 3 would produce a single subsample in this example (i.e. 'say you say'). Note that in this mode, it is assumed that the unit and category arrays are in the text's order.
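The 'segmental' mode can likewise be sketched in plain Perl. Again, this is only an illustration under stated assumptions, not the module's implementation: segmental_subsamples() is a hypothetical helper, and the number of subsamples is obtained by integer division of the text length by the subsample size, as described for num_subsamples above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::Diversity): split a list of unit
# tokens into contiguous, non-overlapping subsamples of a given size.
sub segmental_subsamples {
    my ( $units_ref, $subsample_size ) = @_;

    # Integer division: an incomplete trailing subsample is ignored.
    my $num_subsamples = int( @$units_ref / $subsample_size );

    my @subsamples;
    for my $i ( 0 .. $num_subsamples - 1 ) {
        my $start = $i * $subsample_size;
        my $end   = $start + $subsample_size - 1;
        push @subsamples, [ @{$units_ref}[ $start .. $end ] ];
    }
    return @subsamples;
}

# Print the subsamples of size 2 for the example text, one per line.
my @text = qw( say you say me );
print join( ' ', @$_ ), "\n" for segmental_subsamples( \@text, 2 );
```

With a subsample size of 3 the same helper returns a single subsample, since the trailing 'me' does not fill a complete segment.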

ACCESSORS

get_subsample_size() and set_subsample_size()

Getter and setter for the subsample_size attribute.

get_num_subsamples() and set_num_subsamples()

Getter and setter for the num_subsamples attribute.

get_mode() and set_mode()

Getter and setter for the mode attribute.

DEPENDENCIES

This module is part of the Lingua::Diversity distribution.

BUGS AND LIMITATIONS

There are no known bugs in this module.

Please report problems to Aris Xanthos (aris.xanthos@unil.ch).

Patches are welcome.

AUTHOR

Aris Xanthos (aris.xanthos@unil.ch)

LICENSE AND COPYRIGHT

Copyright (c) 2011 Aris Xanthos (aris.xanthos@unil.ch).

This program is released under the GPL license (see http://www.gnu.org/licenses/gpl.html).

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

Lingua::Diversity and Lingua::Diversity::Variety