Module Version: 2006.0704

NAME

Squeeze.pm - Shorten text to minimum syllables by using hash table lookup and vowel deletion

REVISION

$Id: Squeeze.pm,v 1.8 2005-12-05 09:02:49 jaalto Exp $

SYNOPSIS

    use Lingua::EN::Squeeze;            # import only function
    use Lingua::EN::Squeeze qw( :ALL ); # import all functions and variables
    use English;                        # to use readable variable names

    while (<>)
    {
        print "Original: $ARG\n";
        print "Squeezed: ", SqueezeText lc $ARG;
    }

    #  Or you can use the object-oriented interface

    my $squeeze = Lingua::EN::Squeeze->new;

    while (<>)
    {
        print "Original: $ARG\n";
        print "Squeezed: ", $squeeze->SqueezeText(lc $ARG);
    }

DESCRIPTION

Squeezes English text into the most compact format possible, to the point that it is barely readable. Be sure to convert all text to lowercase before calling SqueezeText() for maximum compression, because the optimizations have been designed mostly for uncapitalized letters.

Warning: Each line is processed multiple times, so be prepared for slow conversion.

You can use this module e.g. to preprocess text before it is sent to electronic media that has some maximum text size limit. For example, pagers have an arbitrary text size limit, typically around 200 characters, which you want to fill as much as possible. Alternatively, you may have a GSM cellular phone which is capable of receiving Short Messages (SMS), whose message size limit is 160 characters. As a demonstration of this module's SqueezeText() function, this paragraph's conversion result is presented below. See for yourself if it's readable (yes, it takes some time to get used to). The compression ratio is typically 30-40%.

    u _n use thi mod e.g. to prprce txt bfre i_s snt to
    elrnic mda has som max txt siz lim. f_xmple pag
    hv  abitry txt siz lim, tpcly 200 chr, W/ u wnt
    to fll as mch as psbleAlternatvly u may hv GSM cllar P8
    w_s cpble of rcivng Short msg (SMS), WS/ msg siz
    lim is 160 chr. 4 demonstrton of thi mods SquezText
    fnc ,  dsc txt of thi prgra has ben cnvd_ blow
    See uself if i_s redble (Yes, it tak som T to get usdto
    compr rat is tpcly 30-40

And if $SQZ_OPTIMIZE_LEVEL is set to non-zero:

    u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
    elrnicMdaHasSomMaxTxtSizLim.F_xmplePag
    hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt
    toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8
    w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz
    limIs160Chr.4DemonstrtonOfThiModsSquezText
    fnc,DscTxtOfThiPrgraHasBenCnvd_Blow
    SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto
    comprRatIsTpcly30-40

A comparison of these two shows:

    Original text   : 627 characters
    Level 0         : 433 characters    reduction 31 %
    Level 1         : 345 characters    reduction 45 %  (+14% improvement)
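
The reduction figures follow directly from the character counts. A minimal sketch of the arithmetic, where the helper name `reduction_percent` is made up for this illustration and is not part of the module:

```perl
use strict;
use warnings;

# Hypothetical helper: percentage by which $squeezed is shorter than $original.
sub reduction_percent
{
    my ($original, $squeezed) = @_;
    return 100 * (1 - $squeezed / $original);
}

my $original = 627;     # character counts taken from the comparison above

printf "Level 0: %.0f %%\n", reduction_percent($original, 433);   # Level 0: 31 %
printf "Level 1: %.0f %%\n", reduction_percent($original, 345);   # Level 1: 45 %
```

The "+14%" in the comparison is the difference between the two reductions in percentage points.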

There are a few grammar rules which are used to shorten some English tokens considerably:

    A word that has _ is usually a verb

    A word that has / is usually a substantive, noun,
                    pronoun or other non-verb

Read the following substitution tokens in order to understand the basics of the converted text. Hopefully, after some practice, the text is not pure Geek code (tm) to you. With Geek code (like G++L--J) you would need an external parser to understand it; here only some common sense and time are needed to adapt to the compressed format. For a complete, up-to-date list, you are better off looking at the source code.

    automatically => 'acly_'

    for           => 4
    for him       => 4h
    for her       => 4h
    for them      => 4t
    for those     => 4t

    can           => _n
    does          => _s

    it is         => i_s
    that is       => t_s
    which is      => w_s
    that are      => t_r
    which are     => w_r

    less          => -/
    more          => +/
    most          => ++

    however       => h/ver
    think         => thk_

    useful        => usful

    you           => u
    your          => u/
    you'd         => u/d
    you'll        => u/l
    they          => t/
    their         => t/r

    will          => /w
    would         => /d
    with          => w/
    without       => w/o
    which         => W/
    whose         => WS/

Time is expressed with capital letters:

    time          => T
    minute        => MIN
    second        => SEC
    hour          => HH
    day           => DD
    month         => MM
    year          => YY

Other capital-letter abbreviations; think of the 8 as a picture of the speaker and the microphone:

    phone         => P8

EXAMPLES

To add new words to, for example, the word conversion hash table, define a custom set and merge it with the existing ones. Do the same for %SQZ_WXLATE_MULTI_HASH and $SQZ_ZAP_REGEXP, and then start using the conversion function.

    use English;
    use Lingua::EN::Squeeze qw( :ALL );

    #   Keys that contain a hyphen must be quoted

    my %myExtraWordHash =
    (
          'new-word1'  => 'conversion1'
        , 'new-word2'  => 'conversion2'
        , 'new-word3'  => 'conversion3'
        , 'new-word4'  => 'conversion4'
    );

    #   First take the existing tables and merge them with the above
    #   translation table

    my %myCustomWordHash =
    (
          %SQZ_WXLATE_HASH
        , %SQZ_WXLATE_EXTRA_HASH
        , %myExtraWordHash
    );

    my $myXlat = 0;                             # state flag

    while (<>)
    {
        #   $condition is a placeholder for your own test

        if ( not $myXlat and $condition )
        {
            SqueezeHashSet \%myCustomWordHash;  # Use MY conversions
            $myXlat = 1;
        }
        elsif ( $myXlat and not $condition )
        {
            SqueezeHashSet "reset";             # Back to default table
            $myXlat = 0;
        }

        print SqueezeText $ARG;
    }

Similarly, you can redefine the multi-word translation table by supplying another hash reference in the call to SqueezeHashSet(). To delete more text immediately in addition to the defaults, just concatenate regexps to the variable $SQZ_ZAP_REGEXP.

KNOWN BUGS

There may be a lot of false conversions. If you think that some word squeezing went too far, please 1) turn on debug, 2) send your example text and 3) send the debug log to the maintainer. To see how the conversion goes, e.g. for the word Messages:

    use English;
    use Lingua::EN::Squeeze;

    #   Activate debug when the case-insensitive word "Messages" is found on
    #   the line.

    SqueezeDebug( 1, '(?i)Messages' );

    $ARG = "This line has some Messages in it";
    print SqueezeText $ARG;

EXPORTABLE VARIABLES

The defaults may not apply to all types of text, so you may wish to extend the hash tables and $SQZ_ZAP_REGEXP to cope with your typical text.

$SQZ_ZAP_REGEXP

Text to delete immediately, like "Hm, Hi, Hello...". You can only set this once, because the regexp is compiled when SqueezeText() is called for the first time.
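
As a rough, self-contained illustration of extending a kill pattern by concatenation: the sketch below uses a local `$zap` string as a stand-in for $SQZ_ZAP_REGEXP and assumes the variable holds a plain regexp string of alternatives. The default value shown and the helper `zap_text` are invented for the example and do not come from the module.

```perl
use strict;
use warnings;

# Stand-in for $SQZ_ZAP_REGEXP; the module's real default value is different.
my $zap = '\b(?:hm|hi|hello)\b';

# Append one more alternative. With the real variable this would have to
# happen before the first SqueezeText() call, because the regexp is
# compiled only once.
$zap .= '|\bcheers\b';

# Hypothetical helper: delete everything the pattern matches, then tidy spaces.
sub zap_text
{
    my ($text) = @_;
    $text =~ s/$zap//gi;        # delete the unwanted words
    $text =~ s/\s+/ /g;         # collapse runs of whitespace
    $text =~ s/^ | $//g;        # trim both ends
    return $text;
}

print zap_text("hello see you at noon cheers"), "\n";   # see you at noon
```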

$SQZ_OPTIMIZE_LEVEL

This controls how optimized the text will be. Currently there are only level 0 (default) and level 1. Level 1 removes all spaces. That usually improves compression by an average of 10%, but the text is harder to read. If space is really tight, use this extended compression optimization.
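
Judging from the sample output in DESCRIPTION, level 1 also capitalizes the character that followed each removed space, which keeps word boundaries visible. The sketch below only mimics that visible effect; `join_words` is a hypothetical helper and this is an inference from the sample, not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical helper: drop each run of spaces and capitalize the
# character that followed it (non-letters such as "_" stay unchanged).
sub join_words
{
    my ($text) = @_;
    $text =~ s/ +(\S)/\u$1/g;
    return $text;
}

print join_words("u _n use thi mod e.g. to prprce txt"), "\n";
# u_nUseThiModE.g.ToPrprceTxt
```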

%SQZ_WXLATE_MULTI_HASH

Multi Word conversion hash table: "for you" => "4u" ...

%SQZ_WXLATE_HASH

Single Word conversion hash table: word => conversion. This table is applied after %SQZ_WXLATE_MULTI_HASH has been used.

%SQZ_WXLATE_EXTRA_HASH

Aggressive single-word conversions, like without => w/o, which are applied last.
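
The order in which the three tables are applied matters. The following self-contained sketch uses tiny stand-in tables with a few entries copied from DESCRIPTION; the helpers `apply_table` and `squeeze_sketch` are hypothetical and much simpler than whatever SqueezeText() really does.

```perl
use strict;
use warnings;

# Tiny stand-in tables; the module's real tables are much larger.
my %multi  = ( 'it is' => 'i_s', 'which is' => 'w_s' );
my %single = ( 'you' => 'u', 'which' => 'W/', 'can' => '_n' );
my %extra  = ( 'without' => 'w/o' );

# Hypothetical helper: apply one table to the text, longest keys first.
sub apply_table
{
    my ($text, $table) = @_;
    for my $key (sort { length($b) <=> length($a) } keys %$table)
    {
        $text =~ s/\b\Q$key\E\b/$table->{$key}/g;
    }
    return $text;
}

# Multi-word table first, then single words, then the aggressive extras.
sub squeeze_sketch
{
    my ($text) = @_;
    $text = apply_table($text, $_) for \%multi, \%single, \%extra;
    return $text;
}

print squeeze_sketch("you can see which is used without it"), "\n";
# u _n see w_s used w/o it
```

If the single-word table ran first in this sketch, "which" would already have become "W/" and the multi-word entry "which is" could no longer match.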

INTERFACE FUNCTIONS

SqueezeObjectArg($)

Description

Return the subroutine arguments in both the function and the object case. This is a wrapper utility that makes the package work as a function library as well as an OO class.

@list

List of arguments. Usually the first one is the object if the class interface is used.

Return values

Return arguments without the first object parameter.

SqueezeText($)

Description

Squeeze text by using vowel substitutions and deletions and hash tables that guide text substitutions. The line is parsed multiple times, which takes some time.

arg1: $text

String. Line of Text.

Return values

String, squeezed text.
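
To give a feel for the vowel deletion part, here is a deliberately naive sketch that deletes every lowercase vowel with a letter on both sides. The helper `drop_inner_vowels` is hypothetical; the module's real rules are more elaborate and produce different results for many words.

```perl
use strict;
use warnings;

# Naive illustration only: delete a vowel when it sits between two
# lowercase letters, so leading and trailing vowels survive.
sub drop_inner_vowels
{
    my ($text) = @_;
    $text =~ s/(?<=[a-z])[aeiou](?=[a-z])//g;
    return $text;
}

print drop_inner_vowels("text before"), "\n";   # txt bfre
```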

new()

Description

Return new class object.

Return values

Object.

SqueezeHashSet($;$)

Description

Set the hash tables to use for converting text. The multi-word conversion is done first, followed by the single-word conversions.

arg1: \%wordHashRef

Reference to a hash to be used to convert single words. If "reset", use the default hash table.

arg2: \%multiHashRef [optional]

Reference to a hash to be used to convert multiple words. If "reset", use the default hash table.

Return values

None.

SqueezeControl(;$)

Description

Select the level of compression: no conversion, conversion enabled, medium, or maximum.

arg1: $state

String. If omitted, use the maximum squeeze level. Other accepted string values are:

    noconv      Turn off squeeze
    conv        Turn on squeeze
    med         Set squeezing level to medium
    max         Set squeezing level to maximum

Return values

None.

SqueezeDebug(;$$)

Description

Activate or deactivate debug.

arg1: $state [optional]

If not given, turn debug off. If non-zero, turn debug on. You must also supply a regexp when you turn on debug, unless you have given one previously.

arg2: $regexp [optional]

If given, use regexp to trigger debug output when debug is on.

Return values

None.

AVAILABILITY

The latest version of this module can be found at CPAN/modules/by-module/Lingua/

AUTHOR

Copyright (C) 1998-2005 Jari Aalto. This is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License v2 or later.
