Daniel Perrett > WordLists > WordLists::Sort

Download:
WordLists-0.013.tar.gz

Annotate this POD

CPAN RT

New  1
Open  0
View/Report Bugs
Source  

NAME ^

WordLists::Sort

SYNOPSIS ^

Provides a structure for comparison functions, generally for complex sort.

        # The following sorts "No6" "No.7" "no 8" in that order - ignoring punctuation.
        @sorted = sort { atomic_compare (
                $a,$b,{ n => sub{ $_[0]=~s/[^[:alnum:]]//g; lc $_[0]; } } 
        ) } @unsorted;
        
        # The following sorts A9 before A10.
        @sorted = sort { atomic_compare ( 
                $a,$b,{ t => [ { re => qr/[0-9]+/, c => sub { $_[0] <=> $_[1]; } }, ], } } 
        ) } @unsorted; 

DESCRIPTION ^

This is by far and away the most evil member of the Wordlists family (it's also pretty much unrelated to all the others). It is basically a terse way of writing complex comparison/sort functions as one liners (if you want to).

The intention is to be able to sort by several different criteria, e.g. so "the UN" sorts after "un-" and before "unabashed", and/or so that "F\x{E9}" sorts after "Fe" but before "FE".

Once you've written/cribbed a sort algorithm, it's easy to use - just put it in a subroutine and call it. (Actually, what you're writing is a comparison algrithm, which perl's sort then calls).

Writing it is a bit harder, though: the framework involves (potentially) anonymous coderefs sprinkled amidst the hashrefs - it's much easier with indentation.

FUNCTIONS ^

atomic_compare

atomic_compare: This provides most of the functionality in the module. It allows normalisation of the arguments, tokenisation so that different sections can be compared with different criteria, and, if so desired, flipping of the result.

Function arguments

n: Normalise. This should be a coderef. If present, runs the code on each argument before comparison. Note that this only happens locally to the function, so lowercasing in functions[1]{n} will not prevent functions[2] putting VAT before vat. (If you want to keep them, nest the original function in the c). If n is an arrayref, it runs the first code on $a, the second on $b.

t: Tokenize. An arrayref containing hashrefs, each of which is attempted in order. In each hashref should be a regex keyed to re which will match in case you want do different comparisons on different types of data. Permitted values, other than coderefs, are 0 (e.g. {re=>qr/\d/, 'c'=>0} means 1 and 9 are equivalent), -1 or 1 (meaning that if this token is discovered at the same location, $a or $b always wins - NB that this is to be avoided in sort functions).

f: Flip. if set to 1, then the result is reversed (-1 becomes 1 and vice versa but 0 stays the same).

c: Comparison to use for text which doesn't match a token. Default behaviour is to use a wrapper for cmp.

Below is an atomic_compare function for sorting names of units in a workbook. You want them to appear in the order Welcome, Unit 1, Unit 2, ... Unit 11, but perl's cmp would put "Welcome" at the end and sorts "Unit 11" before "Unit 2". The normalisation is a hack to pretend 'Welcome' is equivalent to "Unit 0", and the tokenisation instructs that series of digits should be compared as numbers and not as strings, so 10 and 11 now sort after 9.

        atomic_compare (
                $a, $b,
                {
                        n => sub{
                                $_[0] =~ s/^Welcome$/Unit 0/i; 
                                lc $_[0];
                        },
                        t =>
                        [
                                {
                                        re => qr/[0-9]+/, 
                                        c => sub { $_[0] <=> $_[1]; } 
                                },
                        ],
                }
        );

complex_compare

complex_compare allows a user to perform several successive comparisons, returning a value as soon as but only when a nonzero result is achieved. This is useful for situations such as:

complex_compare is pretty much equivalent to atomic_compare(...) || atomic_compare(...) || ..., but is potentially less confusing than using the || or or operators, and may be easier to code than repeating the atomic_compare, $a, $b.

Function arguments

override_eq: Prevents the function returning 0 immediately when the strings are identical. (The arguments are ordinarily tested for string equality at the beginning, in order to prevent unnecessary processing.) Setting this flag is only necessary if you have a condition which forces $a or $b to win for certain strings which are equal.

functions: In <complex_compare>, an arrayref containing a hashref equivalent to the third value in atomic_compare. Each function performs a comparison which executes and returns 0, 1, or -1. If the result of any function is nonzero, that is the return value. If it is zero, the next function is tried.

        complex_compare ($a, $b, {
                functions =>
                [
                        {
                                n => sub{lc $_[0];}, #is \&lc possible?
                        },
                        {
                                n => sub{$_[0] =~ s/^the\s//; $_[0];},
                                t => 
                                        [
                                                { qr/[^[:alpha:]]/ => 0 },
                                        ],
                                f => 1,
                                c => sub { $_[0] cmp $ [1] },
                        },
                ]
        })

naive_collate

Performs a collation on an arrayref. Provide a) the data, b) a comparison function which returns 0 if the two comparands are duplicates, and c) a merge function which takes the first and later duplicate.

NB: This is slow, use either sorted_collate or schwartzian_collate unless you have a good reason for using this function (e.g. your comparison function is unstable and doesn't function like cmp). To discourage casual use, it is not exported.

sorted_collate

Like naive_collate, but faster.

It is faster because rather than comparing every element against every other element that hasn't already been collated, it sorts once first, then compares against following elements until it finds one which doesn't match, then stops. The list is then returned to its original order (except without the duplicates).

schwartzian_collate

Like sorted_collate, but uses a Schwartzian transform: after the comparison function, provide a normalisation function.

It is about as fast as the naive_collate, but can be several times faster when the normalisation function is complex.

TODO ^

BUGS ^

Please use the Github issues tracker.

LICENSE ^

Copyright 2011-2012 © Cambridge University Press. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

syntax highlighting: