The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

WordLists::Sort

SYNOPSIS

Provides a structure for comparison functions, generally for complex sort.

        # The following sorts "No6" "No.7" "no 8" in that order - ignoring punctuation.
        @sorted = sort { atomic_compare (
                $a,$b,{ n => sub{ $_[0]=~s/[^[:alnum:]]//g; lc $_[0]; } } 
        ) } @unsorted;
        
        # The following sorts A9 before A10.
        @sorted = sort { atomic_compare ( 
                $a,$b,{ t => [ { re => qr/[0-9]+/, c => sub { $_[0] <=> $_[1]; } }, ], } } 
        ) } @unsorted; 

DESCRIPTION

This is by far and away the most evil member of the Wordlists family (it's also pretty much unrelated to all the others). It is basically a terse way of writing complex comparison/sort functions as one liners (if you want to).

The intention is to be able to sort by several different criteria, e.g. so "the UN" sorts after "un-" and before "unabashed", and/or so that "F\x{E9}" sorts after "Fe" but before "FE".

Once you've written/cribbed a sort algorithm, it's easy to use - just put it in a subroutine and call it. (Actually, what you're writing is a comparison algrithm, which perl's sort then calls).

Writing it is a bit harder, though: the framework involves (potentially) anonymous coderefs sprinkled amidst the hashrefs - it's much easier with indentation.

FUNCTIONS

atomic_compare

atomic_compare: This provides most of the functionality in the module. It allows normalisation of the arguments, tokenisation so that different sections can be compared with different criteria, and, if so desired, flipping of the result.

Function arguments

n: Normalise. This should be a coderef. If present, runs the code on each argument before comparison. Note that this only happens locally to the function, so lowercasing in functions[1]{n} will not prevent functions[2] putting VAT before vat. (If you want to keep them, nest the original function in the c). If n is an arrayref, it runs the first code on $a, the second on $b.

t: Tokenize. An arrayref containing hashrefs, each of which is attempted in order. In each hashref should be a regex keyed to re which will match in case you want do different comparisons on different types of data. Permitted values, other than coderefs, are 0 (e.g. {re=>qr/\d/, 'c'=>0} means 1 and 9 are equivalent), -1 or 1 (meaning that if this token is discovered at the same location, $a or $b always wins - NB that this is to be avoided in sort functions).

f: Flip. if set to 1, then the result is reversed (-1 becomes 1 and vice versa but 0 stays the same).

c: Comparison to use for text which doesn't match a token. Default behaviour is to use a wrapper for cmp.

Below is an atomic_compare function for sorting names of units in a workbook. You want them to appear in the order Welcome, Unit 1, Unit 2, ... Unit 11, but perl's cmp would put "Welcome" at the end and sorts "Unit 11" before "Unit 2". The normalisation is a hack to pretend 'Welcome' is equivalent to "Unit 0", and the tokenisation instructs that series of digits should be compared as numbers and not as strings, so 10 and 11 now sort after 9.

        atomic_compare (
                $a, $b,
                {
                        n => sub{
                                $_[0] =~ s/^Welcome$/Unit 0/i; 
                                lc $_[0];
                        },
                        t =>
                        [
                                {
                                        re => qr/[0-9]+/, 
                                        c => sub { $_[0] <=> $_[1]; } 
                                },
                        ],
                }
        );

complex_compare

complex_compare allows a user to perform several successive comparisons, returning a value as soon as but only when a nonzero result is achieved. This is useful for situations such as:

  • For user-facing sorting, such as dictionary headwords where "the Internet" should normally sort after "internet" and not after "theft".

  • For sorting which requires heavy processing only in some cases, e.g. an identifier a_492 always sorts before b_1, whatever the numerical values, but to compare a_491 and a_492 an external resource (lookup table, AJAX data, etc.) must be consulted.

  • Where certain values have high priority, e.g. if you'd like to see the string 'undef' appear before the sting '0'.

complex_compare is pretty much equivalent to atomic_compare(...) || atomic_compare(...) || ..., but is potentially less confusing than using the || or or operators, and may be easier to code than repeating the atomic_compare, $a, $b.

Function arguments

override_eq: Prevents the function returning 0 immediately when the strings are identical. (The arguments are ordinarily tested for string equality at the beginning, in order to prevent unnecessary processing.) Setting this flag is only necessary if you have a condition which forces $a or $b to win for certain strings which are equal.

functions: In <complex_compare>, an arrayref containing a hashref equivalent to the third value in atomic_compare. Each function performs a comparison which executes and returns 0, 1, or -1. If the result of any function is nonzero, that is the return value. If it is zero, the next function is tried.

        complex_compare ($a, $b, {
                functions =>
                [
                        {
                                n => sub{lc $_[0];}, #is \&lc possible?
                        },
                        {
                                n => sub{$_[0] =~ s/^the\s//; $_[0];},
                                t => 
                                        [
                                                { qr/[^[:alpha:]]/ => 0 },
                                        ],
                                f => 1,
                                c => sub { $_[0] cmp $ [1] },
                        },
                ]
        })

naive_collate

Performs a collation on an arrayref. Provide a) the data, b) a comparison function which returns 0 if the two comparands are duplicates, and c) a merge function which takes the first and later duplicate.

NB: This is slow, use either sorted_collate or schwartzian_collate unless you have a good reason for using this function (e.g. your comparison function is unstable and doesn't function like cmp). To discourage casual use, it is not exported.

sorted_collate

Like naive_collate, but faster.

It is faster because rather than comparing every element against every other element that hasn't already been collated, it sorts once first, then compares against following elements until it finds one which doesn't match, then stops. The list is then returned to its original order (except without the duplicates).

schwartzian_collate

Like sorted_collate, but uses a Schwartzian transform: after the comparison function, provide a normalisation function.

It is about as fast as the naive_collate, but can be several times faster when the normalisation function is complex.

TODO

  • Add lots more dwimmery so that qr// expressions can be used wherever they're likely to be useful and coderefs can be substituted in for regexes where their function is to test.

  • Make priority lists a bit easier, e.g. by allowing regexes - or even strings - amongst the hashrefs in t.

  • Write some good, sensible examples for complex_compare.

  • Gather useful common comparison functions which can be imported, studied, borrowed, etc. and offer them as a module.

  • Write more test cases.

  • Possibly remove assumptions about the comparanda, i.e. permit comparison of objects, references, etc. (But then: how does tokenisation work? Maybe it only works if n=>sub{$_[0]->toString}? Wouldn't we want to compare hashrefs and arrayrefs more intelligently?)

  • Figure out how to get atomic_compare and complex_compare to work with Schwartzian transforms.

BUGS

Please use the Github issues tracker.

LICENSE

Copyright 2011-2012 © Cambridge University Press. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.