WordLists::Sort
Provides a structure for comparison functions, generally for complex sort.
# The following sorts "No6" "No.7" "no 8" in that order - ignoring punctuation. @sorted = sort { atomic_compare ( $a,$b,{ n => sub{ $_[0]=~s/[^[:alnum:]]//g; lc $_[0]; } } ) } @unsorted; # The following sorts A9 before A10. @sorted = sort { atomic_compare ( $a,$b,{ t => [ { re => qr/[0-9]+/, c => sub { $_[0] <=> $_[1]; } }, ], } } ) } @unsorted;
This is by far and away the most evil member of the Wordlists family (it's also pretty much unrelated to all the others). It is basically a terse way of writing complex comparison/sort functions as one liners (if you want to).
The intention is to be able to sort by several different criteria, e.g. so "the UN" sorts after "un-" and before "unabashed", and/or so that "F\x{E9}" sorts after "Fe" but before "FE".
Once you've written/cribbed a sort algorithm, it's easy to use - just put it in a subroutine and call it. (Actually, what you're writing is a comparison algrithm, which perl's sort then calls).
sort
Writing it is a bit harder, though: the framework involves (potentially) anonymous coderefs sprinkled amidst the hashrefs - it's much easier with indentation.
atomic_compare: This provides most of the functionality in the module. It allows normalisation of the arguments, tokenisation so that different sections can be compared with different criteria, and, if so desired, flipping of the result.
atomic_compare
n: Normalise. This should be a coderef. If present, runs the code on each argument before comparison. Note that this only happens locally to the function, so lowercasing in functions[1]{n} will not prevent functions[2] putting VAT before vat. (If you want to keep them, nest the original function in the c). If n is an arrayref, it runs the first code on $a, the second on $b.
n
$a
$b
t: Tokenize. An arrayref containing hashrefs, each of which is attempted in order. In each hashref should be a regex keyed to re which will match in case you want do different comparisons on different types of data. Permitted values, other than coderefs, are 0 (e.g. {re=>qr/\d/, 'c'=>0} means 1 and 9 are equivalent), -1 or 1 (meaning that if this token is discovered at the same location, $a or $b always wins - NB that this is to be avoided in sort functions).
t
re
{re=>qr/\d/, 'c'=>0}
f: Flip. if set to 1, then the result is reversed (-1 becomes 1 and vice versa but 0 stays the same).
f
c: Comparison to use for text which doesn't match a token. Default behaviour is to use a wrapper for cmp.
c
cmp
Below is an atomic_compare function for sorting names of units in a workbook. You want them to appear in the order Welcome, Unit 1, Unit 2, ... Unit 11, but perl's cmp would put "Welcome" at the end and sorts "Unit 11" before "Unit 2". The normalisation is a hack to pretend 'Welcome' is equivalent to "Unit 0", and the tokenisation instructs that series of digits should be compared as numbers and not as strings, so 10 and 11 now sort after 9.
atomic_compare ( $a, $b, { n => sub{ $_[0] =~ s/^Welcome$/Unit 0/i; lc $_[0]; }, t => [ { re => qr/[0-9]+/, c => sub { $_[0] <=> $_[1]; } }, ], } );
complex_compare allows a user to perform several successive comparisons, returning a value as soon as but only when a nonzero result is achieved. This is useful for situations such as:
complex_compare
For user-facing sorting, such as dictionary headwords where "the Internet" should normally sort after "internet" and not after "theft".
For sorting which requires heavy processing only in some cases, e.g. an identifier a_492 always sorts before b_1, whatever the numerical values, but to compare a_491 and a_492 an external resource (lookup table, AJAX data, etc.) must be consulted.
a_492
b_1
a_491
Where certain values have high priority, e.g. if you'd like to see the string 'undef' appear before the sting '0'.
complex_compare is pretty much equivalent to atomic_compare(...) || atomic_compare(...) || ..., but is potentially less confusing than using the || or or operators, and may be easier to code than repeating the atomic_compare, $a, $b.
atomic_compare(...) || atomic_compare(...) || ...
||
or
override_eq: Prevents the function returning 0 immediately when the strings are identical. (The arguments are ordinarily tested for string equality at the beginning, in order to prevent unnecessary processing.) Setting this flag is only necessary if you have a condition which forces $a or $b to win for certain strings which are equal.
override_eq
functions: In <complex_compare>, an arrayref containing a hashref equivalent to the third value in atomic_compare. Each function performs a comparison which executes and returns 0, 1, or -1. If the result of any function is nonzero, that is the return value. If it is zero, the next function is tried.
functions
complex_compare ($a, $b, { functions => [ { n => sub{lc $_[0];}, #is \&lc possible? }, { n => sub{$_[0] =~ s/^the\s//; $_[0];}, t => [ { qr/[^[:alpha:]]/ => 0 }, ], f => 1, c => sub { $_[0] cmp $ [1] }, }, ] })
Performs a collation on an arrayref. Provide a) the data, b) a comparison function which returns 0 if the two comparands are duplicates, and c) a merge function which takes the first and later duplicate.
NB: This is slow, use either sorted_collate or schwartzian_collate unless you have a good reason for using this function (e.g. your comparison function is unstable and doesn't function like cmp). To discourage casual use, it is not exported.
sorted_collate
schwartzian_collate
Like naive_collate, but faster.
naive_collate
It is faster because rather than comparing every element against every other element that hasn't already been collated, it sorts once first, then compares against following elements until it finds one which doesn't match, then stops. The list is then returned to its original order (except without the duplicates).
Like sorted_collate, but uses a Schwartzian transform: after the comparison function, provide a normalisation function.
It is about as fast as the naive_collate, but can be several times faster when the normalisation function is complex.
Add lots more dwimmery so that qr// expressions can be used wherever they're likely to be useful and coderefs can be substituted in for regexes where their function is to test.
qr//
Make priority lists a bit easier, e.g. by allowing regexes - or even strings - amongst the hashrefs in t.
Write some good, sensible examples for complex_compare.
Gather useful common comparison functions which can be imported, studied, borrowed, etc. and offer them as a module.
Write more test cases.
Possibly remove assumptions about the comparanda, i.e. permit comparison of objects, references, etc. (But then: how does tokenisation work? Maybe it only works if n=>sub{$_[0]->toString}? Wouldn't we want to compare hashrefs and arrayrefs more intelligently?)
n=>sub{$_[0]->toString}
Figure out how to get atomic_compare and complex_compare to work with Schwartzian transforms.
Please use the Github issues tracker.
Copyright 2011-2012 © Cambridge University Press. This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
To install WordLists::Tag, copy and paste the appropriate command in to your terminal.
cpanm
cpanm WordLists::Tag
CPAN shell
perl -MCPAN -e shell install WordLists::Tag
For more information on module installation, please visit the detailed CPAN module installation guide.