NAME
Sort::DataTypes - Sort a list of data using methods relevant to the type
of data
SYNOPSIS
use Sort::DataTypes qw(:all);
DESCRIPTION
This module allows you to sort a list of data elements using methods
that are relevant to the type of data contained in the list. This
modules does not attempt to be the fastest sorter on the block. If you
are sorting thousands of elements and need a lot of speed, you should
refer to a module specializing in the specific type of sort you will be
doing. However, to do smaller sorts of different types of data, this is
the module to use.
TYPES OF SORT METHODS
When sorting a list of elements, the elements are taken two at a time
and compared using some comparison function or operator. For example, if
you are comparing two strings alphabetically, the perl 'cmp' operator is
used to compare two strings. The power of this module is that the
comparison can be done in a way that is relevant to the type of data
(for example, when comparing dates, it can determine which is earlier,
or when sorting a list of IP numbers, it knows how to compare two
different IPs).
There are serveral types of sort methods which determine how the
comparison will be done:
Unambiguous Methods
Unambiguous sort methods are those which unambiguously determine the
order of two elements in all cases.
As an example, an alphabetic sort is unambiguous. It takes the
entire string of both elements and compares them and the order is
well defined.
Ambiguous Methods
Ambiguous methods are methods which compare the content of the two
elements but are not able to determine the relative order in all
situations. In these situations, additional sort methods may be used
to refine the comparison.
As an example, if you sort strings by length, there is an
unambiguous order when comparing a string of length 3 to one of
length 4, but if you have two strings of the same length, a
secondary sort method may be used to determine the order of these
elements.
Split-Element Methods
Split-element methods are used to split an element into pieces, and
two different elements are compared by comparing the individual
pieces.
As an example, if you are sorting domain names, you would first
split the domain name into a list of subdomains (i.e. foo.bar.com
contains the subdomains foo, bar, and com) and then each subdomain
is sorted separately.
These elements require at least three pieces of information. 1) They
need information on how to split the element into pieces. 2) They
need to know whether the pieces are left most significant (LMS) or
right most significant (RMS). In other words, whether to sort the
pieces from left to right or right to left. 3) They need sorting
information about how to compare individual pieces of two elements.
Partial-Element Methods
Partial-element methods are methods which work with only a portion
of the element. These require two types of information. 1) They
require some information about what portion of the element to sort
on. 2) They requires information about how to compare those
subelements.
As an example, you might sort a list of lines of text based on the
Nth field in each line. So the first information required will be
used to determine how to find the Nth field. The second information
will be the actual sort method to use for ordering those fields.
USING SORT ROUTINES
All sort routines are named sort_METHOD where METHOD is the name of the
method. All sort_METHOD have both a forward and reverse sort:
$flag = sort_METHOD(\@list [,@args]);
$flag = sort_rev_METHOD(\@list [,@args]);
where @args are any additional arguments used for the sort. These will
be described below.
Corresponding to every sort_METHOD routine is a cmp_METHOD routine which
takes two elements and returns a -1, 0, or 1 (similar to the cmp or <=>
operators).
$flag = cmp_METHOD($x,$y [,@args]);
$flag = cmp_rev_METHOD($x,$y [,@args]);
Finally, there is an alternate way to do the sort/comparison:
$flag = sort_by_method('METHOD', \@list [,@args]);
$flag = sort_by_method('rev_METHOD',\@list [,@args]);
$flag = cmp_by_method('METHOD', \@list [,@args]);
$flag = cmp_by_method('rev_METHOD',\@list [,@args]);
As an example, the following two calls are identical:
$flag = sort_alphabetic(\@list);
$flag = sort_by_method('alphabetic',\@list);
The value of $flag for a sort method is undef (if there is an error) or
1 if the sort succeeds (and in this case, @list has been reordered to be
in the sorted order). The value of $flag for a cmp method is undef (if
there is an error) or -1, 0, or 1.
The contents of @args depends on the type of sort method and are
described in the sections below.
UNAMBIGUOUS METHODS
As described above, unambiguous methods do not use any secondary sort
methods. For each of these sort methods, the contents of @args are:
@args = (@method_args, $hash)
and for cmp methods, the contents of @args are:
@args = (@method_args)
@method_args is a list of arguments that will be passed to the method.
Most unambiguous methods do not require any additional arguments, but if
they do, they would be here. The list of possible arguments are
described in the documentation for each method.
$hash is an optional hash reference. All sort_METHOD functions can be
used to sort a list using a hash. For example, in the following case:
@list = qw(foo bar ick);
%hash = ( foo => 3, bar => 5, ick => 1 );
sort_numerical(\@list,\%hash);
would result in @list containing:
(ick, foo, bar)
since those correspond to numerical values of (1,3,5) respectively.
Each element in @list must be a key in %hash, and the value of that key
must be of the appropriate type.
The following methods are supported:
sort_numerical
sort_rev_numerical
cmp_numerical
cmp_rev_numerical
use Sort::DataTypes qw(:all)
$flag = sort_numerical(\@list [,@args]);
$flag = sort_rev_numerical(\@list [,@args]);
$flag = cmp_numerical($x,$y [,@args]);
$flag = cmp_rev_numerical($x,$y [,@args]);
These sort/compare numbers. There is little reason to use any of
these routines since it would be more efficient to simply call sort
as:
sort { $a <=> $b } @list
but they are included for the sake of completeness, and to make them
available for use by the sort_by_method and cmp_by_method routines.
sort_alphabetic
sort_rev_alphabetic
cmp_alphabetic
cmp_rev_alphabetic
use Sort::DataTypes qw(:all)
$flag = sort_alphabetic(\@list [,@args]);
$flag = sort_rev_alphabetic(\@list [,@args]);
$flag = cmp_alphabetic($x,$y [,@args]);
$flag = cmp_rev_alphabetic($x,$y [,@args]);
These sort/compare strings alphabetically. As with numerical sorts,
there is little reason to call these, and they are included for the
sake of completeness.
sort_alphanum
sort_rev_alphanum
cmp_alphanum
cmp_rev_alphanum
use Sort::DataTypes qw(:all)
$flag = sort_alphanum(\@list [,@args]);
$flag = sort_rev_alphanum(\@list [,@args]);
$flag = cmp_alphanum($x,$y [,@args]);
$flag = cmp_rev_alphanum($x,$y [,@args]);
These do numeric sort/comparison if two elements are numeric
(integer or real) and alphabetic sorts otherwise.
sort_random
sort_rev_random
cmp_random
cmp_rev_random
use Sort::DataTypes qw(:all)
$flag = sort_random(\@list [,@args]);
$flag = sort_rev_random(\@list [,@args]);
$flag = cmp_random($x,$y [,@args]);
$flag = cmp_rev_random($x,$y [,@args]);
This randomly shuffles an array in place.
The sort_random and sort_rev_random routines are identical, and are
included simply for the situation where the sort routines are being
called in some automatically generated code that may add the 'rev_'
prefix.
The cmp_random and cmp_rev_random routines simply returns a random
-1, 0, or 1.
sort_version
sort_rev_version
cmp_version
cmp_rev_version
use Sort::DataTypes qw(:all)
$flag = sort_version(\@list [,@args]);
$flag = sort_rev_version(\@list [,@args]);
$flag = cmp_version($x,$y [,@args]);
$flag = cmp_rev_version($x,$y [,@args]);
These sort a list of version numbers of the form
MAJOR.MINOR.SUBMINOR ... (any number of levels are allowed). The
following examples should illustrate the ordering:
1.1.x < 1.2 < 1.2.x Numerical versions are compared first at
the highest level, then at the next highest,
etc. The first non-equal compare sets the
order.
1.a < 1.b Alphanumeric levels that start with a letter
are compared alphabetically.
1.2a < 1.2 < 1.03a Alphanumeric levels that start with a number
are first compared numerically with only the
numeric part. If they are equal, alphanumeric
levels come before purely numerical levels.
Otherwise, they are compared alphabetically.
1.a < 1.2a An alphanumeric level that starts with a letter
comes before one that starts with a number.
1.01a < 1.1a Two alphanumeric levels that are numerically
equal in the number part and equal in the
remaining part are compared alphabetically.
sort_date
sort_rev_date
cmp_date
cmp_rev_date
use Sort::DataTypes qw(:all)
$flag = sort_date(\@list [,@args]);
$flag = sort_rev_date(\@list [,@args]);
$flag = cmp_date($x,$y [,@args]);
$flag = cmp_rev_date($x,$y [,@args]);
These sort/compare a list of dates. Dates are anything that can be
parsed with Date::Manip.
It should be noted that the dates will only be parsed a single time,
so it is not necessary to pre-parse them for performance reasons.
sort_ip
sort_rev_ip
cmp_ip
cmp_rev_ip
use Sort::DataTypes qw(:all)
$flag = sort_ip(\@list [,@args]);
$flag = sort_rev_ip(\@list [,@args]);
$flag = cmp_ip($x,$y [,@args]);
$flag = cmp_rev_ip($x,$y [,@args]);
These sort/compare IP numbers. Each value can be a pure IP (in the
form A.B.C.D) or a CIDR notation which includes the netmask
(A.B.C.D/MASK).
When comparing CIDR representations, if the IP part of two elements
is identical, the following two rules are used:
an element without a mask comes before one that has a mask
two elements with masks are sorted by mask
So the following elements are in sorted order:
10.20.30.40 < 10.20.30.40/4 < 10.20.30.40/16
sort_nosort
sort_rev_nosort
cmp_nosort
cmp_rev_nosort
use Sort::DataTypes qw(:all)
$flag = sort_nosort(\@list [,@args]);
$flag = sort_rev_nosort(\@list [,@args]);
$flag = cmp_nosort($x,$y [,@args]);
$flag = cmp_rev_nosort($x,$y [,@args]);
These leave the list unchanged. This primarily useful as an
alternative sort method if you do not wish to sort beyond a method
that is ambiguous.
sort_function
sort_rev_function
cmp_function
cmp_rev_function
use Sort::DataTypes qw(:all)
$flag = sort_function(\@list [,@args]);
$flag = sort_rev_function(\@list [,@args]);
$flag = cmp_function($x,$y [,@args]);
$flag = cmp_rev_function($x,$y [,@args]);
This is a catch-all sort function. @method_args contains a single
argument. It is either a coderef or the name of a function suitable
to compar two elements and return -1, 0, or 1 depending on the order
of the elements.
The following both work:
$flag = sort_function(\@list,\&somefunc);
$flag = sort_function(\@list,"somefunc");
If the function is passed in by name, it must be in the calling
programs namespace OR it must be passed in as a fully specified
function name including package (i.e. "package::functionname").
AMBIGUOUS METHODS
As described above, ambiguous methods do use a secondary sort methods.
For these sort methods, the contents of @args are:
@args = (@method_args, $hash, @extra_cmp_info)
and for cmp methods, the contents of @args are:
@args = (@method_args, @extra_cmp_info)
@method_args and $hash are similar to those described above for
unambiguous methods.
The contents of @extra_cmp_info are:
@extra_cmp_info = ( [$method, @method_args],
[$method, @method_args],
...
)
Since an ambiguous method cannot always determine the order of two
elements, a backup method (or methods) may be specified. The backup sort
method contains a method name ($method) and any arguments required for
that method. The method must be either ambiguous or unambiguous. If it
is ambiguous, an additional backup method may be used. If a method is
unambiguous, no additional sort methods should be included.
If a backup method is not supplied for an ambiguous method, a default
method will be used (typically alphabetic).
For the example where you sort strings by length, if you want to sort
all elements of the same length randomnly, you could use the following
sort:
sort_length(\@list, ['random']);
The following methods are supported:
sort_length
sort_rev_length
cmp_length
cmp_rev_length
use Sort::DataTypes qw(:all)
$flag = sort_length(\@list [,@args]);
$flag = sort_rev_length(\@list [,@args]);
$flag = cmp_length($x,$y [,@args]);
$flag = cmp_rev_length($x,$y [,@args]);
These take strings and compare them by length. If they are the same
length, it sorts them by a secondary method (which defaults to
'alphabetic').
SPLIT-ELEMENT METHODS
As described above, split-element methods split an element into pieces,
and each of the pieces are compared separately using a secondary sort
method.
For these sort methods, the contents of @args are:
@args = (@method_args, $hash, @extra_sort_info)
and for cmp methods, the contents of @args are:
@args = (@method_args, @extra_cmp_info)
@method_args and $hash are similar to those described for unambiguous
methods.
A split-element method is not truly a sort method. It is simply a method
for splitting an element into parts. Then, every part must be sorted.
As such, every split-element method will use other sort methods for
actually sorting the pieces. If no @extra_sort_info or @extra_cmp_info
is supplied, it will typically default to alphabetic sort.
If other sort methods are supplied, any other ambiguous, or unambiguous
method may be supplied.
It should be understood that all pieces are compared using the same sort
methods. In other words, you cannot split an element into pieces and
compare the first set alphabetically, the second numerically, and the
third as dates. To do this, you have to use the partial-element methods
described next.
Another note is that if a piece is empty in one element and not in the
other, the empty one will sort before the filled one (unless a reverse
sort is being done).
Once the element is split into pieces, they may be compared starting at
the leftmost piece:
a:b:c < a:c:d
or starting at the rightmost piece:
c:b:a < a:b:c
It should be noted that if an element is missing a piece, it will always
come BEFORE an element that has the piece (unless it's a reverse sort in
which case it will come after.
As an example, if you are sorting strings containing colon separated
pieces, the following order will be used:
a::c < a:c:d
since the second piece is missing in the first element. Likewise:
a:b < a:b:c
since the third piece is missing in the first element.
The following split-element methods exist:
sort_split
sort_rev_split
cmp_split
cmp_rev_split
use Sort::DataTypes qw(:all)
$flag = sort_split(\@list [,@args]);
$flag = sort_rev_split(\@list [,@args]);
$flag = cmp_split($x,$y [,@args]);
$flag = cmp_rev_split($x,$y [,@args]);
The @method_args segments of the arguments contain two optional
arguments.
The first argument is either 'lms' or 'rms' (all options are case
sensitive, so they must be entered lowercase). If 'lms' is given,
pieces are sorted starting at the left. If 'rms' is given, they are
sorted from the right. 'lms' is the default.
The second argument is a regexp. It can be passed in as a string
that will be turned into a regular expression, or as an actaul
regexp, so one argument could be either of:
\s+
qr/\s+/
If no regexp is passed in, it defaults to
qr/\s+/
The following functions are also included for backward compatibility
with previous versions of this module.
These are deprecated, and may be removed at some point in the future.
These can all be done trivially with the split functions listed above
(and all are coded as wrappers around those functions), so slightly
better performance can be obtained by using the split functions
directly.
sort_domain
sort_rev_domain
cmp_domain
cmp_rev_domain
use Sort::DataTypes qw(:all)
$flag = sort_domain(\@list [,@args]);
$flag = sort_rev_domain(\@list [,@args]);
$flag = cmp_domain($x,$y [,@args]);
$flag = cmp_rev_domain($x,$y [,@args]);
Domain sorting is equivalent to split-element sorting with the
priority of 'rms' and a regular expression of qr/\./ . In other
words, the following are equivalent:
$flag = sort_domain(\@list);
$flag = sort_split(\@list,'rms',qr/\./);
A single argument can be passed in in @method_args containing an
alternate regular expression if the elements should be split on
something other than dots, but the priority will always be 'rms'.
Since the most significant subvalue in the domain is at the right,
any domain ending with ".com" would come before any domain ending in
".edu".
a.b < z.b < a.bb < z.bb < a.c
sort_numdomain
sort_rev_numdomain
cmp_numdomain
cmp_rev_numdomain
use Sort::DataTypes qw(:all)
$flag = sort_numdomain(\@list [,@args]);
$flag = sort_rev_numdomain(\@list [,@args]);
$flag = cmp_numdomain($x,$y [,@args]);
$flag = cmp_rev_numdomain($x,$y [,@args]);
A related type of sorting is numdomain sorting. This is identical to
domain sorting except that if two elements in the domain are
numerical, numerical sorts will be done. So:
a.2.c < a.11.c
It should be noted that if a field may be either numeric or
alphanumeric, sorting with this method may yield unexpected results.
For example, sorting the three elements:
a.1.b
a.2.b
a.X.b
will use numeric comparisons when comparing the 2nd field of the
first and second elements, but it will use alphabetic comparisons
when comparing the first and third elements (or the second and third
elements).
sort_path
sort_rev_path
cmp_path
cmp_rev_path
use Sort::DataTypes qw(:all)
$flag = sort_path(\@list [,@args]);
$flag = sort_rev_path(\@list [,@args]);
$flag = cmp_path($x,$y [,@args]);
$flag = cmp_rev_path($x,$y [,@args]);
Path sorting is equivalent to split-element sorting with the
priority of 'lms' and a regular expression of qr/\// . In other
words, the following are equivalent:
$flag = sort_path(\@list);
$flag = sort_split(\@list,'lms',qr/\//);
A single argument can be passed in in @method_args containing an
alternate regular expression if the elements should be split on
something other than slashes, but the priority will always be 'lms'.
Since the most significant element in the domain is at the left, you
get the following behavior:
a/b < a/z < aa/b < aa/z < b/b
When sorting lists that have a mixture of relative paths and
explicit paths, the explicit paths will come first. So:
/b/c < a/b
sort_numpath
sort_rev_numpath
cmp_numpath
cmp_rev_numpath
use Sort::DataTypes qw(:all)
$flag = sort_numpath(\@list [,@args]);
$flag = sort_rev_numpath(\@list [,@args]);
$flag = cmp_numpath($x,$y [,@args]);
$flag = cmp_rev_numpath($x,$y [,@args]);
A related type of sorting is numpath sorting. This is identical to
path sorting except that if two elements in the path are numbers,
numerical sorts will be done. So:
a/2/c < a/11/c
PARTIAL-ELEMENT METHODS
Partial-element sorting is, as described above, to split the element
into fields and then compare based on the Nth field. In addition, you
are allowed to sort one field in one way, and a second field in an
entirely different way.
For example, you could sort lines of the format:
2010-01-30 Smith John
2010-01-30 Smith Adam
first by date (the 1st field), alphabetically by last name (2nd field),
and alphabetically by first name (3rd field).
For these sort/cmp methods, the contents of @args are:
@args = ( $sep, [@field_args], [@field_args], ...)
$sep is a regular expression used to split an element into fields. It
can be entered as either a regular expression or a string that is turned
into a regular expression:
qr/\s+/
\s+
It is optional, and defaults to qr/\s+/ (i.e. split on whitespace).
@field_args describes how to sort one of the fields. It is of the form:
@field_args = ( $n, $hash, @extra_cmp_info )
where $n is an integer and tells which field to sort (fields start at
0), $hash is an optional hashref to use for this field (it's keys are
the values of the field, NOT the values of the element), and
@extra_cmp_info is described in the ambiguous methods section above:
@extra_cmp_info = ( [$method, @method_args],
[$method, @method_args],
...
)
Sort methods must be either ambiguous or unambiguous.
To sort the above example (by date, last name, and first name), you
could use:
$flag = sort_partial(\@list, qr/\s+/, [1, ['date']],
[2, ['alphabetic']],
[3, ['alphabetic']]);
sort_partial
sort_rev_partial
cmp_partial
cmp_rev_partial
use Sort::DataTypes qw(:all)
$flag = sort_partial(\@list [,@args]);
$flag = sort_rev_partial(\@list [,@args]);
$flag = cmp_partial($x,$y [,@args]);
$flag = cmp_rev_partial($x,$y [,@args]);
This is the basic partial-element sort routine.
The following functions are also included for backward compatibility
with previous versions of this module.
These are deprecated, and may be removed at some point in the future.
These can all be done trivially with the partial functions listed above
(and all are coded as wrappers around those functions), so slightly
better performance can be obtained by using the split functions
directly.
sort_line
sort_rev_line
cmp_line
cmp_rev_line
use Sort::DataTypes qw(:all)
$flag = sort_line(\@list,$n [,$sep,] [,\%hash]);
$flag = sort_rev_line(\@list,$n [,$sep] [,\%hash]);
$flag = cmp_line($x,$y,$n [,$sep]);
$flag = cmp_rev_line($x,$y,$n [,$sep]);
These take a list of lines and sort on the Nth field using $sep as
the regular expression splitting the lines into fields. Fields are
numbered starting at 0. If no $sep is given, it defaults to white
space.
This is included for backward compatibility only and does not allow
sorting on more than one field, or specifying the sort method for
that field. It is recommended that you use the partial methods
above.
sort_numline
sort_rev_numline
cmp_numline
cmp_rev_numline
use Sort::DataTypes qw(:all)
$flag = sort_numline(\@list,$n [,$sep,] [,\%hash]);
$flag = sort_rev_numline(\@list,$n [,$sep] [,\%hash]);
$flag = cmp_numline($x,$y,$n [,$sep]);
$flag = cmp_rev_numline($x,$y,$n [,$sep]);
These are similar but will sort numerically if the Nth field is
numerical, and alphabetically otherwise.
MISC. ROUTINES
sort_valid_method
cmp_valid_method
use Sort::DataTypes qw(:all)
$flag = sort_valid_method($string);
$flag = cmp_valid_method($string);
These are identical and return 1 if there is a valid sort method
named $string in the module. For example, there is a function
"sort_numerical" defined in this modules, but there is no function
"sort_foobar", so the following would occur:
sort_valid_method("numerical")
=> 1
sort_valid_method("rev_numerical")
=> 1
sort_valid_method("foobar")
=> 0
Note that the methods must NOT include the "sort_" or "cmp_" prefix,
but the "rev_" prefix is allowed as shown in the example.
sort_by_method
cmp_by_method
use Sort::DataTypes qw(:all)
$flag = sort_by_method($method,\@list [,@args]);
$flag = cmp_by_method ($method,$ele1,$ele2 [,@args]);
These sort a list, or compare two elements, using the given method
(which is any string which returns 1 when passed to
sort_valid_method).
If the method is not valid, the list is left untouched.
BACKWARDS INCOMPATIBILITIES
The following are a list of backwards incompatibilities.
Version 2.00 handling of hashes
In version 1.xx, when sorting by hash, the hash was passed in as the
hash. As of 2.00, it is passed in by reference to avoid any
confusion with optional arguments.
KNOWN PROBLEMS
None at this point.
LICENSE
This script is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.
AUTHOR
Sullivan Beck (sbeck@cpan.org)