The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Sort::DataTypes - Sort a list of data using methods relevant to the type of data

SYNOPSIS

   use Sort::DataTypes qw(:all);

DESCRIPTION

This module allows you to sort a list of data elements using methods that are relevant to the type of data contained in the list. This modules does not attempt to be the fastest sorter on the block. If you are sorting thousands of elements and need a lot of speed, you should refer to a module specializing in the specific type of sort you will be doing. However, to do smaller sorts of different types of data, this is the module to use.

TYPES OF SORT METHODS

When sorting a list of elements, the elements are taken two at a time and compared using some comparison function or operator. For example, if you are comparing two strings alphabetically, the perl 'cmp' operator is used to compare two strings. The power of this module is that the comparison can be done in a way that is relevant to the type of data (for example, when comparing dates, it can determine which is earlier, or when sorting a list of IP numbers, it knows how to compare two different IPs).

There are serveral types of sort methods which determine how the comparison will be done:

Unambiguous Methods

Unambiguous sort methods are those which unambiguously determine the order of two elements in all cases.

As an example, an alphabetic sort is unambiguous. It takes the entire string of both elements and compares them and the order is well defined.

Ambiguous Methods

Ambiguous methods are methods which compare the content of the two elements but are not able to determine the relative order in all situations. In these situations, additional sort methods may be used to refine the comparison.

As an example, if you sort strings by length, there is an unambiguous order when comparing a string of length 3 to one of length 4, but if you have two strings of the same length, a secondary sort method may be used to determine the order of these elements.

Split-Element Methods

Split-element methods are used to split an element into pieces, and two different elements are compared by comparing the individual pieces.

As an example, if you are sorting domain names, you would first split the domain name into a list of subdomains (i.e. foo.bar.com contains the subdomains foo, bar, and com) and then each subdomain is sorted separately.

These elements require at least three pieces of information. 1) They need information on how to split the element into pieces. 2) They need to know whether the pieces are left most significant (LMS) or right most significant (RMS). In other words, whether to sort the pieces from left to right or right to left. 3) They need sorting information about how to compare individual pieces of two elements.

Partial-Element Methods

Partial-element methods are methods which work with only a portion of the element. These require two types of information. 1) They require some information about what portion of the element to sort on. 2) They requires information about how to compare those subelements.

As an example, you might sort a list of lines of text based on the Nth field in each line. So the first information required will be used to determine how to find the Nth field. The second information will be the actual sort method to use for ordering those fields.

USING SORT ROUTINES

All sort routines are named sort_METHOD where METHOD is the name of the method. All sort_METHOD have both a forward and reverse sort:

   $flag = sort_METHOD(\@list [,@args]);
   $flag = sort_rev_METHOD(\@list [,@args]);

where @args are any additional arguments used for the sort. These will be described below.

Corresponding to every sort_METHOD routine is a cmp_METHOD routine which takes two elements and returns a -1, 0, or 1 (similar to the cmp or <=> operators).

   $flag = cmp_METHOD($x,$y [,@args]);
   $flag = cmp_rev_METHOD($x,$y [,@args]);

Finally, there is an alternate way to do the sort/comparison:

   $flag = sort_by_method('METHOD',    \@list [,@args]);
   $flag = sort_by_method('rev_METHOD',\@list [,@args]);

   $flag = cmp_by_method('METHOD',    \@list [,@args]);
   $flag = cmp_by_method('rev_METHOD',\@list [,@args]);

As an example, the following two calls are identical:

   $flag = sort_alphabetic(\@list);
   $flag = sort_by_method('alphabetic',\@list);

The value of $flag for a sort method is undef (if there is an error) or 1 if the sort succeeds (and in this case, @list has been reordered to be in the sorted order). The value of $flag for a cmp method is undef (if there is an error) or -1, 0, or 1.

The contents of @args depends on the type of sort method and are described in the sections below.

UNAMBIGUOUS METHODS

As described above, unambiguous methods do not use any secondary sort methods. For each of these sort methods, the contents of @args are:

   @args = (@method_args, $hash)

and for cmp methods, the contents of @args are:

   @args = (@method_args)

@method_args is a list of arguments that will be passed to the method. Most unambiguous methods do not require any additional arguments, but if they do, they would be here. The list of possible arguments are described in the documentation for each method.

$hash is an optional hash reference. All sort_METHOD functions can be used to sort a list using a hash. For example, in the following case:

   @list = qw(foo bar ick);
   %hash = ( foo => 3, bar => 5, ick => 1 );

   sort_numerical(\@list,\%hash);

would result in @list containing:

   (ick, foo, bar)

since those correspond to numerical values of (1,3,5) respectively.

Each element in @list must be a key in %hash, and the value of that key must be of the appropriate type.

The following methods are supported:

sort_numerical
sort_rev_numerical
cmp_numerical
cmp_rev_numerical
   use Sort::DataTypes qw(:all)

   $flag = sort_numerical(\@list [,@args]);
   $flag = sort_rev_numerical(\@list [,@args]);

   $flag = cmp_numerical($x,$y [,@args]);
   $flag = cmp_rev_numerical($x,$y [,@args]);

These sort/compare numbers. There is little reason to use any of these routines since it would be more efficient to simply call sort as:

   sort { $a <=> $b } @list

but they are included for the sake of completeness, and to make them available for use by the sort_by_method and cmp_by_method routines.

sort_alphabetic
sort_rev_alphabetic
cmp_alphabetic
cmp_rev_alphabetic
   use Sort::DataTypes qw(:all)

   $flag = sort_alphabetic(\@list [,@args]);
   $flag = sort_rev_alphabetic(\@list [,@args]);

   $flag = cmp_alphabetic($x,$y [,@args]);
   $flag = cmp_rev_alphabetic($x,$y [,@args]);

These sort/compare strings alphabetically. As with numerical sorts, there is little reason to call these, and they are included for the sake of completeness.

sort_alphanum
sort_rev_alphanum
cmp_alphanum
cmp_rev_alphanum
   use Sort::DataTypes qw(:all)

   $flag = sort_alphanum(\@list [,@args]);
   $flag = sort_rev_alphanum(\@list [,@args]);

   $flag = cmp_alphanum($x,$y [,@args]);
   $flag = cmp_rev_alphanum($x,$y [,@args]);

These do numeric sort/comparison if two elements are numeric (integer or real) and alphabetic sorts otherwise.

sort_random
sort_rev_random
cmp_random
cmp_rev_random
   use Sort::DataTypes qw(:all)

   $flag = sort_random(\@list [,@args]);
   $flag = sort_rev_random(\@list [,@args]);

   $flag = cmp_random($x,$y [,@args]);
   $flag = cmp_rev_random($x,$y [,@args]);

This randomly shuffles an array in place.

The sort_random and sort_rev_random routines are identical, and are included simply for the situation where the sort routines are being called in some automatically generated code that may add the 'rev_' prefix.

The cmp_random and cmp_rev_random routines simply returns a random -1, 0, or 1.

sort_version
sort_rev_version
cmp_version
cmp_rev_version
   use Sort::DataTypes qw(:all)

   $flag = sort_version(\@list [,@args]);
   $flag = sort_rev_version(\@list [,@args]);

   $flag = cmp_version($x,$y [,@args]);
   $flag = cmp_rev_version($x,$y [,@args]);

These sort a list of version numbers of the form MAJOR.MINOR.SUBMINOR ... (any number of levels are allowed). The following examples should illustrate the ordering:

   1.1.x < 1.2 < 1.2.x  Numerical versions are compared first at
                        the highest level, then at the next highest,
                        etc. The first non-equal compare sets the
                        order.
   1.a < 1.b            Alphanumeric levels that start with a letter
                        are compared alphabetically.
   1.2a < 1.2 < 1.03a   Alphanumeric levels that start with a number
                        are first compared numerically with only the
                        numeric part. If they are equal, alphanumeric
                        levels come before purely numerical levels.
                        Otherwise, they are compared alphabetically.
   1.a < 1.2a           An alphanumeric level that starts with a letter
                        comes before one that starts with a number.
   1.01a < 1.1a         Two alphanumeric levels that are numerically
                        equal in the number part and equal in the
                        remaining part are compared alphabetically.
sort_date
sort_rev_date
cmp_date
cmp_rev_date
   use Sort::DataTypes qw(:all)

   $flag = sort_date(\@list [,@args]);
   $flag = sort_rev_date(\@list [,@args]);

   $flag = cmp_date($x,$y [,@args]);
   $flag = cmp_rev_date($x,$y [,@args]);

These sort/compare a list of dates. Dates are anything that can be parsed with Date::Manip.

It should be noted that the dates will only be parsed a single time, so it is not necessary to pre-parse them for performance reasons.

sort_ip
sort_rev_ip
cmp_ip
cmp_rev_ip
   use Sort::DataTypes qw(:all)

   $flag = sort_ip(\@list [,@args]);
   $flag = sort_rev_ip(\@list [,@args]);

   $flag = cmp_ip($x,$y [,@args]);
   $flag = cmp_rev_ip($x,$y [,@args]);

These sort/compare IP numbers. Each value can be a pure IP (in the form A.B.C.D) or a CIDR notation which includes the netmask (A.B.C.D/MASK).

When comparing CIDR representations, if the IP part of two elements is identical, the following two rules are used:

   an element without a mask comes before one that has a mask

   two elements with masks are sorted by mask

So the following elements are in sorted order:

   10.20.30.40 < 10.20.30.40/4 < 10.20.30.40/16
sort_nosort
sort_rev_nosort
cmp_nosort
cmp_rev_nosort
   use Sort::DataTypes qw(:all)

   $flag = sort_nosort(\@list [,@args]);
   $flag = sort_rev_nosort(\@list [,@args]);

   $flag = cmp_nosort($x,$y [,@args]);
   $flag = cmp_rev_nosort($x,$y [,@args]);

These leave the list unchanged. This primarily useful as an alternative sort method if you do not wish to sort beyond a method that is ambiguous.

sort_function
sort_rev_function
cmp_function
cmp_rev_function
   use Sort::DataTypes qw(:all)

   $flag = sort_function(\@list [,@args]);
   $flag = sort_rev_function(\@list [,@args]);

   $flag = cmp_function($x,$y [,@args]);
   $flag = cmp_rev_function($x,$y [,@args]);

This is a catch-all sort function. @method_args contains a single argument. It is either a coderef or the name of a function suitable to compar two elements and return -1, 0, or 1 depending on the order of the elements.

The following both work:

   $flag = sort_function(\@list,\&somefunc);
   $flag = sort_function(\@list,"somefunc");

If the function is passed in by name, it must be in the calling programs namespace OR it must be passed in as a fully specified function name including package (i.e. "package::functionname").

AMBIGUOUS METHODS

As described above, ambiguous methods do use a secondary sort methods. For these sort methods, the contents of @args are:

   @args = (@method_args, $hash, @extra_cmp_info)

and for cmp methods, the contents of @args are:

   @args = (@method_args, @extra_cmp_info)

@method_args and $hash are similar to those described above for unambiguous methods.

The contents of @extra_cmp_info are:

   @extra_cmp_info  = ( [$method, @method_args],
                        [$method, @method_args],
                        ...
                      )

Since an ambiguous method cannot always determine the order of two elements, a backup method (or methods) may be specified. The backup sort method contains a method name ($method) and any arguments required for that method. The method must be either ambiguous or unambiguous. If it is ambiguous, an additional backup method may be used. If a method is unambiguous, no additional sort methods should be included.

If a backup method is not supplied for an ambiguous method, a default method will be used (typically alphabetic).

For the example where you sort strings by length, if you want to sort all elements of the same length randomnly, you could use the following sort:

   sort_length(\@list, ['random']);

The following methods are supported:

sort_length
sort_rev_length
cmp_length
cmp_rev_length
   use Sort::DataTypes qw(:all)

   $flag = sort_length(\@list [,@args]);
   $flag = sort_rev_length(\@list [,@args]);

   $flag = cmp_length($x,$y [,@args]);
   $flag = cmp_rev_length($x,$y [,@args]);

These take strings and compare them by length. If they are the same length, it sorts them by a secondary method (which defaults to 'alphabetic').

SPLIT-ELEMENT METHODS

As described above, split-element methods split an element into pieces, and each of the pieces are compared separately using a secondary sort method.

For these sort methods, the contents of @args are:

   @args = (@method_args, $hash, @extra_sort_info)

and for cmp methods, the contents of @args are:

   @args = (@method_args, @extra_cmp_info)

@method_args and $hash are similar to those described for unambiguous methods.

A split-element method is not truly a sort method. It is simply a method for splitting an element into parts. Then, every part must be sorted.

As such, every split-element method will use other sort methods for actually sorting the pieces. If no @extra_sort_info or @extra_cmp_info is supplied, it will typically default to alphabetic sort.

If other sort methods are supplied, any other ambiguous, or unambiguous method may be supplied.

It should be understood that all pieces are compared using the same sort methods. In other words, you cannot split an element into pieces and compare the first set alphabetically, the second numerically, and the third as dates. To do this, you have to use the partial-element methods described next.

Another note is that if a piece is empty in one element and not in the other, the empty one will sort before the filled one (unless a reverse sort is being done).

Once the element is split into pieces, they may be compared starting at the leftmost piece:

  a:b:c < a:c:d

or starting at the rightmost piece:

  c:b:a < a:b:c

It should be noted that if an element is missing a piece, it will always come BEFORE an element that has the piece (unless it's a reverse sort in which case it will come after.

As an example, if you are sorting strings containing colon separated pieces, the following order will be used:

   a::c < a:c:d

since the second piece is missing in the first element. Likewise:

   a:b < a:b:c

since the third piece is missing in the first element.

The following split-element methods exist:

sort_split
sort_rev_split
cmp_split
cmp_rev_split
   use Sort::DataTypes qw(:all)

   $flag = sort_split(\@list [,@args]);
   $flag = sort_rev_split(\@list [,@args]);

   $flag = cmp_split($x,$y [,@args]);
   $flag = cmp_rev_split($x,$y [,@args]);

The @method_args segments of the arguments contain two optional arguments.

The first argument is either 'lms' or 'rms' (all options are case sensitive, so they must be entered lowercase). If 'lms' is given, pieces are sorted starting at the left. If 'rms' is given, they are sorted from the right. 'lms' is the default.

The second argument is a regexp. It can be passed in as a string that will be turned into a regular expression, or as an actaul regexp, so one argument could be either of:

   \s+
   qr/\s+/

If no regexp is passed in, it defaults to

   qr/\s+/

The following functions are also included for backward compatibility with previous versions of this module.

These are deprecated, and may be removed at some point in the future.

These can all be done trivially with the split functions listed above (and all are coded as wrappers around those functions), so slightly better performance can be obtained by using the split functions directly.

sort_domain
sort_rev_domain
cmp_domain
cmp_rev_domain
   use Sort::DataTypes qw(:all)

   $flag = sort_domain(\@list [,@args]);
   $flag = sort_rev_domain(\@list [,@args]);

   $flag = cmp_domain($x,$y [,@args]);
   $flag = cmp_rev_domain($x,$y [,@args]);

Domain sorting is equivalent to split-element sorting with the priority of 'rms' and a regular expression of qr/\./ . In other words, the following are equivalent:

   $flag = sort_domain(\@list);
   $flag = sort_split(\@list,'rms',qr/\./);

A single argument can be passed in in @method_args containing an alternate regular expression if the elements should be split on something other than dots, but the priority will always be 'rms'.

Since the most significant subvalue in the domain is at the right, any domain ending with ".com" would come before any domain ending in ".edu".

   a.b < z.b < a.bb < z.bb < a.c
sort_numdomain
sort_rev_numdomain
cmp_numdomain
cmp_rev_numdomain
   use Sort::DataTypes qw(:all)

   $flag = sort_numdomain(\@list [,@args]);
   $flag = sort_rev_numdomain(\@list [,@args]);

   $flag = cmp_numdomain($x,$y [,@args]);
   $flag = cmp_rev_numdomain($x,$y [,@args]);

A related type of sorting is numdomain sorting. This is identical to domain sorting except that if two elements in the domain are numerical, numerical sorts will be done. So:

  a.2.c < a.11.c

It should be noted that if a field may be either numeric or alphanumeric, sorting with this method may yield unexpected results. For example, sorting the three elements:

  a.1.b
  a.2.b
  a.X.b

will use numeric comparisons when comparing the 2nd field of the first and second elements, but it will use alphabetic comparisons when comparing the first and third elements (or the second and third elements).

sort_path
sort_rev_path
cmp_path
cmp_rev_path
   use Sort::DataTypes qw(:all)

   $flag = sort_path(\@list [,@args]);
   $flag = sort_rev_path(\@list [,@args]);

   $flag = cmp_path($x,$y [,@args]);
   $flag = cmp_rev_path($x,$y [,@args]);

Path sorting is equivalent to split-element sorting with the priority of 'lms' and a regular expression of qr/\// . In other words, the following are equivalent:

   $flag = sort_path(\@list);
   $flag = sort_split(\@list,'lms',qr/\//);

A single argument can be passed in in @method_args containing an alternate regular expression if the elements should be split on something other than slashes, but the priority will always be 'lms'.

Since the most significant element in the domain is at the left, you get the following behavior:

   a/b < a/z < aa/b < aa/z < b/b

When sorting lists that have a mixture of relative paths and explicit paths, the explicit paths will come first. So:

   /b/c < a/b
sort_numpath
sort_rev_numpath
cmp_numpath
cmp_rev_numpath
   use Sort::DataTypes qw(:all)

   $flag = sort_numpath(\@list [,@args]);
   $flag = sort_rev_numpath(\@list [,@args]);

   $flag = cmp_numpath($x,$y [,@args]);
   $flag = cmp_rev_numpath($x,$y [,@args]);

A related type of sorting is numpath sorting. This is identical to path sorting except that if two elements in the path are numbers, numerical sorts will be done. So:

   a/2/c < a/11/c

PARTIAL-ELEMENT METHODS

Partial-element sorting is, as described above, to split the element into fields and then compare based on the Nth field. In addition, you are allowed to sort one field in one way, and a second field in an entirely different way.

For example, you could sort lines of the format:

   2010-01-30  Smith  John
   2010-01-30  Smith  Adam

first by date (the 1st field), alphabetically by last name (2nd field), and alphabetically by first name (3rd field).

For these sort/cmp methods, the contents of @args are:

   @args = ( $sep, [@field_args], [@field_args], ...)

$sep is a regular expression used to split an element into fields. It can be entered as either a regular expression or a string that is turned into a regular expression:

   qr/\s+/
   \s+

It is optional, and defaults to qr/\s+/ (i.e. split on whitespace).

@field_args describes how to sort one of the fields. It is of the form:

   @field_args = ( $n, $hash, @extra_cmp_info )

where $n is an integer and tells which field to sort (fields start at 0), $hash is an optional hashref to use for this field (it's keys are the values of the field, NOT the values of the element), and @extra_cmp_info is described in the ambiguous methods section above:

   @extra_cmp_info  = ( [$method, @method_args],
                        [$method, @method_args],
                        ...
                      )
Sort methods must be either ambiguous or unambiguous.

To sort the above example (by date, last name, and first name), you could use:

   $flag = sort_partial(\@list, qr/\s+/, [1, ['date']],
                                         [2, ['alphabetic']],
                                         [3, ['alphabetic']]);
sort_partial
sort_rev_partial
cmp_partial
cmp_rev_partial
   use Sort::DataTypes qw(:all)

   $flag = sort_partial(\@list [,@args]);
   $flag = sort_rev_partial(\@list [,@args]);

   $flag = cmp_partial($x,$y [,@args]);
   $flag = cmp_rev_partial($x,$y [,@args]);

This is the basic partial-element sort routine.

The following functions are also included for backward compatibility with previous versions of this module.

These are deprecated, and may be removed at some point in the future.

These can all be done trivially with the partial functions listed above (and all are coded as wrappers around those functions), so slightly better performance can be obtained by using the split functions directly.

sort_line
sort_rev_line
cmp_line
cmp_rev_line
   use Sort::DataTypes qw(:all)

   $flag = sort_line(\@list,$n [,$sep,] [,\%hash]);
   $flag = sort_rev_line(\@list,$n [,$sep] [,\%hash]);

   $flag = cmp_line($x,$y,$n [,$sep]);
   $flag = cmp_rev_line($x,$y,$n [,$sep]);

These take a list of lines and sort on the Nth field using $sep as the regular expression splitting the lines into fields. Fields are numbered starting at 0. If no $sep is given, it defaults to white space.

This is included for backward compatibility only and does not allow sorting on more than one field, or specifying the sort method for that field. It is recommended that you use the partial methods above.

sort_numline
sort_rev_numline
cmp_numline
cmp_rev_numline
   use Sort::DataTypes qw(:all)

   $flag = sort_numline(\@list,$n [,$sep,] [,\%hash]);
   $flag = sort_rev_numline(\@list,$n [,$sep] [,\%hash]);

   $flag = cmp_numline($x,$y,$n [,$sep]);
   $flag = cmp_rev_numline($x,$y,$n [,$sep]);

These are similar but will sort numerically if the Nth field is numerical, and alphabetically otherwise.

MISC. ROUTINES

sort_valid_method
cmp_valid_method
   use Sort::DataTypes qw(:all)

   $flag = sort_valid_method($string);
   $flag = cmp_valid_method($string);

These are identical and return 1 if there is a valid sort method named $string in the module. For example, there is a function "sort_numerical" defined in this modules, but there is no function "sort_foobar", so the following would occur:

   sort_valid_method("numerical")
      => 1

   sort_valid_method("rev_numerical")
      => 1

   sort_valid_method("foobar")
      => 0

Note that the methods must NOT include the "sort_" or "cmp_" prefix, but the "rev_" prefix is allowed as shown in the example.

sort_by_method
cmp_by_method
   use Sort::DataTypes qw(:all)

   $flag = sort_by_method($method,\@list [,@args]);
   $flag = cmp_by_method ($method,$ele1,$ele2 [,@args]);

These sort a list, or compare two elements, using the given method (which is any string which returns 1 when passed to sort_valid_method).

If the method is not valid, the list is left untouched.

BACKWARDS INCOMPATIBILITIES

The following are a list of backwards incompatibilities.

Version 2.00 handling of hashes

In version 1.xx, when sorting by hash, the hash was passed in as the hash. As of 2.00, it is passed in by reference to avoid any confusion with optional arguments.

KNOWN PROBLEMS

None at this point.

LICENSE

This script is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

AUTHOR

Sullivan Beck (sbeck@cpan.org)