The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Sort::External - sort huge lists

VERSION

0.04

This is ALPHA release software. The interface may change. However, it's simple enough that it probably won't stay in alpha very long. Please drop a line to the author if you are using it successfully -- a couple happy customers and we'll move from alpha to beta.

SYNOPSIS

    my $sortex = Sort::External->new;
    while (<HUGEFILE>) {
        $sortex->feed( $_ );
    }
    $sortex->finish;
    my $stuff;
    while (defined($stuff = $sortex->fetch)) {
        &do_stuff_with( $stuff );
    }

DESCRIPTION

Problem: You have a list which is too big to sort in-memory. Solution: Use Sort::External, the closest thing to a drop-in replacement for Perl's sort() function when dealing with unmanageably large lists.

Where's the sortex() function?

In a perfect world, Sort::External would export a sortex() function that you could swap with sort() and be done. That isn't possible, because it would have to return a list which would, in all likelihood, be too large to fit in memory -- otherwise, why use Sort::External in the first place?

Replacing sort() with Sort::External

When you replace sort() with the "feed, finish, fetch" cycle of a Sort::External object, there are two things to watch out for.

  • -line_separator -- Sort::External uses temp files to cache sortable items. If each item is terminated by a newline/CRLF, as would be the case for lines from a text file, then Sort::External has no problem figuring out where items begin and end when reading back from disk. If that's not the case, you need to set a -line_separator, as documented below.

  • undef values and references -- Perl's sort() function sorts undef values to the front of a list -- it will complain if you have warnings enabled, but it preserves their undef-ness. sort() also preserves references. In contrast, Sort::External's behavior is unpredictable and almost never desirable when you feed it either undef values or references. If you really care about sorting lists containing undefs or refs, you'll have to symbollically replace and restore them yourself.

Memory management

By default, Sort::External's cache size is 10,000 items. If your items are large, you may need to decrease the cache size; if they are small, you might improve Sort::External's performance somewhat by increasing the cache size.

Because Perl doesn't give you the kind of responsibility for memory management that C does (thank goodness), there isn't a reliable way to flush the cache automatically based on analyzing memory consumption that doesn't impose a severe penalty on performance. If you want to max out the speed of Sort::External, here are two tips: 1) maximum memory usage will likely occur when finish() is called, and 2) don't cut things close, because there isn't that much to gain and there's a lot to lose if you go over the edge and start hitting swap.

METHODS

new()

    my $sortscheme = sub { $Sort::External::b <=> $Sort::External::a };
    my $sortex = Sort::External->new(
        -sortsub         => $sortscheme,      # default sort: standard lexical
        -working_dir     => $temp_directory,  # default: see below
        -line_separator  => 'random',         # default: $/
        -cache_size      => 100_000,          # default: 10_000;
        );

Construct a Sort::External object.

  • -sortsub -- A sorting subroutine. Be advised that you MUST use $Sort::External::a and $Sort::External::b instead of $a and $b in your sub.

  • -working_dir -- The directory where the temporary sortfiles will reside. By default, this directory is created using File::Temp's tempdir() command.

  • -line_separator -- By default, Sort::External assumes that your items are already terminated with a newline/CRLF or whatever your system considers to be a line ending (see the perlvar documentation for $/). If that's not true, you have two options: 1) specify your own value for -line_separator (which Sort::External will append to each item when storing and chomp() away upon retrieval) or 2) specify 'random', in which case Sort::External will use a random 16-byte string suitable for delimiting arbitrary binary data.

  • -cache_size -- The size for each of Sort::External's caches, in sortable items.

feed()

    $sortex->feed( @items );

Feed one or more sortable items to your Sort::External object. It is normal for occasional pauses to occur as sortfiles are merged.

finish()

    $sortex->finish( -outfile => 'sorted.txt' );
    ### or, if you intend to call fetch...
    $sortex->finish; 

Prepare to output items in sorted order. If you haven't yet exceeded the cache size, Sort::External never writes to disk -- it just sorts the items in-memory.

If you specify the parameter -outfile, Sort::External will attempt to write your sorted list to that outfile (it will croak() if the file already exists).

Note that you can either have finish() write to an outfile, or finish() then fetch()... but not both.

fetch()

    while (my $stuff = $sortex->fetch) {
        &do_stuff_with( $stuff );
    }

Fetch the next sorted item.

DISCUSSION

"internal" vs. "external" sorting

In the CS world, "internal sorting" refers to sorting data in RAM, while "external sorting" refers to sorting data which is stored on disk, tape, or any storage medium except RAM. The main goal when implementing an external sorting algorithm is to minimize disk I/O. Sort::External's routine can be summarized like so:

Cache sortable items in memory. Every X items, sort the cache and empty it into a temporary sortfile. As sortfiles accumulate, interleave them periodically into larger sortfiles. Use caching extensively during the interleaving process to minimize I/O. Complete the sort by emptying the input cache then interleaving the contents of all existing sortfiles into an output stream.

BUGS

Please report any bugs or feature requests to bug-sort-external@rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Sort-External.

ACKNOWLEDGEMENTS

The code in Sort::External was originally developed as part of the Search::Kinosearch suite. Chris Nandor helped debug that early version and made some excellent suggestions which have been incorporated into the present distribution.

SEE ALSO

File::Sort, File::MergeSort, and Sort::Merge as possible alternatives.

AUTHOR

Marvin Humphrey <marvin at rectangular dot com> http://www.rectangular.com

COPYRIGHT

Copyright (c) 2005 Marvin Humphrey. All rights reserved. This module is free software. It may be used, redistributed and/or modified under the same terms as Perl itself.