NAME

Data::Range::Compare::Stream::Iterator::File::MergeSortAsc - On Disk Merge Sort for really big data sets!

SYNOPSIS

  use Data::Range::Compare::Stream;
  use Data::Range::Compare::Stream::Iterator::File;
  use Data::Range::Compare::Stream::Iterator::File::MergeSortAsc;

  my $iterator=Data::Range::Compare::Stream::Iterator::File::MergeSortAsc->new(
    filename=>'somefile.csv',
  );

  while($iterator->has_next) {
    my $next_range=$iterator->get_next;
    print $next_range,"\n";
  }

DESCRIPTION

This module extends Data::Range::Compare::Stream::Iterator::Base and provides an on-disk merge sort for objects that implement or extend Data::Range::Compare::Stream::Iterator::Base.
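To illustrate the general idea of an on-disk merge sort, here is a minimal, self-contained sketch. It is not this module's implementation: the helper names (by_range, spool_sorted_runs, merge_runs) are hypothetical, the input is assumed to be numeric "start end" lines, and the merge step is a plain k-way merge over temporary files.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical comparator: order "start end" lines by start, then by end.
sub by_range {
  my ($left, $right) = @_;
  my ($ls, $le) = split /\s+/, $left;
  my ($rs, $re) = split /\s+/, $right;
  return $ls <=> $rs || $le <=> $re;
}

# Read lines in buckets of at most $bucket_size, sort each bucket in memory,
# and spool it to its own temp file. Returns the rewound file handles.
sub spool_sorted_runs {
  my ($in_fh, $bucket_size) = @_;
  my (@runs, @bucket);
  my $flush = sub {
    return unless @bucket;
    my $fh = tempfile();                      # scalar context: handle only
    print {$fh} "$_\n" for sort { by_range($a, $b) } @bucket;
    seek $fh, 0, 0;                           # rewind so the run can be read
    push @runs, $fh;
    @bucket = ();
  };
  while (my $line = <$in_fh>) {
    chomp $line;
    next unless length $line;
    push @bucket, $line;
    $flush->() if @bucket >= $bucket_size;
  }
  $flush->();
  return @runs;
}

# Merge the sorted runs, holding only one pending line per run in memory.
sub merge_runs {
  my ($emit, @runs) = @_;
  my @heads = map { scalar readline($_) } @runs;
  while (1) {
    my $best;
    for my $i (0 .. $#runs) {
      next unless defined $heads[$i];
      $best = $i if !defined $best || by_range($heads[$i], $heads[$best]) < 0;
    }
    last unless defined $best;
    chomp(my $line = $heads[$best]);
    $emit->($line);
    $heads[$best] = readline($runs[$best]);   # undef once the run is exhausted
  }
  return;
}

# Usage: sort five ranges with a bucket size of two.
my $data = "10 20\n1 5\n7 9\n3 4\n1 2\n";
open my $in, '<', \$data or die "cannot open in-memory file: $!";
my @sorted;
merge_runs(sub { push @sorted, shift }, spool_sorted_runs($in, 2));
print "$_\n" for @sorted;
```

Memory use in this sketch is bounded by the bucket size during spooling and by the number of runs during merging, which is the property that makes the approach suitable for files that do not fit in memory.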

OO Methods

  • my $iterator=Data::Range::Compare::Stream::Iterator::File::MergeSortAsc->new(key=>value);

    Instance constructor.

    At least one of the following arguments is required:

      filename=>'source_file.csv'  
        # the value is treated as an absolute or relative path to the file.
    
      file_list=>[]
        # An array ref of file names in absolute or relative paths
          
      iterator_list=>[]
       # an array ref of objects that implement or extend Data::Range::Compare::Stream::Iterator::Base

    Optional Arguments:

       auto_prepare=>0|1
         # Default: 0, If set to 1 sort operations happen on object creation.
    
       unlink_result_file=>1|0
         # Default: 1, If set to 0 the sorted result file will not be deleted
    
       bucket_size=>4000
         # sets the number of ranges to be pre-sorted
         # 2 buckets are created, so the number of objects loaded into memory is bucket_size * 2
    
       NEW_ITERATOR_FROM=>'Data::Range::Compare::Stream::Iterator::File'
         # sets the file iterator object to be used when loading spooled files for merging
         # make sure you load or require the object class being passed in as an argument!
    
       NEW_ARRAY_ITERATOR_FROM=>'Data::Range::Compare::Stream::Iterator::Array'
         # sets the array iterator class
    
       NEW_FROM=>'Data::Range::Compare::Stream',
         # deprecated but still supported, see factory_instance.
         # sets the object class new ranges will be created from
         # This argument is passed to objects being constructed from: NEW_ITERATOR_FROM
    
       factory_instance=>$obj
         # defines the object that implements $obj->factory($start,$end,$data).
         # new ranges are constructed through the factory interface.  If no factory
         # instance is provided, an instance of Data::Range::Compare::Stream is assumed.
    
    
       parse_line=>undef|code_ref
         # Default: undef, Sets the code ref to be used when parsing a line
         # if not set the default internals will be used
         # This argument is passed to objects being constructed from: NEW_ITERATOR_FROM
    
       result_to_line=>undef|code_ref
         # Default: undef, Sets the code ref used to convert a result to a line that can be parsed
         # if not set the default internals will be used
         # This argument is passed to objects being constructed from: NEW_ITERATOR_FROM
    
       sort_func=>undef|code_ref
         # Default: undef, Sets the code ref used for comparing objects in the sort process
         # if not set the default internals are used.
    
       tmpdir=>undef|'/some/folder'
         # If tmpdir is defined, its value is passed to File::Temp->new(DIR=>$self->{tmpdir});
  • my $class=$iterator->NEW_FROM;

    Returns the Class that new Range objects are constructed from.

  • my $class=$iterator->NEW_ITERATOR_FROM;

    $class will contain the name of the class new file Iterators are to be constructed from.

  • my $class=$iterator->NEW_ARRAY_ITERATOR_FROM;

    $class will contain the name of the class new array Iterators are constructed from.

  • while($iterator->has_next) { ... }

    Returns true when there are more rows to fetch.

  • my $result=$iterator->get_next;

    Returns the next $result from the given source file.

  • my $line=$iterator->result_to_line($range);

    Given a $result from $iterator->get_next, this method converts the range object it contains into a line that can be parsed by $iterator->parse_line($line). Think of this function as a data serializer for range objects generated by an $iterator object. When overloading this function or supplying a callback, make sure the line produced by result_to_line can be parsed by parse_line.

      sub result_to_line {
        my ($self,$result)=@_;
        return $self->{result_to_line}->($result) if defined($self->{result_to_line});
    
        my $range=$result->get_common;
        my $line=$range->range_start_to_string.' '.$range->range_end_to_string."\n";
        return $line;
      }
  • my $ref=$iterator->parse_line($line);

    Given a $line, returns the arguments required to construct an object that extends or implements Data::Range::Compare::Stream. When overloading this function or passing a callback to the constructor, make sure result_to_line produces lines in the format parse_line expects.

      sub parse_line {
        my ($self,$line)=@_;
        return $self->{parse_line}->($line) if defined($self->{parse_line});
        chomp $line;
        [split /\s+/,$line];
      }
  • my $cmp=$iterator->sort_method($left_range,$right_range);

    This is the internal object compare function used when sorting.

      sub sort_method {
        my ($self,$left_range,$right_range)=@_;
        
        return $self->{sort_func}->($left_range,$right_range) if $self->{sort_func};
        my $cmp=sort_in_consolidate_order_asc($left_range->get_common,$right_range->get_common);
    
        return $cmp;
      }
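The three callbacks accepted by the constructor (parse_line, result_to_line, sort_func) can be sketched as follows for a comma-separated "start,end" format. This is an illustration under stated assumptions: My::MockRange is a hypothetical stand-in that implements only the methods shown in the default code above (get_common, range_start_to_string, range_end_to_string), ranges are assumed to be numeric, and the ordering chosen for sort_func (start ascending, then end ascending) is an example, not the module's default.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a result/range object, so the callbacks can be
# exercised without loading the real classes.
package My::MockRange;
sub new {
  my ($class, $start, $end) = @_;
  return bless { start => $start, end => $end }, $class;
}
sub range_start_to_string { $_[0]{start} }
sub range_end_to_string   { $_[0]{end} }
sub get_common            { $_[0] }   # the mock acts as its own result wrapper

package main;

# parse_line callback: receives one line, returns [$start, $end].
my $parse_line = sub {
  my ($line) = @_;
  chomp $line;
  return [ split /,/, $line ];
};

# result_to_line callback: receives a result, returns a line that
# $parse_line can read back.
my $result_to_line = sub {
  my ($result) = @_;
  my $range = $result->get_common;
  return join(',', $range->range_start_to_string, $range->range_end_to_string) . "\n";
};

# sort_func callback: receives two results, returns -1, 0 or 1.
# Assumption: the stringified start and end values are plain numbers.
my $sort_func = sub {
  my ($left, $right) = @_;
  my ($l, $r) = ($left->get_common, $right->get_common);
  return $l->range_start_to_string <=> $r->range_start_to_string
      || $l->range_end_to_string   <=> $r->range_end_to_string;
};

# Round trip: serialize a range, parse the line, and rebuild the range.
my $line = $result_to_line->(My::MockRange->new(3, 7));
my $args = $parse_line->($line);
my $copy = My::MockRange->new(@$args);
print "round trip ok\n"
  if $copy->range_start_to_string == 3 && $copy->range_end_to_string == 7;
```

With the real module, these code refs would be passed as parse_line=>$parse_line, result_to_line=>$result_to_line and sort_func=>$sort_func in the constructor arguments described above.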

SEE ALSO

Data::Range::Compare::Stream::Cookbook

AUTHOR

Michael Shipper

SourceForge Project

As of version 0.001 the project has moved to SourceForge.net.

Data Range Compare https://sourceforge.net/projects/data-range-comp/

COPYRIGHT

Copyright 2011 Michael Shipper. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.