

NAME

Data::Range::Compare::Stream::Cookbook::CustomFileFormat - HOW TO Change the Parser Functionality

SYNOPSIS

  use Data::Range::Compare::Stream;
  use Data::Range::Compare::Stream::Iterator::File::MergeSortAsc;
  use Data::Range::Compare::Stream::Iterator::Compare::Asc;
  use Data::Range::Compare::Stream::Iterator::Consolidate::OverlapAsColumn;
  
  my $cmp=new Data::Range::Compare::Stream::Iterator::Compare::Asc;
  
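  # File 1: columns 5 and 6 (indexes 4 and 5) hold the range start and end.
  # The original line is kept as the third element.
  sub parse_file_one {
    my ($line)=@_;
    my @list=split /\s+/,$line;
    return [@list[4,5],$line];
  }
  
  # File 2: columns 3 and 4 (indexes 2 and 3) hold the range start and end.
  sub parse_file_two {
    my ($line)=@_;
    my @list=split /\s+/,$line;
    return [@list[2,3],$line];
  }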
  
  sub range_to_line {
    my ($range)=@_;
    return $range->data;
  }
  
  my $file_one=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
    result_to_line=>\&range_to_line,
    parse_line=>\&parse_file_one,
    filename=>'custom_file_1.src',
  );
  
  my $file_two=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
    result_to_line=>\&range_to_line,
    parse_line=>\&parse_file_two,
    filename=>'custom_file_2.src',
  );
  
  my $set_one=new Data::Range::Compare::Stream::Iterator::Consolidate::OverlapAsColumn($file_one,$cmp);
  my $set_two=new Data::Range::Compare::Stream::Iterator::Consolidate::OverlapAsColumn($file_two,$cmp);
  
  $cmp->add_consolidator($set_one);
  $cmp->add_consolidator($set_two);
  
  while($cmp->has_next) {
    my $result=$cmp->get_next;
    # skip results that contain no ranges
    next if $result->is_empty;
  
    my $ref=$result->get_root_results;
    # skip results where either file contributes no range
    next if $#{$ref->[0]}==-1;
    next if $#{$ref->[1]}==-1;
  
    foreach my $overlap (@{$ref->[0]}) {
      print $overlap->get_common->data;
    }
  
  }

DESCRIPTION

This POD explains how to create custom callbacks for various file formats and how to sort very large data files. These examples apply to Data::Range::Compare::Stream::Iterator::File and Data::Range::Compare::Stream::Iterator::File::MergeSortAsc.

What we want to accomplish.

File 1 Format

Column 5 represents the starting value for our range. Column 6 represents the ending value for our range.

  101_#2          1       2    F0       263        278        2       1.5
  102_#1          1       6    F1       766        781        1       1.0
  103_#1          2       15   V1       526        581        1       0.0
  103_#1          2       9    V2       124        134        1       1.3
  104_#1          1       12   V3       137        172        1       1.0
  105_#1          1       17   F2       766        771        1       1.0

File 2 Format

Column 3 represents the starting value of our range. Column 4 represents the ending value for our range.

  97486   9   262           279
  67486   9   118           119
  87486   9   183           185
  248233  9   124           134

Creating Functions to parse the file formats ( Problem_1 Solution )

Parsing files 1 and 2 is not difficult; the real question is where to keep the original line. Luckily, Data::Range::Compare::Stream provides a data method for associating custom data with a range.

The constructor for Data::Range::Compare::Stream takes 2 or 3 arguments. Arguments 1 and 2 are the start and end values of the range; the optional argument 3 can be retrieved later via $range->data.

So our parser for each file format is slightly different.
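
As a minimal sketch of the two parsers, the snippet below applies the same split-and-slice logic shown in the SYNOPSIS to one sample line from each file. The sample lines are copied from the tables above. It assumes that lines do not begin with whitespace, since split /\s+/ would otherwise produce an empty leading field and shift every column index by one.

```perl
use strict;
use warnings;

# File 1: columns 5 and 6 (indexes 4 and 5) hold the range start and end.
sub parse_file_one {
  my ($line)=@_;
  my @list=split /\s+/,$line;
  return [@list[4,5],$line];
}

# File 2: columns 3 and 4 (indexes 2 and 3) hold the range start and end.
sub parse_file_two {
  my ($line)=@_;
  my @list=split /\s+/,$line;
  return [@list[2,3],$line];
}

# One sample line from each file, including the trailing newline.
my $line_one="101_#2          1       2    F0       263        278        2       1.5\n";
my $line_two="97486   9   262           279\n";

my $parsed_one=parse_file_one($line_one);
my $parsed_two=parse_file_two($line_two);

# Each result is [start, end, original line].
printf "file 1: start=%s end=%s\n",@{$parsed_one}[0,1];
printf "file 2: start=%s end=%s\n",@{$parsed_two}[0,1];
```

Both parsers return the untouched line as the third element, which is what later ends up in $range->data.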

Sorting our massive files ( Solution to Problem_2 )

As stated, our files are far too large to be sorted in memory. Fortunately, Data::Range::Compare::Stream offers an on-disk merge sort, but to use it we need to convert our ranges back to the original format they were parsed from.

The parser functions for both File1 and File2 save the original line in $range->data, so our serialization function simply needs to return the value of $range->data.

  sub range_to_line {
    my ($range)=@_;
    return $range->data;
  }
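
The round trip from line to range and back to line can be sketched without the library installed. The My::Range package below is a hypothetical stand-in that only mimics the 3-argument constructor and the data method described above; it is not the real Data::Range::Compare::Stream class. Sorting in memory by start value is likewise an assumption made for illustration; the on-disk merge sort and its exact ordering rules are not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Data::Range::Compare::Stream (illustration only).
package My::Range;
sub new {
  my ($class,$start,$end,$data)=@_;
  return bless {start=>$start,end=>$end,data=>$data},$class;
}
sub range_start { $_[0]{start} }
sub data { $_[0]{data} }

package main;

sub parse_file_two {
  my ($line)=@_;
  my @list=split /\s+/,$line;
  return [@list[2,3],$line];
}

sub range_to_line {
  my ($range)=@_;
  return $range->data;
}

# Sample lines copied from File 2.
my @lines=(
  "97486   9   262           279\n",
  "67486   9   118           119\n",
  "87486   9   183           185\n",
  "248233  9   124           134\n",
);

# Parse each line into a range object, keeping the original line as data.
my @ranges=map { My::Range->new(@{parse_file_two($_)}) } @lines;

# Sort by start value, then serialize each range back to its original line.
my @sorted=map { range_to_line($_) }
  sort { $a->range_start <=> $b->range_start } @ranges;

print @sorted;
```

Because the original line travels with the range, the serialized output is byte-for-byte identical to the input lines, only reordered.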

File1 Sorted Iterator Example:

  my $file_one=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
    result_to_line=>\&range_to_line,
    parse_line=>\&parse_file_one,
    filename=>'custom_file_1.src',
  );

File2 Sorted Iterator Example:

  my $file_two=new Data::Range::Compare::Stream::Iterator::File::MergeSortAsc(
    result_to_line=>\&range_to_line,
    parse_line=>\&parse_file_two,
    filename=>'custom_file_2.src',
  );

Showing what ranges overlap ( Solving Problem_3 )

The sample data for File1 contains overlapping ranges (two rows share the start value 766), so we can safely assume File2 will as well. Given that, we need to retain each overlapping range as the files are iterated through.
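
To make the idea of an overlap between the two files concrete, here is a minimal sketch that compares the start and end pairs from the sample data and collects the common range of every overlapping pair. It is a brute-force illustration, not the streaming approach the module uses. The helper name common_range is hypothetical, and ranges are assumed to be inclusive at both ends.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [start,end] of the shared span, or undef if none.
sub common_range {
  my ($one,$two)=@_;
  my $start=$one->[0] > $two->[0] ? $one->[0] : $two->[0];
  my $end=$one->[1] < $two->[1] ? $one->[1] : $two->[1];
  return $start<=$end ? [$start,$end] : undef;
}

# Start and end pairs copied from the sample data for File 1 and File 2.
my @file_one=([263,278],[766,781],[526,581],[124,134],[137,172],[766,771]);
my @file_two=([262,279],[118,119],[183,185],[124,134]);

# Compare every File 1 range with every File 2 range.
my @overlaps;
foreach my $one (@file_one) {
  foreach my $two (@file_two) {
    my $common=common_range($one,$two);
    push @overlaps,$common if defined $common;
  }
}

print join(' - ',@$_),"\n" for @overlaps;
```

Comparing every pair is only workable for a handful of rows; for large files the sorted iterators and the compare object shown in the SYNOPSIS do this work as the files are read.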

AUTHOR

Michael Shipper

Source-Forge Project

As of version 0.001 the Project has been moved to Source-Forge.net

Data Range Compare https://sourceforge.net/projects/data-range-comp/

COPYRIGHT

Copyright 2011 Michael Shipper. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
