Casiano Rodriguez-Leon > GRID-Machine-0.127 > GRID::Machine::perlparintro

Download:
GRID-Machine-0.127.tar.gz

Annotate this POD

CPAN RT

Open  0
View/Report Bugs
Source  

NAME ^

GRID::Machine::perlparintro - A brief and basic introduction to Parallel Distributed Computing in Perl

SYNOPSIS ^

    $ time gridpipes.pl 1 1000000000
    Process 0: machine = beowulf partial = 3.141593 pi = 3.141593
    Pi = 3.141593. N = 1000000000 Time = 27.058693

    real    0m28.917s
    user    0m0.584s
    sys     0m0.192s

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000
    Process 0: machine = beowulf partial = 1.570796 pi = 1.570796
    Process 1: machine = orion partial = 1.570796 pi = 3.141592
    Pi = 3.141592. N = 1000000000 Time = 15.094719

    real    0m17.684s
    user    0m0.904s
    sys     0m0.260s


    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 3 1000000000
    Process 0: machine = beowulf partial = 1.047198 pi = 1.047198
    Process 1: machine = orion partial = 1.047198 pi = 2.094396
    Process 2: machine = nereida partial = 1.047198 pi = 3.141594
    Pi = 3.141594. N = 1000000000 Time = 10.971036

    real    0m13.700s
    user    0m0.952s
    sys     0m0.240s

SUMMARY ^

The total computational power of institutions as a whole has dramatically rised in the last decades, but due to distributed ownership and administration restrictions, individuals are not able to capitalize such computing power. Many machines sit idle for very long periods of time while their owners are busy doing other things. Many of them run some sort of UNIX, have Perl installed and provide SSH access.

If such is your scenario you can use GRID::Machine to have perl interpreters running in those nodes and make them collaborate to give you more computational power and more fun. All this without having to ask administrators or having to install any additional software.

This tutorial introduces the basics of parallel computing by means of a simple program that distributes the evaluation of some mathematical expression between several machines. The computational results show that - when the problem is large enough - a substantial improving is gained in performance: The execution times is reduced to the half by using two machines.

REQUIREMENTS ^

To experiment with the examples in this tutorial you will need at least two Unix machines with Perl and SSH. If you are not familiar with Perl or Linux this module probably isn't for you. If you are not familiar with SSH, see

BUILDING A "FUZZY" PARALLEL CLUSTER ^

SSH includes the ability to authenticate users using public keys. Instead of authenticating the user with a password, the SSH server on the remote machine will verify a challenge signed by the user's private key against its copy of the user's public key. To achieve this automatic ssh-authentication you have to:

A PARALLEL ALGORITHM ^

In this manual we have used several times as an example the computation of an approach to number Pi (3.14159...) using numerical integration. To understand it, take into account that the area under the curve 1/(1+x**2) between 0 and 1 is Pi/4 = (3.1415...)/4 as it shows the following debugger session:

  pp2@nereida:~/public_html/cgi-bin$ perl -wde 0
  main::(-e:1):   0
    DB<1>  use Math::Integral::Romberg 'integral'
    DB<2> p integral(sub { my $x = shift; 4/(1+$x*$x) }, 0, 1);
  3.14159265358972

The module Math::Integral::Romberg provides the function integral that allow us to compute the area of a given function in some interval. In fact - if you remember your high school days - it is easy to see the reason: the integral of 4/(1+$x*$x) is 4*arctg($x) and so its area between 0 and 1 is given by:

        4*(arctg(1) - arctg(0)) = 4 * arctg(1) = 4 * Pi / 4 = Pi

This is not, in fact, a good way to compute Pi, but makes a good example of how to exploit several machines to fulfill a task.

To compute the area under 4/(1+$x*$x) we have divided up the interval [0,1] into sub-intervals of size 1/N and add up the areas of the small rectangles with base 1/N and height the value of the curve 4/(1+$x*$x) in the middle of the interval. The following debugger session illustrates the idea:

  pp2@nereida:~$ perl -wde 0
  main::(-e:1):   0
  DB<1> use List::Util qw(sum)
  DB<2> $N = 6
  DB<3> @divisions = map { $_/$N } 0..($N-1)
  DB<4> sub f { my $x = shift; 4/(1+$x*$x) }
  DB<5> @halves = map { $_+0.5/$N } @divisions
  DB<6> $area = sum(map { f($_)/$N } @halves)
  DB<7> p $area
  3.14390742722244

To optimize the execution time we distribute the sum in line 6 $area = sum(map { f($_)/$N } @halves) among several processes. The processes are numbered from 0 to np-1. Each process sums up the areas of roughly N/np intervals. We can spedup the computation if each process is allocated to a different processor (or core).

To achieve a higher performance the code to execute on each machine is written in C:

  ~/src/perl/grid-machine/examples$ cat -n pi.c
     1  #include <stdio.h>
     2  #include <stdlib.h>
     3  
     4  main(int argc, char **argv) {
     5    int id, N, np, i;
     6    double sum, left;
     7  
     8    if (argc != 4) { 
     9      printf("Usage:\n%s id N np\n",argv[0]);
    10      exit(1); 
    11    }
    12    id = atoi(argv[1]);
    13    N = atoi(argv[2]);
    14    np = atoi(argv[3]);
    15    for(i=id, sum = 0; i<N; i+=np) {
    16      double x = (i + 0.5)/N;
    17      sum += 4 / (1 + x*x);
    18    }
    19    sum /= N;
    20    printf("%lf\n", sum);
    21    exit(0);
    22  }

The program receives (lines 8-14) three arguments: The first one, id, identifies the machine with a logical number, the second one, N, is the total number of intervals, the third np is the number of machines being used. Notice the for loop at line 15: Processor id sums up the areas corresponding to intervals id, id+np, id+2*np, etc. The program concludes writing to STDOUT the partial sum.

Observe that, since we aren't using infinite precision numbers errors introduced by rounding and truncation imply that increasing N would not lead to a more precise evaluation of Pi.

To get the executable we have a simple Makefile:

  pp2@nereida:~/LGRID_Machine/examples$ cat -n Makefile
     1  pi:
     2          cc pi.c -o pi

PARALLEL COMPUTING MADE EASY: USING GRID::Machine::Group ^

The GRID::Machine::Group module provides Parallel Remote Procedure Calls (RPC) to a cluster of machines via a SSH connection. The result of a remote call is a GRID::Machine::Group::Result object.

The call at line 31 returns a new instance of a GRID::Machine::Group object. The object is blessed in a unique class that inherits from GRID::Machine::Group. That is, the new object is a singleton. When later the cluster object GRID::Machine::Group is provided with new methods, those are installed in the singleton class.

  $ cat -n pi7.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3  use GRID::Machine;
     4  use GRID::Machine::Group;
     5  use List::Util qw(sum);
     6  
     7  my @MACHINE_NAMES = split /\s+/, $ENV{MACHINES};
     8  my $code = << 'EOFUNCTION';
     9     double sigma(int id, int N, int np) {
    10       double sum = 0;
    11       int i;
    12       for(i = id; i < N; i += np) {
    13           double x = (i + 0.5) / N;
    14           sum += 4 / (1 + x * x);
    15       }
    16       sum /= N; 
    17       return sum;
    18     }
    19  EOFUNCTION
    20  ;
    21  
    22  my @m = map { 
    23                GRID::Machine->new(
    24                   host => $_, 
    25                   wait => 5, 
    26                   uses => [ qq{Inline  'C' => q{$code}} ],
    27                   survive => 1,
    28                ) 
    29              } @MACHINE_NAMES;
    30  
    31  my $c = GRID::Machine::Group->new(cluster => [ @m ]);
    32  
    33  my ($N, $np, $pi)  = (1000, 4, 0);
    34  
    35  my @args = map {  [$_, $N, $np] } 0..$np-1;
    36  
    37  my @r = $c->eval(q{ sigma(@_) }, args => \@args)->Results;
    38  
    39  print sum(@r)."\n";

The eval method of GRID::Machine::Group (see line 37 in the example below) computes some code on each of the machines belonging to the cluster object.

In this example, Inline::C is used in the remote machines to make the computation more efficient. We assume that Inline::C is already installed in the remote machines.

Performance

The following code is a modification of the canonical example computing pi. Timers have been included (lines 28 and 31) to see the influence of the number of processors:

  $ cat -n pi.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3  use GRID::Machine;
     4  use GRID::Machine::Group;
     5  use List::Util qw(sum);
     6  use Time::HiRes qw(time gettimeofday tv_interval);
     7  
     8  my @MACHINE_NAMES = split /:+/, ($ENV{MACHINES} || '');
     9  @MACHINE_NAMES = ('', '') unless @MACHINE_NAMES;
    10  
    11  my @m = map { GRID::Machine->new(host => $_, wait => 5, survive => 1) } @MACHINE_NAMES;
    12  
    13  my $c = GRID::Machine::Group->new(cluster => [ @m ]);
    14  
    15  $c->sub(suma_areas => q{
    16     my ($id, $N, $np) = @_;
    17       
    18     my $sum = 0;
    19     for (my $i = $id; $i < $N; $i += $np) {
    20         my $x = ($i + 0.5) / $N;
    21         $sum += 4 / (1 + $x * $x);
    22     }
    23     $sum /= $N; 
    24  });
    25  
    26  my ($N, $np, $pi)  = (1e7, 4, 0);
    27  
    28  my @args = map {  [$_, $N, $np] } 0..$np-1;
    29  
    30  my $t0 = [gettimeofday];
    31  $pi = sum($c->suma_areas(args => \@args)->Results);
    32  my $elapsed = tv_interval ($t0);
    33  print "Pi = $pi. N = $N Time = $elapsed\n";

When executed in one machine, it takes 5.387676 seconds:

  ~/grid-machine$ export MACHINES='local'
  ~/grid-machine$ perl  examples/pi.pl
  Pi = 3.14159265358978. N = 10000000 Time = 5.059222

When executed in a cluster with two nodes we get a speedup of 1.93 = 5.06/2.62:

  ~/grid-machine$ export MACHINES='local imac'
  ~/grid-machine$ perl  examples/pi.pl
  Pi = 3.14159265358978. N = 10000000 Time = 2.625359

When executed in a machine with two cores, it also has an almost linear speedup:

  ~/grid-machine$ unset MACHINES
  ~/grid-machine$ perl  examples/pi.pl
  Pi = 3.14159265358978. N = 10000000 Time = 2.685503

WORKING HARDER: COORDINATING A CLUSTER ^

The program gridpipes.pl following in the lines below runs $np copies of the former C program in a set @machines of available machines, adding up the partial results as soon as they arrive.

  pp2@nereida:~/LGRID_Machine/examples$ cat -n gridpipes.pl
     1  #!/usr/bin/perl
     2  use warnings;
     3  use strict;
     4  use IO::Select;
     5  use GRID::Machine;
     6  use Time::HiRes qw(time gettimeofday tv_interval);

The first lines load the modules:

     8  my @machine = qw{beowulf orion nereida};
     9  my $nummachines = @machine;
    10  my %machine; # Hash of GRID::Machine objects
    11  #my %debug = (beowulf => 12345, orion => 0, nereida => 0);
    12  my %debug = (beowulf => 0, orion => 0, nereida => 0);
    13
    14  my $np = shift || $nummachines; # number of processes
    15  my $lp = $np-1;
    16
    17  my $N = shift || 100;
    18
    19  my @pid;  # List of process pids
    20  my @proc; # List of handles
    21  my %id;   # Gives the ID for a given handle
    22
    23  my $cleanup = 0;
    24
    25  my $pi = 0;
    26
    27  my $readset = IO::Select->new();

Variable @machine stores the IP addresses/names of the machines we have SSH access. These machines will constitute our 'virtual' parallel machine. For each of these machines (see the for loop in lines 30-46) a SSH connection is created (line 31) via GRID::Machine->new. The resulting GRID::Machine objects will be stored inside the hash %machine (line 44).

    29  my $i = 0;
    30  for (@machine){
    31    my $m = GRID::Machine->new(host => $_, debug => $debug{$_}, );
    32
    33    $m->copyandmake(
    34      dir => 'pi',
    35      makeargs => 'pi',
    36      files => [ qw{pi.c Makefile} ],
    37      cleanfiles => $cleanup,
    38      cleandirs => $cleanup, # remove the whole directory at the end
    39      keepdir => 1
    40    ); 
    41    $m->chdir("pi/");
    42    die "Can't execute 'pi'\n" unless $m->_x("pi")->result;
    43
    44    $machine{$_} = $m;
    45    last unless $i++ < $np;
    46  }

The call to copyandmake at line 33 copies (using scp) the files pi.c and Makefile to a directory named pi in the remote machine. The directory pi will be created if it does not exists. After the file transfer the command specified by the copyandmake option

                     make => 'command' 

will be executed with the arguments specified in the option makeargs. If the make option isn't specified but there is a file named Makefile between the transferred files, the make program will be executed. Set the make option to number 0 or the string '' if you want to avoid the execution of any command after the transfer. The transferred files will be removed when the connection finishes if the option cleanfiles is set. More radical, the option cleandirs will remove the created directory and all the files below it. Observe that the directory and the files will be kept if they were'nt created by this connection. The call to copyandmake by default sets dir as the current directory in the remote machine. Use the option keepdir => 1 to one to avoid this.

After the involved files are transferred and executables have been built, the program proceeds to open $np processes. The call to open at line 52 executes the pi program in the remote machine $m. In a list context returns a handler - that can be used to read from the process - and the PID of the child process. The new handler $proc[$_] is added to the IO::Select object $readset at line 53. The hash %id stores the relation between handlers and logical process identifiers.

    48  my $t0 = [gettimeofday];
    49  for (0..$lp) {
    50    my $hn = $machine[$_ % $nummachines];
    51    my $m = $machine{$hn};
    52    ($proc[$_], $pid[$_]) = $m->open("./pi $_ $N $np |");
    53    $readset->add($proc[$_]);
    54    my $address = 0+$proc[$_];
    55    $id{$address} = $_;
    56  }

During the last stage the master node simply waits in the IO::Select object listening on each of the channels. As soon as a result is received it is added to the total sum for $pi:

    58  my @ready;
    59  my $count = 0;
    60  do {
    61    push @ready, $readset->can_read unless @ready;
    62    my $handle = shift @ready;
    63
    64    my $me = $id{0+$handle};
    65
    66    my ($partial);
    67    my $numBytesRead = sysread($handle,  $partial, 1024);
    68    chomp($partial);
    69
    70    $pi += $partial;
    71    print "Process $me: machine = $machine[$me % $nummachines] partial = $partial pi = $pi\n";
    72
    73    $readset->remove($handle) if eof($handle);
    74  } until (++$count == $np);
    75
    76  my $elapsed = tv_interval ($t0);
    77  print "Pi = $pi. N = $N Time = $elapsed\n";

PERFORMANCE: COMPUTATIONAL RESULTS ^

Let us see the time it takes the execution of the pure C program on each of the involved nodes (nereida, beowulf and orion). To have an idea of how things work for a comptuation large enough we set $N to 1 000 000 000 intervals:

    pp2@nereida:~/LGRID_Machine/examples$ time ssh nereida 'pi/pi 0 1000000000 1'
    3.141593

    real    0m32.534s
    user    0m0.036s
    sys     0m0.008s

    pp2@nereida:~/LGRID_Machine/examples$ time ssh beowulf 'pi/pi 0 1000000000 1'
    3.141593

    real    0m27.020s
    user    0m0.036s
    sys     0m0.008s

    casiano@beowulf:~$ time ssh orion 'pi/pi 0 1000000000 1'
    3.141593

    real    0m29.120s
    user    0m0.028s
    sys     0m0.003s

As you can see, there is some heterogeneity here. Machine nereida (my desktop) is slower than the others two. beowulf is the fastest.

Now let us run the parallel perl program in nereida using only the beowulf node. The time spent is roughly comparable to the pure C time. That is nice: The overhead introduced by the coordination tasks is not as large (compare it with the beowulf entry above):

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 1 1000000000
    Process 0: machine = beowulf partial = 3.141593 pi = 3.141593
    Pi = 3.141593. N = 1000000000 Time = 27.058693

    real    0m28.917s
    user    0m0.584s
    sys     0m0.192s

Now comes the true test: will it be faster using two nodes? how much?

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000
    Process 0: machine = beowulf partial = 1.570796 pi = 1.570796
    Process 1: machine = orion partial = 1.570796 pi = 3.141592
    Pi = 3.141592. N = 1000000000 Time = 15.094719

    real    0m17.684s
    user    0m0.904s
    sys     0m0.260s

We can see that the sequential pure C version took 32 seconds in my desktop (nereida). By using two machines I have SSH access I have reduced that time to roughly 18 seconds. This a factor of 32/18 = 1.8 times faster. This factor is even better if I don't consider the set-up time: 32/15 = 2.1. The total time decreases if I use the three machines:

    pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 3 1000000000
    Process 0: machine = beowulf partial = 1.047198 pi = 1.047198
    Process 1: machine = orion partial = 1.047198 pi = 2.094396
    Process 2: machine = nereida partial = 1.047198 pi = 3.141594
    Pi = 3.141594. N = 1000000000 Time = 10.971036

    real    0m13.700s
    user    0m0.952s
    sys     0m0.240s

which gives a speed factor of 32/13.7 = 2.3 or not considering the set-up time 32/10.9 = 2.9.

What happens if you have multiprocessor machine. The results highly depend on the underlying architecture. My machine nereida is a dual Xeon:

  nereida:/tmp/graphviz-2.20.2# cat /proc/cpuinfo
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 15
  model           : 2
  model name      : Intel(R) Xeon(TM) CPU 2.66GHz
  stepping        : 5
  cpu MHz         : 2658.041
  cache size      : 512 KB
  physical id     : 0
  .......................................

  processor       : 1
  vendor_id       : GenuineIntel
  cpu family      : 15
  model           : 2
  model name      : Intel(R) Xeon(TM) CPU 2.66GHz
  stepping        : 5
  cpu MHz         : 2658.041
  cache size      : 512 KB
  physical id     : 0
  ...................................

After changing the Makefile to include the -O3 option and the line defining the set of machines in gridpipes.pl (addresses in the subnetwork 127.0.0 are mapped to localhost):

  my @machine = qw{127.0.0.1 127.0.0.2 127.0.0.3 127.0.0.4};

We have the following results:

  pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 1 1000000000
  Process 0: machine = 127.0.0.1 partial = 3.141593 pi = 3.141593
  Pi = 3.141593. N = 1000000000 Time = 32.968117

  real    0m33.858s
  user    0m0.336s
  sys     0m0.128s

  pp2@nereida:~/LGRID_Machine/examples$ time gridpipes.pl 2 1000000000
  Process 1: machine = 127.0.0.2 partial = 1.570796 pi = 1.570796
  Process 0: machine = 127.0.0.1 partial = 1.570796 pi = 3.141592
  Pi = 3.141592. N = 1000000000 Time = 16.552487

  real    0m18.076s
  user    0m0.504s
  sys     0m0.188s

Which gives an speed up near 2.

CONCLUSIONS, LIMITATIONS AND FUTURE WORK ^

This example shows how to take advantage of the computational power of idle stations and the very high level of programming offered by Perl to improve the performance of an application. High Performance Programming (HPP, as provided by Very High Level Languages) and High Performance Computing (HPC) have been always two opposites ends of the spectrum: if you optimize programmer's time (HPP) computing time suffers (HPC) and viceversa. A synergetic combination of HPC and HPP tools can bring the best of both worlds.

This example however is too simplistic and does not address some important limitations that point to what must be done. I look forward for CPAN modules filling these gaps:

SEE ALSO ^

THE FULL CODE ^

The Driver: File gridpipes.pl

  $ cat gridpipes.pl
  #!/usr/bin/perl
  use warnings;
  use strict;
  use IO::Select;
  use GRID::Machine;
  use Time::HiRes qw(time gettimeofday tv_interval);

  my @machine = qw{beowulf orion nereida};
  my $nummachines = @machine;
  my %machine; # Hash of GRID::Machine objects
  #my %debug = (beowulf => 12345, orion => 0, nereida => 0);
  my %debug = (beowulf => 0, orion => 0, nereida => 0);

  my $np = shift || $nummachines; # number of processes
  my $lp = $np-1;

  my $N = shift || 100;

  my @pid;  # List of process pids
  my @proc; # List of handles
  my %id;   # Gives the ID for a given handle

  my $cleanup = 1;

  my $pi = 0;

  my $readset = IO::Select->new();

  my $i = 0;
  for (@machine){
    my $m = GRID::Machine->new(host => $_, debug => $debug{$_}, );

    $m->copyandmake(
      dir => 'pi',
      makeargs => 'pi',
      files => [ qw{pi.c Makefile} ],
      cleanfiles => $cleanup,
      cleandirs => $cleanup, # remove the whole directory at the end
      keepdir => 1,
    );

    $m->chdir("pi/");

    die "Can't execute 'pi'\n" unless $m->_x("pi")->result;

    $machine{$_} = $m;
    last unless $i++ < $np;
  }

  my $t0 = [gettimeofday];
  for (0..$lp) {
    my $hn = $machine[$_ % $nummachines];
    my $m = $machine{$hn};
    ($proc[$_], $pid[$_]) = $m->open("./pi $_ $N $np |");
    $readset->add($proc[$_]);
    my $address = 0+$proc[$_];
    $id{$address} = $_;
  }

  my @ready;
  my $count = 0;
  do {
    push @ready, $readset->can_read unless @ready;
    my $handle = shift @ready;

    my $me = $id{0+$handle};

    my ($partial);
    my $numBytesRead = sysread($handle,  $partial, 1024);
    chomp($partial);

    $pi += $partial;
    print "Process $me: machine = $machine[$me % $nummachines] partial = $partial pi = $pi\n";

    $readset->remove($handle) if eof($handle);
  } until (++$count == $np);

  my $elapsed = tv_interval ($t0);
  print "Pi = $pi. N = $N Time = $elapsed\n";

The Application. File pi.c

  $ cat pi.c
  #include <stdio.h>
  #include <stdlib.h>

  main(int argc, char **argv) {
    int id, N, np, i;
    double sum, left;

    if (argc != 4) {
      printf("Usage:\n%s id N np\n",argv[0]);
      exit(1);
    }
    id = atoi(argv[1]);
    N = atoi(argv[2]);
    np = atoi(argv[3]);
    for(i=id, sum = 0; i<N; i+=np) {
      double x = (i + 0.5)/N;
      sum += 4 / (1 + x*x);
    }
    sum /= N;
    fflush(stdout);
    printf("%lf\n", sum);
  }

Makefile

  $ cat Makefile
  pi:
          cc pi.c -o pi
syntax highlighting: