###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

 :: These examples demonstrate various ways of parallelizing matrix
    multiplication with MCE. The examples make use of PDL::IO::FastRaw
    and PDL::Parallel::threads. PDL::Parallel::threads was created by
    David Mertens. PDL::IO::FastRaw comes pre-installed with PDL.

    One can diff the examples to compare the PDL::IO::FastRaw and
    PDL::Parallel::threads variants.

       diff  matmult_mce_f.pl  matmult_mce_t.pl
       diff  strassen_07_f.pl  strassen_07_t.pl
       diff  strassen_49_f.pl  strassen_49_t.pl

 :: All times below are reported in seconds.

    Examples will attempt to use /dev/shm under Linux for the raw MMAP
    files. Try the thread variants if /dev/shm is not writable on your
    system.

    Running the *_[df].pl examples on an Oracle server that makes use of
    /dev/shm is not recommended. Check the available size of /dev/shm
    before running.
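
    For example, a quick way to check the mount and its free space on Linux
    (assumes the standard df utility is available):

       df -h /dev/shm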

    Benchmark results were captured from a 24-way and a 32-way server.


###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

 -- Usage -------------------------------------------------------------------

 :: perl matmult.pl 1024 [ N_workers ]           ## Default matrix size 512
                                                 ## Default N_workers 8

    matmult_base.pl    PDL $c = $a x $b (1 worker)

    matmult_mce_d.pl   Uses MCE's do method to fetch (a) and store result (c)
                       Uses PDL::IO::FastRaw to read (b)

    matmult_mce_f.pl   MCE + PDL::IO::FastRaw

    matmult_mce_t.pl   MCE + PDL::Parallel::threads

    matmult_perl.pl    MCE + classic implementation in pure Perl

    matmult_simd.pl    Parallelization via PDL::Parallel::threads::SIMD

       The script was taken from https://gist.github.com/run4flat/4942132
       for folks wanting to review, study, and compare with MCE. The script
       was modified to support the optional N_threads argument. Both the
       script and SIMD module were created by David Mertens.


 :: perl strassen.pl 1024                        ## Default matrix size 512


    strassen_07_f.pl   MCE divide-and-conquer 1 level
                       PDL::IO::FastRaw, 7 workers

    strassen_07_t.pl   MCE divide-and-conquer 1 level
                       PDL::Parallel::threads, 7 workers

    strassen_49_f.pl   MCE divide-and-conquer 2-level submission using 1 MCE
                       PDL::IO::FastRaw, 49 workers

    strassen_49_t.pl   MCE divide-and-conquer 2-level submission using 1 MCE
                       PDL::Parallel::threads, 49 workers

    strassen_perl.pl   MCE divide-and-conquer implementation in pure Perl
                       7 workers


 :: Examples ending in *_[df].pl spawn children via fork. The
    matmult_simd.pl and *_t.pl examples utilize threads.
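
    For reference, below is a hedged sketch of how the *_t.pl variants opt
    into threads: loading threads (and threads::shared) before MCE causes
    MCE to spawn threads rather than fork, and the use_threads option may
    also be set explicitly. The worker body is illustrative only.

       use strict; use warnings;

       use threads;                ## load before MCE for thread workers
       use threads::shared;

       use MCE;

       my $mce = MCE->new(
          max_workers => 8,
          use_threads => 1,        ## redundant here; shown for clarity
          user_func   => sub {
             my ($mce) = @_;
             $mce->sendto('stdout', "hello from worker " . $mce->wid . "\n");
          },
       );

       $mce->run;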


###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

 :: The matmult_simd.pl example divides work equally among all workers.
    When work is divided equally, some workers may take longer than others
    due to other jobs running on the system. This lengthens the ramp-down
    time between when the first and last workers complete processing.

    MCE, on the other hand, can break up work into smaller chunks. The chunk
    size is determined by the matrix size. The step_size, for the sequence
    option, is one way of enabling chunking in MCE.

       my $step_size = ($tam > 2048) ? 24 : ($tam > 1024) ? 16 : 8;

    Chunking combined with the bank-teller queuing model helps reduce the
    ramp-down time at the end of the job. Please note that the
    matmult_mce_[df].pl examples spawn children whereas the other two
    spawn threads.

       matmult_mce_d 5120:  65.556s compute:  24 workers:  66.621s script
       matmult_mce_f 5120:  60.971s compute:  24 workers:  62.558s script
       matmult_mce_t 5120:  64.673s compute:  24 workers:  65.909s script
       matmult_simd  5120:  67.210s compute:  24 workers:  67.732s script

       The matmult_mce_d example fetches the next chunk of data and submits
       the result via the "do" method in MCE. Only the "b" matrix is read
       by workers via PDL::IO::FastRaw. I was not expecting this example
       to keep up, actually.

    The bank-teller queuing model is now applied for the sequence option
    beginning with MCE 1.406.

       $mce->run(0, {
          sequence => [ 0, $rows - 1, $step_size ]
       } );
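
    For illustration, below is a hedged, self-contained sketch that combines
    the pieces above: the step_size feeds the sequence option, each worker
    receives a starting row in $seq_n and processes up to step_size rows, and
    input/output travel through MCE's "do" method as in matmult_mce_d. The
    callback names (fetch_rows, store_rows), the 512x512 size, and the worker
    count are illustrative only; the bundled examples read the "b" matrix
    from a raw MMAP file rather than letting forked workers inherit it.

       use strict; use warnings;
       use PDL;
       use PDL::IO::Storable;  ## adds Storable hooks so piddles can pass
                               ## through MCE's Storable-based serialization
       use MCE;

       my $tam  = 512;         ## matrix dimension (illustrative)
       my $rows = $tam;
       my $step_size = ($tam > 2048) ? 24 : ($tam > 1024) ? 16 : 8;

       my $mat_a = sequence($tam, $tam);
       my $mat_b = sequence($tam, $tam);
       my $mat_c = zeroes($tam, $tam);

       ## These callbacks run in the manager process whenever a worker
       ## calls the "do" method.
       sub fetch_rows {
          my ($start, $stop) = @_;
          return $mat_a->slice(":,$start:$stop")->copy;
       }
       sub store_rows {
          my ($start, $stop, $result) = @_;
          $mat_c->slice(":,$start:$stop") .= $result;
          return;
       }

       my $mce = MCE->new(
          max_workers => 8,
          user_func   => sub {
             my ($mce, $seq_n, $chunk_id) = @_;
             my $start = $seq_n;
             my $stop  = $start + $step_size - 1;
             $stop = $rows - 1 if $stop >= $rows;

             my $a_chunk = $mce->do('fetch_rows', $start, $stop);
             my $result  = $a_chunk x $mat_b;   ## workers inherit $mat_b
             $mce->do('store_rows', $start, $stop, $result);
          },
       );

       $mce->run(0, {
          sequence => [ 0, $rows - 1, $step_size ]
       });
       $mce->shutdown;

       printf "(0,0) %s  (%d,%d) %s\n",
          $mat_c->at(0, 0), $tam - 1, $tam - 1, $mat_c->at($tam - 1, $tam - 1);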

 :: The time measured for MCE is the actual compute time and does not
    include the time to spawn workers, whereas the time for matmult_simd
    does. Please keep this in mind when comparing results. The script
    execution time is measured as well.

       e.g. time perl matmult_mce_f.pl 5120 24
    
    MCE has a one-time cost for creating a pool of workers, which can be
    recycled without having to spawn again. One could instantiate an MCE
    instance inside a perldl session and reuse the same instance multiple
    times. Hence, the one-time cost diminishes with repeated use.
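
    For illustration, a hedged sketch of reusing one instance so the spawn
    cost is paid only once; the sizes, step values, and empty worker body
    are placeholders:

       use strict; use warnings;
       use MCE;

       my $mce = MCE->new(
          max_workers => 8,
          user_func   => sub {
             my ($mce, $seq_n, $chunk_id) = @_;
             ## ... compute rows $seq_n .. $seq_n + step - 1 here ...
          },
       );

       $mce->spawn;                     ## one-time cost: create the pool

       for my $size (1024, 2048, 4096) {
          my $step = ($size > 2048) ? 24 : ($size > 1024) ? 16 : 8;
          $mce->run(0, {                ## 0 = do not auto-shutdown workers
             sequence => [ 0, $size - 1, $step ]
          });
       }

       $mce->shutdown;                  ## reap workers when done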

 :: The strassen examples apply the Strassen divide-and-conquer algorithm
    with modifications to recycle piddles and slices as much as possible,
    making efficient use of memory.

       http://en.wikipedia.org/wiki/Strassen_algorithm
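
    For reference, below is a minimal, serial one-level Strassen sketch in
    PDL. It shows the seven half-size products the algorithm computes; the
    bundled strassen_*.pl examples distribute these products across MCE
    workers and recycle piddles, which this sketch omits. All names are
    illustrative and the matrix dimension is assumed square and even.

       use strict; use warnings;
       use PDL;

       sub strassen_one_level {
          my ($m_a, $m_b) = @_;
          my $n = $m_a->dim(0);
          my $h = $n / 2;
          my $lo = "0:" . ($h - 1);
          my $hi = $h . ":" . ($n - 1);

          my ($a11, $a12, $a21, $a22) = map { $m_a->slice($_) }
             ("$lo,$lo", "$hi,$lo", "$lo,$hi", "$hi,$hi");
          my ($b11, $b12, $b21, $b22) = map { $m_b->slice($_) }
             ("$lo,$lo", "$hi,$lo", "$lo,$hi", "$hi,$hi");

          ## The 7 Strassen products; each is a half-size matrix multiply
          ## and is what the MCE examples hand out to workers.
          my $m1 = ($a11 + $a22) x ($b11 + $b22);
          my $m2 = ($a21 + $a22) x  $b11;
          my $m3 =  $a11         x ($b12 - $b22);
          my $m4 =  $a22         x ($b21 - $b11);
          my $m5 = ($a11 + $a12) x  $b22;
          my $m6 = ($a21 - $a11) x ($b11 + $b12);
          my $m7 = ($a12 - $a22) x ($b21 + $b22);

          ## Combine the products into the four quadrants of the result.
          my $c = zeroes($n, $n);
          $c->slice("$lo,$lo") .= $m1 + $m4 - $m5 + $m7;
          $c->slice("$hi,$lo") .= $m3 + $m5;
          $c->slice("$lo,$hi") .= $m2 + $m4;
          $c->slice("$hi,$hi") .= $m1 - $m2 + $m3 + $m6;

          return $c;
       }

       my $x = sequence(512, 512);
       my $y = sequence(512, 512);
       my $c = strassen_one_level($x, $y);
       print "max abs difference: ", abs($c - ($x x $y))->max, "\n";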

    Two implementations are provided, each in both a PDL::IO::FastRaw and
    a PDL::Parallel::threads variant (for sharing data).

       One-level submission, 1 MCE instance, 7 workers
       Two-level submission (all-at-once), 1 MCE instance, 49 workers

    Amazingly, the 49-worker implementation utilizing PDL::IO::FastRaw
    consumes slightly less memory than the PDL::Parallel::threads variant.


###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

 :: This system contains 2x Intel E5649 processors with 32GB 1066 MHz RAM.
    The OS is RHEL 6.3, Perl 5.10.1, perl-PDL-2.4.7-1 installed via yum.

 -- Results for 1024x1024 on a 24-way, 32GB box ------------------------------

 matmult_base:    2.686s compute:   1 worker:    2.894s script running time

 matmult_mce_d:   0.545s compute:  24 workers:   0.852s script
 matmult_mce_f:   0.479s compute:  24 workers:   0.824s script
 matmult_mce_t:   0.510s compute:  24 workers:   1.473s script (interesting)
 matmult_simd:    0.780s compute:  24 workers:   1.065s script

 strassen_07_f:   0.385s compute:   7 workers:   0.665s script
 strassen_07_t:   0.397s compute:   7 workers:   0.992s script
 strassen_49_f:   0.326s compute:  49 workers:   0.617s script (very nice)
 strassen_49_t:   0.353s compute:  49 workers:   1.969s script

 matmult_perl:   23.471s compute:  24 workers:  24.175s script
 strassen_perl:  44.685s compute:   7 workers:  45.119s script

 Output
    (0,0) 365967179776  (1023,1023) 563314846859776


 -- Results for 2048x2048 on a 24-way, 32GB box ------------------------------

 matmult_base:   21.521s compute:   1 worker:   21.783s script:   0.3% memory

 matmult_mce_d:   4.206s compute:  24 workers:   4.528s script:   3.5% memory
 matmult_mce_f:   3.483s compute:  24 workers:   4.017s script:   3.1% memory
 matmult_mce_t:   4.113s compute:  24 workers:   5.191s script:   0.9% memory
 matmult_simd:    4.617s compute:  24 workers:   4.901s script:   0.8% memory

 strassen_07_f:   1.951s compute:   7 workers:   2.249s script:   1.5% memory
 strassen_07_t:   1.934s compute:   7 workers:   2.576s script:   1.4% memory
 strassen_49_f:   1.486s compute:  49 workers:   1.823s script:   2.5% memory
 strassen_49_t:   1.494s compute:  49 workers:   3.192s script:   2.7% memory

 matmult_perl:  185.343s compute:  24 workers: 187.698s script:   9.7% memory
 strassen_perl: 319.708s compute:   7 workers: 320.969s script:   8.6% memory

 Output
    (0,0) 5859767746560  (2047,2047) 1.80202496872953e+16  matmul examples
    (0,0) 5859767746560  (2047,2047) 1.8020249687295e+16   strassen examples


 -- Results for 4096x4096 on a 24-way, 32GB box ------------------------------

 matmult_base:  172.145s compute:   1 worker:  172.145s script:   1.2% memory

 matmult_mce_d:  34.954s compute:  24 workers:  35.717s script:  12.0% memory
 matmult_mce_f:  36.457s compute:  24 workers:  37.336s script:  10.8% memory
 matmult_mce_t:  32.565s compute:  24 workers:  33.723s script:   1.8% memory
 matmult_simd:   34.161s compute:  24 workers:  34.614s script:   2.0% memory

 strassen_07_f:  12.701s compute:   7 workers:  13.186s script:   5.5% memory
 strassen_07_t:  12.964s compute:   7 workers:  13.671s script:   4.8% memory
 strassen_49_f:   8.867s compute:  49 workers:   9.338s script:   7.4% memory
 strassen_49_t:   8.918s compute:  49 workers:  10.761s script:   7.9% memory

 Output
    (0,0) 93790635294720  (4095,4095) 5.76554474219245e+17  matmul examples
    (0,0) 93790635294720  (4095,4095) 5.76554474219244e+17  strassen examples


 -- Results for 5120x5120 on a 24-way, 32GB box ------------------------------

 matmult_base:  336.464s compute:   1 worker:  336.867s script:   1.9% memory

 matmult_mce_d:  65.556s compute:  24 workers:  66.621s script:  18.2% memory
 matmult_mce_f:  60.971s compute:  24 workers:  62.558s script:  16.1% memory
 matmult_mce_t:  64.673s compute:  24 workers:  65.909s script:   2.5% memory
 matmult_simd:   67.210s compute:  24 workers:  67.732s script:   2.9% memory

 Output
    (0,0) 228997817958400  (5119,5119) 1.75944746804184e+18

 Observation
    The difference in memory utilization between matmult_mce_t and
    matmult_simd is largely due to MCE working on smaller chunks (24 rows).
    MCE would also utilize 2.9% had the work been divided equally among the
    workers (214 rows -- one chunk each).


 -- Results for 8192x8192 on a 24-way, 32GB box ------------------------------

 matmult_base: 1388.241s compute:   1 worker: 1388.888s script:   4.8% memory

 strassen_07_f:  83.562s compute:   7 workers:  84.397s script:  21.3% memory
 strassen_07_t:  85.366s compute:   7 workers:  86.559s script:  18.8% memory
 strassen_49_f:  57.019s compute:  49 workers:  57.893s script:  29.5% memory
 strassen_49_t:  64.476s compute:  49 workers:  66.776s script:  29.9% memory

 Output
    (0,0) 1.50092500906803e+15  (8191,8191) 1.84482444489628e+19

 Observation
    The strassen_49_f example consumes less memory than strassen_49_t and
    also has faster processing time. This is my favorite of the bunch. The
    same can be seen on the 32-way box.

    Compare compute time with script time. There's a larger gap when using
    threads (*_t) versus forking (*_f).


###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

 :: This system contains 2x Intel E5-2660 processors with 128GB 1600 MHz RAM.
    The OS is CentOS 6.3, Perl 5.10.1, perl-PDL-2.4.7-1 installed via yum.

 :: Performance mode was enabled prior to testing. This has little effect
    on larger matrix sizes, but is helpful for the 1024x1024 results.
    The following was run as root to disable CPU frequency scaling.

       # /etc/init.d/cpuspeed stop
       Disabling ondemand cpu frequency scaling:                  [ OK ]


 -- Results for 1024x1024 on a 32-way, 128GB box -----------------------------

 matmult_base:    2.230s compute:   1 worker:    2.361s script running time

 matmult_mce_d:   0.196s compute:  32 workers:   0.386s script
 matmult_mce_f:   0.179s compute:  32 workers:   0.398s script
 matmult_mce_t:   0.182s compute:  32 workers:   0.964s script (interesting)
 matmult_simd:    0.536s compute:  32 workers:   0.696s script

 strassen_07_f:   0.232s compute:   7 workers:   0.402s script
 strassen_07_t:   0.221s compute:   7 workers:   0.544s script
 strassen_49_f:   0.199s compute:  49 workers:   0.386s script (very nice)
 strassen_49_t:   0.206s compute:  49 workers:   1.314s script

 matmult_perl:   15.105s compute:  32 workers:  15.634s script
 strassen_perl:  38.764s compute:   7 workers:  39.063s script

 Output
    (0,0) 365967179776  (1023,1023) 563314846859776


 -- Results for 5120x5120 on a 32-way, 128GB box -----------------------------

 matmult_base:  282.508s compute:   1 worker:  282.776s script:   0.5% memory

 matmult_mce_d:  21.706s compute:  32 workers:  22.744s script:   5.9% memory
 matmult_mce_f:  20.665s compute:  32 workers:  22.235s script:   5.4% memory
 matmult_mce_t:  21.212s compute:  32 workers:  22.163s script:   0.7% memory
 matmult_simd:   22.056s compute:  32 workers:  22.407s script:   0.8% memory

 Output
    (0,0) 228997817958400  (5119,5119) 1.75944746804184e+18


 -- Results for 6144x6144 on a 32-way, 128GB box -----------------------------

 matmult_base:  486.967s compute:   1 worker:  487.299s script:   0.7% memory

 matmult_mce_d:  36.383s compute:  32 workers:  37.065s script:   8.3% memory
 matmult_mce_f:  35.898s compute:  32 workers:  38.072s script:   7.8% memory
 matmult_mce_t:  35.850s compute:  32 workers:  36.893s script:   0.9% memory
 matmult_simd:   37.121s compute:  32 workers:  37.561s script:   1.0% memory

 Output
    (0,0) 474873065373696  (6143,6143) 4.37797347894126e+18


 -- Results for 7168x7168 on a 32-way, 128GB box -----------------------------

 matmult_base:  779.615s compute:   1 worker:  780.024s script:   0.9% memory

 matmult_mce_d:  57.781s compute:  32 workers:  59.628s script:  11.2% memory
 matmult_mce_f:  56.083s compute:  32 workers:  58.997s script:  10.2% memory
 matmult_mce_t:  54.158s compute:  32 workers:  55.318s script:   1.1% memory
 matmult_simd:   59.207s compute:  32 workers:  59.755s script:   1.4% memory

 Output
    (0,0) 879791667937280  (7167,7167) 9.46237929052649e+18

 Observation
    The same pattern is seen when comparing memory utilization between
    matmult_mce_t and matmult_simd. The difference is 384 MB for a matrix
    size of 7168. MCE chunks away at 24 rows per chunk, not 224 rows.


 -- Results for 8192x8192 on a 32-way, 128GB box -----------------------------

 matmult_base: 1161.419s compute:   1 worker: 1161.883s script:   1.2% memory

 matmult_mce_d: 316.989s compute:  32 workers: 319.482s script:  14.5% memory
 matmult_mce_f: 317.054s compute:  32 workers: 320.737s script:  13.1% memory
 matmult_mce_t:  83.002s compute:  32 workers:  84.255s script:   1.4% memory
 matmult_simd:   87.355s compute:  32 workers:  88.019s script:   1.7% memory

 strassen_07_f:  59.998s compute:   7 workers:  60.670s script:   5.2% memory
 strassen_07_t:  62.259s compute:   7 workers:  63.097s script:   4.7% memory
 strassen_49_f:  29.972s compute:  49 workers:  30.663s script:   7.3% memory
 strassen_49_t:  36.392s compute:  49 workers:  38.075s script:   7.4% memory

 Output
    (0,0) 1.50092500906803e+15  (8191,8191) 1.84482444489628e+19

 Observation
    The strassen results are quite fast, with strassen_49_f even breaking
    the 30-second mark for compute time.

    However, something seems wrong when processing a matrix size of 8192
    using PDL::IO::FastRaw. Both matmult_mce_d and matmult_mce_f are seen
    taking a considerably longer time than matmult_mce_t and matmult_simd.
    This pattern was not present when processing a matrix size of 7168
    earlier. Repeated attempts show the same behavior.

    Making a copy of the "b" matrix solves the problem, but that requires
    an additional 16.8 GB of memory to run for 32 workers.

       $self->{r} = mapfraw("$tmp_dir/b")->copy;

    Using readfraw also solves the problem, although that wants a whopping
    50.4 GB of additional memory, which seems quite high.

       $self->{r} = readfraw("$tmp_dir/b");


###############################################################################
 # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################

 The Strassen algorithm can introduce rounding errors, as noted in the
 output above. Most often, this is not a problem.

 -- Mario