###############################################################################
# * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################
:: These examples demonstrate the various ways in parallelizing matrix
multiplication with MCE. The examples make use of PDL::IO::FastRaw
and PDL::Parallel::threads. PDL::Parallel::threads was created by
David Mertens. PDL::IO::FastRaw comes pre-installed with PDL.
One can diff the examples to see the comparison between using
PDL::IO::FastRaw and PDL::Parallel::threads.
diff matmult_mce_f.pl matmult_mce_t.pl
diff strassen_07_f.pl strassen_07_t.pl
diff strassen_49_f.pl strassen_49_t.pl
:: All times below are reported in number of seconds.
Examples will attempt to use /dev/shm under Linux for the raw MMAP
files. Try the thread variant if /dev/shm is not writable for you.
Not recommended is to run the *_[df].pl examples on an Oracle server
making use of /dev/shm. Check the available size of /dev/shm before
running.
Benchmark results were captured from a 24-way and a 32-way server.
###############################################################################
# * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################
-- Usage -------------------------------------------------------------------
:: perl matmult.pl 1024 [ n_workers ] ## Default matrix size 512
## Default n_workers 8
matmult_base.pl PDL $c = $a x $b (1 worker)
matmult_mce_d.pl Uses MCE's do method to fetch (a) and store result (c)
Uses PDL::IO::FastRaw to read (b)
matmult_mce_f.pl MCE + PDL::IO::FastRaw
matmult_mce_t.pl MCE + PDL::Parallel::threads
matmult_perl.pl MCE + classic implementation in pure Perl
matmult_simd.pl Parallelization via PDL::Parallel::threads::SIMD
Script was taken from https://gist.github.com/run4flat/4942132 for
folks wanting to review, study, and compare with MCE. The script
was modified to support the optional N_threads argument. Both the
script and SIMD module were created by David Mertens.
:: perl strassen.pl 1024 ## Default matrix size 512
strassen_07_f.pl MCE divide-and-conquer 1 level
PDL::IO::FastRaw, 7 workers
strassen_07_t.pl MCE divide-and-conquer 1 level
PDL::Parallel::threads, 7 workers
strassen_49_f.pl MCE divide-and-conquer 2 levels submission using 1 MCE
PDL::IO::FastRaw, 49 workers
strassen_49_t.pl MCE divide-and-conquer 2 levels submission using 1 MCE
PDL::Parallel::threads, 49 workers
strassen_perl.pl MCE divide-and-conquer implementation in pure Perl
7 workers
:: All examples ending in *_[df].pl spawn children via fork. Matmult_simd.pl
and *_t.pl examples utilize threads otherwise.
###############################################################################
# * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################
:: The matmult_simd.pl example divides work equally among all workers.
It's possible that some workers will end up taking longer due to other
jobs running on the system when dividing the work equally. This can have
the effect of a longer ramp-down time from when the first and last
workers complete processing.
MCE, on the other hand, can break up work into smaller chunks. The chunk
size is determined by the matrix size. The step_size, for the sequence
option, is one way of enabling chunking in MCE.
my $step_size = ($tam > 2048) ? 24 : ($tam > 1024) ? 16 : 8;
Chunking combined with the bank-teller queuing model helps reduce the
ramp-down time at the end of the job. Please note that matmult_mce_[df].pl
spawns children whereas the other two spawns threads.
matmult_mce_d 5120: 65.556s compute: 24 workers: 66.621s script
matmult_mce_f 5120: 60.971s compute: 24 workers: 62.558s script
matmult_mce_t 5120: 64.673s compute: 24 workers: 65.909s script
matmult_simd 5120: 67.210s compute: 24 workers: 67.732s script
Matmult_mce_d fetches the next chunk data and submits the result
via the "do" method in MCE. Only the "b" matrix is read via
PDL::IO::FastRaw among workers. I was not expecting this example
to keep up actually.
The bank-teller queuing model is now applied for the sequence option
beginning with MCE 1.406.
$mce->run(0, {
sequence => [ 0, $rows - 1, $step_size ]
} );
:: The time measured for MCE is the actual compute time and does not
include the time to spawn, whereas matmult_simd does. Please keep
this in mind when comparing results. The script execution time is
measured as well.
e.g. time perl matmult_mce_f.pl 5120 24
MCE has a one time cost for creating a pool of workers which can be
recycled without having to spawn again. One could instantiate a MCE
instance inside a perldl session and reuse the same instance multiple
times. Hence, the one time cost diminishes with multiple use.
:: The strassen examples apply the Strassen divide-and-conquer algorithm
with modifications to recycle piddles and slices as much as possible
to maximize memory utilization.
http://en.wikipedia.org/wiki/Strassen_algorithm
Two implementations are provided, each with PDL::IO::FastRaw and
PDL::Parallel::threads (for sharing data).
One level submission, 1 MCE instance, 7 workers
Two level submission (all-at-once), 1 MCE instance, 49 workers
Amazingly, the 49 workers implementation utilizing PDL::IO::FastRaw
consumes slightly less memory than PDL::Parallel::threads.
###############################################################################
# * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################
:: This system contains 2x Intel E5649 processors with 32GB 1066 MHz RAM.
The OS is RHEL 6.3, Perl 5.10.1, perl-PDL-2.4.7-1 installed via yum.
-- Results for 1024x1024 on a 24-way, 32GB box ------------------------------
matmult_base: 2.686s compute: 1 worker: 2.894s script running time
matmult_mce_d: 0.545s compute: 24 workers: 0.852s script
matmult_mce_f: 0.479s compute: 24 workers: 0.824s script
matmult_mce_t: 0.510s compute: 24 workers: 1.473s script (interesting)
matmult_simd: 0.780s compute: 24 workers: 1.065s script
strassen_07_f: 0.385s compute: 7 workers: 0.665s script
strassen_07_t: 0.397s compute: 7 workers: 0.992s script
strassen_49_f: 0.326s compute: 49 workers: 0.617s script (very nice)
strassen_49_t: 0.353s compute: 49 workers: 1.969s script
matmult_perl: 23.471s compute: 24 workers: 24.175s script
strassen_perl: 44.685s compute: 7 workers: 45.119s script
Output
(0,0) 365967179776 (1023,1023) 563314846859776
-- Results for 2048x2048 on a 24-way, 32GB box ------------------------------
matmult_base: 21.521s compute: 1 worker: 21.783s script: 0.3% memory
matmult_mce_d: 4.206s compute: 24 workers: 4.528s script: 3.5% memory
matmult_mce_f: 3.483s compute: 24 workers: 4.017s script: 3.1% memory
matmult_mce_t: 4.113s compute: 24 workers: 5.191s script: 0.9% memory
matmult_simd: 4.617s compute: 24 workers: 4.901s script: 0.8% memory
strassen_07_f: 1.951s compute: 7 workers: 2.249s script: 1.5% memory
strassen_07_t: 1.934s compute: 7 workers: 2.576s script: 1.4% memory
strassen_49_f: 1.486s compute: 49 workers: 1.823s script: 2.5% memory
strassen_49_t: 1.494s compute: 49 workers: 3.192s script: 2.7% memory
matmult_perl: 185.343s compute: 24 workers: 187.698s script: 9.7% memory
strassen_perl: 319.708s compute: 7 workers: 320.969s script: 8.6% memory
Output
(0,0) 5859767746560 (2047,2047) 1.80202496872953e+16 matmul examples
(0,0) 5859767746560 (2047,2047) 1.8020249687295e+16 strassen examples
-- Results for 4096x4096 on a 24-way, 32GB box ------------------------------
matmult_base: 172.145s compute: 1 worker: 172.145s script: 1.2% memory
matmult_mce_d: 34.954s compute: 24 workers: 35.717s script: 12.0% memory
matmult_mce_f: 36.457s compute: 24 workers: 37.336s script: 10.8% memory
matmult_mce_t: 32.565s compute: 24 workers: 33.723s script: 1.8% memory
matmult_simd: 34.161s compute: 24 workers: 34.614s script: 2.0% memory
strassen_07_f: 12.701s compute: 7 workers: 13.186s script: 5.5% memory
strassen_07_t: 12.964s compute: 7 workers: 13.671s script: 4.8% memory
strassen_49_f: 8.867s compute: 49 workers: 9.338s script: 7.4% memory
strassen_49_t: 8.918s compute: 49 workers: 10.761s script: 7.9% memory
Output
(0,0) 93790635294720 (4095,4095) 5.76554474219245e+17 matmul examples
(0,0) 93790635294720 (4095,4095) 5.76554474219244e+17 strassen example
-- Results for 5120x5120 on a 24-way, 32GB box ------------------------------
matmult_base: 336.464s compute: 1 worker: 336.867s script: 1.9% memory
matmult_mce_d: 65.556s compute: 24 workers: 66.621s script: 18.2% memory
matmult_mce_f: 60.971s compute: 24 workers: 62.558s script: 16.1% memory
matmult_mce_t: 64.673s compute: 24 workers: 65.909s script: 2.5% memory
matmult_simd: 67.210s compute: 24 workers: 67.732s script: 2.9% memory
Output
(0,0) 228997817958400 (5119,5119) 1.75944746804184e+18
Observation
The difference with memory utilization between matmult_mce_t and
matmult_simd is largely due to MCE working on smaller chunks (24 rows).
MCE would also utilize 2.9% had work been divided equally among the
number of workers (214 rows -- one chunk).
-- Results for 8192x8192 on a 24-way, 32GB box ------------------------------
matmult_base: 1388.241s compute: 1 worker: 1388.888s script: 4.8% memory
strassen_07_f: 83.562s compute: 7 workers: 84.397s script: 21.3% memory
strassen_07_t: 85.366s compute: 7 workers: 86.559s script: 18.8% memory
strassen_49_f: 57.019s compute: 49 workers: 57.893s script: 29.5% memory
strassen_49_t: 64.476s compute: 49 workers: 66.776s script: 29.9% memory
Output
(0,0) 1.50092500906803e+15 (8191,8191) 1.84482444489628e+19
Observation
Strassen_49_f consumes less memory than strassen_49_t including having
faster processing time. This is my favorite of the bunch. The same can
be seen on the 32-way box.
Compare compute time with script time. There's a larger gap when using
threads (*_t) versus forking (*_f).
###############################################################################
# * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################
:: This system contains 2x Intel E5-2660 processors with 128GB 1600 MHz RAM.
The OS is CentOS 6.3, Perl 5.10.1, perl-PDL-2.4.7-1 installed via yum.
:: Performance mode was enabled prior to testing. This has little effect on
larger matrix sizes though, but is helpful for 1024x1024 results.
The following was run as root to disable CPU frequency scaling.
# /etc/init.d/cpuspeed stop
Disabling ondemand cpu frequency scaling: [ OK ]
-- Results for 1024x1024 on a 32-way, 128GB box -----------------------------
matmult_base: 2.230s compute: 1 worker: 2.361s script running time
matmult_mce_d: 0.196s compute: 32 workers: 0.386s script
matmult_mce_f: 0.179s compute: 32 workers: 0.398s script
matmult_mce_t: 0.182s compute: 32 workers: 0.964s script (interesting)
matmult_simd: 0.536s compute: 32 workers: 0.696s script
strassen_07_f: 0.232s compute: 7 workers: 0.402s script
strassen_07_t: 0.221s compute: 7 workers: 0.544s script
strassen_49_f: 0.199s compute: 49 workers: 0.386s script (very nice)
strassen_49_t: 0.206s compute: 49 workers: 1.314s script
matmult_perl: 15.105s compute: 32 workers: 15.634s script
strassen_perl: 38.764s compute: 7 workers: 39.063s script
Output
(0,0) 365967179776 (1023,1023) 563314846859776
-- Results for 5120x5120 on a 32-way, 128GB box -----------------------------
matmult_base: 282.508s compute: 1 worker: 282.776s script: 0.5% memory
matmult_mce_d: 21.706s compute: 32 workers: 22.744s script: 5.9% memory
matmult_mce_f: 20.665s compute: 32 workers: 22.235s script: 5.4% memory
matmult_mce_t: 21.212s compute: 32 workers: 22.163s script: 0.7% memory
matmult_simd: 22.056s compute: 32 workers: 22.407s script: 0.8% memory
Output
(0,0) 228997817958400 (5119,5119) 1.75944746804184e+18
-- Results for 6144x6144 on a 32-way, 128GB box -----------------------------
matmult_base: 486.967s compute: 1 worker: 487.299s script: 0.7% memory
matmult_mce_d: 36.383s compute: 32 workers: 37.065s script: 8.3% memory
matmult_mce_f: 35.898s compute: 32 workers: 38.072s script: 7.8% memory
matmult_mce_t: 35.850s compute: 32 workers: 36.893s script: 0.9% memory
matmult_simd: 37.121s compute: 32 workers: 37.561s script: 1.0% memory
Output
(0,0) 474873065373696 (6143,6143) 4.37797347894126e+18
-- Results for 7168x7168 on a 32-way, 128GB box -----------------------------
matmult_base: 779.615s compute: 1 worker: 780.024s script: 0.9% memory
matmult_mce_d: 57.781s compute: 32 workers: 59.628s script: 11.2% memory
matmult_mce_f: 56.083s compute: 32 workers: 58.997s script: 10.2% memory
matmult_mce_t: 54.158s compute: 32 workers: 55.318s script: 1.1% memory
matmult_simd: 59.207s compute: 32 workers: 59.755s script: 1.4% memory
Output
(0,0) 879791667937280 (7167,7167) 9.46237929052649e+18
Observation
The same pattern is seen when comparing memory utilization between
matmult_mce_t and matmult_simd. The difference is 384MB for a matrix
size of 7168. MCE chunks away at 24 rows per each chunk, not 224 rows.
-- Results for 8192x8192 on a 32-way, 128GB box -----------------------------
matmult_base: 1161.419s compute: 1 worker: 1161.883s script: 1.2% memory
matmult_mce_d: 316.989s compute: 32 workers: 319.482s script: 14.5% memory
matmult_mce_f: 317.054s compute: 32 workers: 320.737s script: 13.1% memory
matmult_mce_t: 83.002s compute: 32 workers: 84.255s script: 1.4% memory
matmult_simd: 87.355s compute: 32 workers: 88.019s script: 1.7% memory
strassen_07_f: 59.998s compute: 7 workers: 60.670s script: 5.2% memory
strassen_07_t: 62.259s compute: 7 workers: 63.097s script: 4.7% memory
strassen_49_f: 29.972s compute: 49 workers: 30.663s script: 7.3% memory
strassen_49_t: 36.392s compute: 49 workers: 38.075s script: 7.4% memory
Output
(0,0) 1.50092500906803e+15 (8191,8191) 1.84482444489628e+19
Observation
The strassen results are quite fast and breaking 30 seconds for compute
time -- even with strassen_49_f.
However, something seems wrong when processing a matrix size of 8192 using
PDL::IO::FastRaw. Both matmult_mce_d and matmult_mce_f are seen taking
considerable amount of time over matmult_mce_t and matmult_simd. This
pattern was not present when processing a matrix size of 7168 earlier.
Repeated attempts show the same behavior.
Making a copy of the "b" matrix solves the problem, but that requires
an additional 16.8 GB of memory to run for 32 workers.
$self->{r} = mapfraw("$tmp_dir/b")->copy;
Readfraw also solves the problem. Although, that wants a whopping 50.4 GB
of additional memory which seems quite high.
$self->{r} = readfraw("$tmp_dir/b");
###############################################################################
# * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * # * #
###############################################################################
The Strassen algorithm can introduce rounding errors noted above in the
output. Most often, it may not be a problem.
-- Mario