NAME

Runops::Optimized design

DESCRIPTION

Runops::Optimized unrolls the optree of a Perl subroutine in execution order, so that the CPU has a better chance of branch prediction and improved cache usage.

It takes a minimal approach: if it sees an op that will have unpredictable results, it simply returns to a variant of the normal perl runloop.

Eventually some small hot ops such as pp_nextstate, pp_const, etc. may be inlined.

Some people may call this JIT, but I'm of the opinion that until it has a closer understanding of what the underlying ops are doing, it is just unrolling.
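
As a rough illustration of the difference, the toy C sketch below contrasts a loop in the style of perl's runloop with an unrolled version of the same three ops. Every name in it (toy_op, cur_op, acc, pp_add, runops_loop, runops_unrolled) is a made-up stand-in, not perl's or this module's real code, and the real module emits machine code through sljit rather than C.

```c
#include <stddef.h>

/* Toy stand-ins for perl's OP and PL_op; NOT the real perl structures. */
typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*ppaddr)(void);   /* runs the op, returns the next op to run */
    toy_op  *next;
    int      value;
};

toy_op *cur_op;                /* plays the role of PL_op */
int     acc;                   /* stand-in for interpreter state */

/* A trivial op body: add this op's value and move on. */
toy_op *pp_add(void) { acc += cur_op->value; return cur_op->next; }

/* Interpreter-style loop: one indirect call through a pointer per op. */
void runops_loop(toy_op *start)
{
    cur_op = start;
    while (cur_op != NULL)
        cur_op = cur_op->ppaddr();
}

/* Roughly what unrolling a known three-op chain amounts to: direct calls
 * in execution order. If an op hands back something other than the op we
 * expected next, give up and let the ordinary loop carry on from there. */
void runops_unrolled(toy_op *op0, toy_op *op1, toy_op *op2)
{
    cur_op = op0;
    cur_op = pp_add();
    if (cur_op != op1) { runops_loop(cur_op); return; }
    cur_op = pp_add();
    if (cur_op != op2) { runops_loop(cur_op); return; }
    cur_op = pp_add();
    runops_loop(cur_op);       /* whatever follows runs in the normal loop */
}
```

Both functions leave the same state behind for a straight chain of pp_add ops; the unrolled form just replaces the per-op indirect call with a fixed sequence the CPU can predict.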

COMPONENTS

  • sljit

    Sljit is used to generate the underlying machine code. It supports the most common CPUs, which means the code isn't tied to a particular machine. It is considerably simpler than LLVM, and small enough to be shipped with this module.

    Sljit is stackless, so it doesn't make use of the normal C level stack (in the normal way, anyway). This is what makes it possible to safely return to the interpreter at any point, which makes dealing with edge cases easy.

  • Inserting code

    This is one slightly evil area. Each CV is unrolled the second time it is executed. The reason for waiting until the second time is that unrolling certain setup subroutines would be of limited value.

    Whether a CV has already been executed is recorded in the bits known as op_spare, and the result of unrolling is patched straight into op_ppaddr. Obviously this isn't ideal, and eventually this may be stored in a structure separate from the optree (potentially with a lock for threaded support).
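
The "unroll on the second execution" idea can be sketched in plain C as below. This is only a toy model under stated assumptions: toy_op, its spare and ppaddr fields, pp_entry, pp_unrolled and unroll_count are all hypothetical stand-ins for the real op structure and the generated code, and the width of the spare bit field is assumed.

```c
#include <stddef.h>

typedef struct toy_op toy_op;
struct toy_op {
    toy_op *(*ppaddr)(toy_op *self);  /* stand-in for the op's function pointer */
    toy_op  *next;
    unsigned spare : 3;               /* stand-in for the op_spare bits (width assumed) */
};

int unroll_count;                     /* how often "unrolling" has happened */

/* Placeholder for the generated code that would replace the entry op. */
toy_op *pp_unrolled(toy_op *self)
{
    return self->next;
}

/* Entry op of a subroutine, before it has been unrolled. */
toy_op *pp_entry(toy_op *self)
{
    if (self->spare == 0) {
        self->spare = 1;              /* first execution: only remember it */
        return self->next;
    }
    /* Second execution: "generate" code once and patch it into the op,
     * so later executions never come through here again. */
    unroll_count++;
    self->ppaddr = pp_unrolled;
    return self->ppaddr(self);
}
```

A subroutine that only ever runs once never pays for code generation in this scheme, which is the motivation given above for waiting.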

ISSUES / TODO

This is really only a proof of concept, so there are many issues.

  • Test other CPUs

    I've only tested this on x86_64 on OS X. This should work on anything sljit supports but needs testing.

  • Better code for following execution order

    The code for following execution order is lame (see comment in unroll.c). It can even get stuck in a loop on some branches.

  • Unroll flow-control ops

    last, next, etc. result in a return. These should be supported, but are quite complex. (next should be fairly easy though.)

  • Multiplicity and thread support

    This only works for a non-multiplicity, non-threaded build of perl. Supporting either would not be impossible, but is more work.

  • More tests, etc

    This has only received limited testing. It probably misses even important core perl ops.

    It is probably worth having author tests, e.g. export PERL5OPT=-mRunops::Optimized and then run the test suites of some large modules.

  • Custom ops

    Custom ops, and anything else that behaves unexpectedly, may present issues. Some of this is mitigated by doing the unrolling at run time, so any compile time modifications to the op tree will be picked up.

  • Inlining hot ops

    For more speed it would be interesting to inline small hot ops such as pp_nextstate and pp_const.

  • Investigate memory/CPU tradeoff

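    How much overhead does unrolling everything have for large programs? In the single measurement below, loading Moose took about 0.11 seconds longer in wall-clock time (roughly 14%) with unrolling enabled: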

      $ PERL5LIB= /usr/bin/time bleadperl -MRunops::Optimized -MMoose -e1 
            0.87 real         0.81 user         0.03 sys
      $ PERL5LIB= /usr/bin/time bleadperl -MMoose -e1                    
            0.76 real         0.72 user         0.02 sys
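
For the execution-order item in the list above, one possible way to avoid getting stuck in a loop is to mark each op as it is visited and to queue branch targets for later. The C sketch below is not the code in unroll.c; toy_op, its seen marker, walk_exec_order and the fixed-size pending stack are all hypothetical, with next and other loosely modelled on perl's op_next and op_other.

```c
#include <stddef.h>

typedef struct toy_op toy_op;
struct toy_op {
    toy_op *next;    /* normal successor, loosely like op_next */
    toy_op *other;   /* branch target, loosely like op_other; NULL if none */
    int     seen;    /* hypothetical visit marker */
    int     id;
};

/* Collects the ops reachable from start, following the fall-through path
 * first and coming back to branch targets afterwards. Each op is visited
 * at most once, so a back edge (a loop) cannot make the walk spin forever.
 * Returns the number of ops written to out (at most cap). */
size_t walk_exec_order(toy_op *start, toy_op **out, size_t cap)
{
    toy_op *pending[64];              /* branch targets still to visit */
    size_t  npending = 0, n = 0;
    toy_op *op = start;

    for (;;) {
        while (op != NULL && !op->seen && n < cap) {
            op->seen = 1;
            out[n++] = op;
            /* Targets beyond the 64-entry limit are dropped in this sketch. */
            if (op->other != NULL && npending < 64)
                pending[npending++] = op->other;
            op = op->next;
        }
        if (npending == 0 || n >= cap)
            break;
        op = pending[--npending];
    }
    return n;
}
```

The walk stops as soon as it reaches an op it has already seen, then resumes from the most recently queued branch target.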

DEBUGGING

This will break. You'll need to debug it.

First of all, compile with debugging support:

  perl Makefile.PL DEBUG=1

This does two things. First, it enables an environment variable that, when set, prints out the inner workings:

  export RUNOPS_OPTIMIZED_DEBUG=

Additionally, it generates trap instructions (int3 on IA32) that fire when PL_op isn't where the generated code expects it to be.
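
The two debugging aids can be pictured with the small C sketch below. It is only an approximation under assumptions: debug_enabled and check_expected_op are hypothetical helpers, the "set, even if empty" test is inferred from the empty export shown above, and where the real debug build executes a trap instruction this sketch just prints a message and returns 0.

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumption: debug output is on whenever the variable exists in the
 * environment, even with an empty value. */
int debug_enabled(void)
{
    return getenv("RUNOPS_OPTIMIZED_DEBUG") != NULL;
}

/* Returns 1 when the op about to run is the one the unrolled code expected
 * at this point, 0 otherwise. A real debug build would trap here instead of
 * returning, so the failure is caught in a debugger at the exact spot. */
int check_expected_op(void *current, void *expected)
{
    if (current == expected)
        return 1;
    if (debug_enabled())
        fprintf(stderr, "unexpected op: got %p, wanted %p\n", current, expected);
    return 0;
}
```

Running the program under a debugger with the variable exported then shows both the trace of what was generated and the point where execution diverged.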