Index: docs/IPRA.rst
===================================================================
--- /dev/null
+++ docs/IPRA.rst
@@ -0,0 +1,289 @@
+===========================================
+Interprocedural Register Allocation in LLVM
+===========================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+This document is meant to be a reference for people who want to understand how
+interprocedural register allocation (IPRA from now on) is implemented in LLVM.
+It provides the theoretical background for IPRA, describes how it is
+implemented inside LLVM, and explains its effect by walking through some
+examples. It is assumed that the reader is already familiar with the LLVM pass
+infrastructure and code generation. The basic concepts of intraprocedural
+register allocation, calling conventions, and X86 assembly are also required
+to understand the examples provided.
+
+Background
+==========
+In the literature there are a few techniques for achieving interprocedural
+register allocation. These techniques are mainly divided into link-time and
+compile-time optimizations. We have implemented a compile-time technique very
+similar to the techniques described in `Steen 87`_ and `Chow 88`_. In this
+technique IPRA tries to improve register allocation by propagating register
+usage information through the program call graph as each procedure is
+compiled.
+
+The procedures are compiled in a bottom-up order of the call graph, beginning
+with the leaf nodes, so register usage information can be propagated from
+callee to caller. LLVM's intraprocedural register allocators can use this
+information at each call site to avoid assigning registers already used in the
+callee functions, which minimizes spill code. In general this results in an
+improvement, but it may be less effective for programs whose execution is
+concentrated in the upper regions of the call graph.
+
+Implementation
+==============
+We have added an option **-enable-ipra** to control this optimization. It can
+be used as ``-enable-ipra`` with ``llc``, ``-mllvm -enable-ipra`` with
+``clang``, and ``-Wl,-mllvm,-enable-ipra`` when LTO is enabled. To print
+clobbered register information, use **-print-regusage**.
+
+The code generation phase in LLVM operates on MachineInstr (MI), an
+intermediate representation which is very close to machine assembly. In MI, a
+call instruction carries a register mask (regmask) as one of its operands.
+
+For example, in MI a call instruction on the X86 architecture looks like the
+following:
+
+::
+
+  CALL64pcrel32 <ga:@decode>, <regmask %RBX %RBP %R12 %R13 %R14 %R15 ...>, %RSP<imp-use>, %EDI<imp-use>, %RSP<imp-def>
+
+The regmask is generated based on the calling convention (CC) for the given
+function. The calling convention describes the contract between callee and
+caller functions so that both of them can use registers in a way that does not
+break on function call and return, so the regmask describes the registers
+which are guaranteed to be preserved across the call. From the above example
+the register allocator can deduce that the ``decode`` function will preserve
+the ``RBX``, ``R13``, etc. registers. However, this information is based on a
+static decision made by following a specific CC; it may be the case that
+``decode`` does not actually use other registers either, for example ``R8``,
+``R9``, etc. Having that information in the register mask gives the register
+allocator more free registers to use, and this is exactly what the
+interprocedural register allocator tries to do.
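+
+The way a regmask operand is interpreted can be illustrated with a minimal,
+self-contained sketch (plain C++, independent of LLVM; inside LLVM the
+analogous query is ``MachineOperand::clobbersPhysReg``):
+
+.. code-block:: c++
+
+  #include <cstdint>
+
+  // Each bit of a regmask corresponds to one physical register number: a
+  // set bit means the register is preserved across the call, a cleared bit
+  // means the callee may clobber it. Registers are packed 32 per 32-bit
+  // word, so we first locate the word and then the bit within it.
+  static bool isRegPreserved(const uint32_t *RegMask, unsigned PReg) {
+    return RegMask[PReg / 32] & (1u << (PReg % 32));
+  }
+
+When ``isRegPreserved`` returns false for a register that is live across a
+call, the register allocator has to spill or shuffle that value; every
+additional preserved bit is therefore one more register it can keep live
+through the call site.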
+
+To determine actual register usage, the procedure must already have been code
+generated and register allocation must have been executed. This also implies
+that every callee must be code generated before its callers, which is why IPRA
+requires code generation to follow a bottom-up order of the call graph. In
+total, four passes are added to LLVM:
+
+#. DummyCGSCCPass - To change the code generation order
+#. PhysicalRegisterUsageInfo - An immutable pass to store register usage details
+#. RegUsageInfoCollector - To collect actual register usage post register allocation
+#. RegUsageInfoPropagationPass - To propagate register usage info to call sites
+
+Each of these passes is explained in detail below.
+
+Change CodeGen Order
+--------------------
+Other interprocedural optimizations also require the code generation order to
+be bottom-up on the call graph, and LLVM provides the CallGraphSCC (strongly
+connected component) pass infrastructure for this purpose. We added a
+DummyCGSCCPass which does not actually do anything, but causes the
+FunctionPassManager to be wrapped inside a CallGraphSCCPassManager.
+TargetPassConfig configures the target-independent passes to be used by the
+LLVM backend; when the EnableIPRA target option is present, we add the
+DummyCGSCCPass to the pass manager.
+
+Store Register Usage Details
+----------------------------
+The scope of the current implementation is limited to a module. To be able to
+propagate register usage info from callee to caller, we need to retain the
+details until all functions defined in the module have been compiled. To do
+this we use an immutable pass; it does not need to implement a run method
+because an immutable pass is never run, yet it is not destroyed until the
+whole module has been compiled. This pass holds a map from ``Function *`` to
+regmask. The collection pass puts entries into the map, and the propagation
+pass queries the result at call sites. A mask bit value of 0 means the
+register will be clobbered by the function, and 1 means the register will be
+preserved across the function call.
+
+Register Usage Collection
+-------------------------
+The RegUsageInfoCollector pass runs post register allocation; it builds a
+regmask for the given function and stores it in PhysicalRegisterUsageInfo.
+This implies that PhysicalRegisterUsageInfo must be initialized before the
+collection pass runs. A few important points in the code are worth mentioning.
+
+The regmask is initialized with all bits set, and each bit represents a
+physical register on the given target.
+
+.. code-block:: c++
+
+  std::vector<uint32_t> RegMask;
+  // Compute the size of the bit vector to represent all the registers.
+  // The bit vector is broken into 32-bit chunks, thus takes the ceil of
+  // the number of registers divided by 32 for the size.
+  unsigned RegMaskSize = (TRI->getNumRegs() + 31) / 32;
+  RegMask.resize(RegMaskSize, 0xFFFFFFFF);
+
+The following loop is the crux of the register collection pass. It iterates
+through all physical registers on the given target and checks whether each
+register is really used by the function. If a register is used, it is marked
+as clobbered.
+
+.. code-block:: c++
+
+  for (unsigned PReg = 1, PRegE = TRI->getNumRegs(); PReg < PRegE; ++PReg)
+    if (MRI->isPhysRegModified(PReg, true))
+      RegMask[PReg / 32] &= ~(1u << PReg % 32);
+
+The time complexity of the above loop is O(number of registers on the
+target), including alias registers on X86-like architectures.
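+
+The packing and clearing arithmetic can be checked with a small standalone
+program (plain C++, independent of LLVM; the register numbers below are made
+up for the example):
+
+.. code-block:: c++
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <initializer_list>
+  #include <vector>
+
+  int main() {
+    const unsigned NumRegs = 64; // hypothetical target
+    // All bits set: every register starts out as "preserved".
+    std::vector<uint32_t> RegMask((NumRegs + 31) / 32, 0xFFFFFFFF);
+
+    // Pretend the function modified registers 3 and 40: clear their bits.
+    for (unsigned PReg : {3u, 40u})
+      RegMask[PReg / 32] &= ~(1u << (PReg % 32));
+
+    // Read the mask back: a cleared bit means "clobbered".
+    for (unsigned PReg : {3u, 4u, 40u})
+      std::printf("reg %u: %s\n", PReg,
+                  (RegMask[PReg / 32] & (1u << (PReg % 32))) ? "preserved"
+                                                             : "clobbered");
+  }
+
+Running this prints ``reg 3: clobbered``, ``reg 4: preserved``, and
+``reg 40: clobbered``, matching the mask semantics described above.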
+
+However, this loop does not consider the calling convention of the function,
+so we need to set the bits of the registers which are preserved due to the
+calling convention, as shown in the code below. Finally, the register mask is
+stored in PhysicalRegisterUsageInfo.
+
+.. code-block:: c++
+
+  const uint32_t *CallPreservedMask =
+      TRI->getCallPreservedMask(MF, F->getCallingConv());
+  // Set callee saved register as preserved.
+  for (unsigned i = 0; i < RegMaskSize; ++i)
+    RegMask[i] = RegMask[i] | CallPreservedMask[i];
+
+This pass is scheduled at the very end of the target-independent code
+generation pipeline.
+
+Register Usage Propagation
+--------------------------
+The register usage propagation pass iterates through the MachineInstrs in a
+given MachineFunction and, at each call site, queries PhysicalRegisterUsageInfo
+for the regmask of the callee function; if a regmask is available, it updates
+the call instruction. This updated regmask will be used by the intraprocedural
+register allocator. This pass must be scheduled before register allocation.
+
+More Optimization
+=================
+With IPRA, register allocation across a call site is expected not to use the
+callee's clobbered registers in the caller; however, there can be exceptions
+in which not enough registers are available, so the caller may still use the
+same registers as the callee. Optimistically, though, when caller and callee
+use different sets of registers, the callee can be optimized by not
+saving/restoring registers as the calling convention would require. If the
+caller is going to use a callee-saved register, it becomes the caller's
+responsibility to save/restore the contents of that register around the call
+site. To get this optimization we do not set the preserved bits according to
+the calling convention while calculating register usage, and we return null
+for the callee-saved registers (TargetFrameLoweringImpl.cpp) when queried by
+the prologue/epilogue inserter pass. The following code snippet shows the
+implementation:
+
+.. code-block:: c++
+
+  if (!TargetFrameLowering::isSafeForNoCSROpt(F)) {
+    const uint32_t *CallPreservedMask =
+        TRI->getCallPreservedMask(MF, F->getCallingConv());
+    // Set callee saved register as preserved.
+    for (unsigned i = 0; i < RegMaskSize; ++i)
+      RegMask[i] = RegMask[i] | CallPreservedMask[i];
+  } else {
+    ++NumCSROpt;
+    DEBUG(dbgs() << MF.getName()
+                 << " function optimized for not having CSR.\n");
+  }
+
+This optimization is applied only to functions with the following properties:
+internal linkage, not recursive, no indirect calls, and not optimized as a
+tail call. For functions which are recursive, indirectly called, or tail-call
+optimized, setting the callee-saved register set to null would produce
+incorrect results or runtime failures, because parameters and return values
+could not be passed correctly. The optimization also breaks the normal calling
+convention, and since its scope is limited to a compilation unit, callers in
+other compilation units would not save/restore registers around their call
+sites. This is why we only apply the optimization to functions which have
+internal linkage (i.e., not visible outside the current compilation unit).
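+
+For exposition, the eligibility test can be paraphrased as the following
+sketch (a paraphrase of the rules above, not the exact LLVM source; see
+TargetFrameLowering::isSafeForNoCSROpt for the real check):
+
+.. code-block:: c++
+
+  #include "llvm/IR/Function.h"
+  #include "llvm/IR/Instructions.h"
+  using namespace llvm;
+
+  // The no-CSR optimization is safe only for functions with local linkage
+  // whose address is never taken (so no indirect calls), which are known
+  // to be non-recursive, and which are never tail-called.
+  static bool isSafeForNoCSROptSketch(const Function &F) {
+    if (!F.hasLocalLinkage() || F.hasAddressTaken() ||
+        !F.hasFnAttribute(Attribute::NoRecurse))
+      return false;
+    // Conservatively reject the function if any user is a tail call.
+    for (const User *U : F.users())
+      if (const auto *CI = dyn_cast<CallInst>(U))
+        if (CI->isTailCall())
+          return false;
+    return true;
+  }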
+
+Example
+=======
+To explain the effect of this optimization, we use the adpcm.c program from
+the SNU Real Time Benchmarks. In the source code of ``adpcm.c`` there is a
+function ``quantl(int el, int detl)`` which calls the function ``my_abs(int
+n)``. Consider the following X86 assembly code, generated with and without
+IPRA. In the no-IPRA case the register allocator uses the regmask dictated by
+the calling convention, so the value of ``%edi`` is saved in ``%r14d``
+(``r14`` is callee-saved per the X86 calling convention) so that the value is
+preserved across the call to ``_abs``. However, ``_abs`` does not really use
+the ``R8`` register; IPRA collects this register usage and propagates it to
+the call site of ``_abs``, so with IPRA the register allocator is able to use
+``%r8d`` instead of ``%r14d``, and since ``r8`` is not callee-saved under the
+normal calling convention, ``_quantl`` does not need to preserve it.
+Similarly, ``%rsi`` is not callee-saved under the normal CC, so without IPRA
+its value is saved in ``%rbx`` (callee-saved), whereas with IPRA the register
+allocator knows that ``%esi`` is not clobbered by ``_abs`` and does not need
+to save/restore it. In total, IPRA saves two save/restore pairs for
+``_quantl``.
+
+.. literalinclude:: exampleipra.s
+   :diff: examplenoipra.s
+
+Results
+=======
+We have applied IPRA along with -O3, and with LTO for the multi-source
+applications, and measured the compile-time, execution-time, and code size
+changes.
+
+=========== =================== ===================== ======================
+Benchmark   Compile Time Impact Execution Time Impact Executable Size Impact
+=========== =================== ===================== ======================
+SPASS       0.94%               -0.35%                -0.0039%
+D (Parser)  0.36%               -2.94%                -0.063%
+oggenc      0.007%              -1.32%                -0.013%
+sphereflake 0.93%               -0.85%                -0.23%
+=========== =================== ===================== ======================
+
+============ ======================
+Application  Executable Size Impact
+============ ======================
+git          -0.15%
+LLVM + Clang -0.027%
+============ ======================
+
+Code
+====
+`Patches`_
+
+`Bug Reports`_
+
+`Discussion`_
+
+`Related Mails`_
+
+.. _`Patches`: https://reviews.llvm.org/differential/query/VPqsLAiTnbET/
+
+.. _`Bug Reports`: https://goo.gl/vJ3m5F
+
+.. _`Discussion`: https://goo.gl/k9iJdG
+
+.. _`Related Mails`: https://goo.gl/i4WjXJ
+
+Conclusions
+===========
+A very basic module-level interprocedural register allocation has been
+implemented. This optimization can help reduce code size and improve execution
+time, and with LTO it may have an even better effect. However, we have also
+observed performance regressions where libraries are heavily used, so the best
+way to decide whether this helps is to try out the optimization. We are also
+looking to improve some heuristics for spill code motion in the near future.
+There are other methods, such as `Wall 86`_ and `Kurlander and Fischer 96`_,
+that implement the same optimization with a different scope. It would be
+desirable to implement some of them in LLVM and compare the performance gains.
+
+.. _`Wall 86`: http://dl.acm.org/citation.cfm?id=989415
+
+.. _`Kurlander and Fischer 96`: http://dl.acm.org/citation.cfm?id=237780
+
+Acknowledgements
+================
+We would like to thank all the LLVM community members for their immense help
+during this development. We would also like to thank Google Summer of Code
+for funding this work.
+
+References
+==========
+`Chow 88`_
+
+.. _`Chow 88`: http://dl.acm.org/citation.cfm?id=53999
+
+`Steen 87`_
+
+.. _`Steen 87`: http://dl.acm.org/citation.cfm?id=59289
Index: docs/exampleipra.s
===================================================================
--- /dev/null
+++ docs/exampleipra.s
@@ -0,0 +1,42 @@
+_quantl:
+        .cfi_startproc
+        pushq %rbp
+Ltmp18:
+        .cfi_def_cfa_offset 16
+Ltmp19:
+        .cfi_offset %rbp, -16
+        movq %rsp, %rbp
+Ltmp20:
+        .cfi_def_cfa_register %rbp
+        movl %edi, %r8d
+        callq _abs
+        movslq %eax, %rdx
+        xorl %eax, %eax
+        leaq _decis_levl(%rip), %r9
+        movslq %esi, %rsi
+        xorl %ecx, %ecx
+        jmp LBB6_1
+        .p2align 4, 0x90
+LBB6_3:
+        incl %ecx
+        addq $4, %rax
+LBB6_1:
+        cmpl $29, %ecx
+        jg LBB6_4
+        movslq (%rax,%r9), %rdi
+        imulq %rsi, %rdi
+        sarq $15, %rdi
+        cmpq %rdi, %rdx
+        jg LBB6_3
+LBB6_4:
+        testl %r8d, %r8d
+        js LBB6_6
+        leaq _quant26bt_pos(%rip), %rcx
+        jmp LBB6_7
+LBB6_6:
+        leaq _quant26bt_neg(%rip), %rcx
+LBB6_7:
+        movl (%rax,%rcx), %eax
+        popq %rbp
+        retq
+        .cfi_endproc
Index: docs/examplenoipra.s
===================================================================
--- /dev/null
+++ docs/examplenoipra.s
@@ -0,0 +1,51 @@
+_quantl:
+        .cfi_startproc
+        pushq %rbp
+Ltmp18:
+        .cfi_def_cfa_offset 16
+Ltmp19:
+        .cfi_offset %rbp, -16
+        movq %rsp, %rbp
+Ltmp20:
+        .cfi_def_cfa_register %rbp
+        pushq %r14
+        pushq %rbx
+Ltmp22:
+        .cfi_offset %rbx, -32
+Ltmp23:
+        .cfi_offset %r14, -24
+        movl %esi, %ebx
+        movl %edi, %r14d
+        callq _abs
+        movslq %eax, %rcx
+        xorl %eax, %eax
+        leaq _decis_levl(%rip), %rdx
+        movslq %ebx, %rsi
+        xorl %edi, %edi
+        jmp LBB6_1
+        .p2align 4, 0x90
+LBB6_3:
+        incl %edi
+        addq $4, %rax
+LBB6_1:
+        cmpl $29, %edi
+        jg LBB6_4
+        movslq (%rax,%rdx), %rbx
+        imulq %rsi, %rbx
+        sarq $15, %rbx
+        cmpq %rbx, %rcx
+        jg LBB6_3
+LBB6_4:
+        testl %r14d, %r14d
+        js LBB6_6
+        leaq _quant26bt_pos(%rip), %rcx
+        jmp LBB6_7
+LBB6_6:
+        leaq _quant26bt_neg(%rip), %rcx
+LBB6_7:
+        movl (%rax,%rcx), %eax
+        popq %rbx
+        popq %r14
+        popq %rbp
+        retq
+        .cfi_endproc
Index: docs/index.rst
===================================================================
--- docs/index.rst
+++ docs/index.rst
@@ -241,6 +241,7 @@
    Bugpoint
    CodeGenerator
    ExceptionHandling
+   IPRA
    LinkTimeOptimization
    SegmentedStacks
    TableGenFundamentals
@@ -306,6 +307,10 @@
    This document describes the design and implementation of exception
    handling in LLVM.
 
+:doc:`IPRA`
+   This document describes the implementation of interprocedural register
+   allocation in LLVM.
+
 :doc:`Bugpoint`
    Automatic bug finder and test-case reducer description and usage
    information.