Index: docs/IPRA.rst
===================================================================
--- /dev/null
+++ docs/IPRA.rst
@@ -0,0 +1,289 @@
+===========================================
+Interprocedural Register Allocation in LLVM
+===========================================
+
+.. contents::
+   :local:
+
+Introduction
+============
+This document is meant to be a reference for people who want to understand how
+interprocedural register allocation (IPRA from now on) is implemented in LLVM.
+It provides the theoretical background for IPRA, describes how it is
+implemented inside LLVM, and explains its effect by walking through some
+examples. It is assumed that the reader is already familiar with the LLVM pass
+infrastructure and code generation. The basic concepts of intraprocedural
+register allocation, calling conventions, and X86 assembly are also required
+to understand the examples provided.
+
+Background
+==========
+In the literature there are a few techniques for achieving interprocedural
+register allocation. These techniques are mainly divided into link-time and
+compile-time optimizations. We have implemented a compile-time technique very
+similar to the techniques described in `Steen 87`_ and `Chow 88`_. In this
+technique IPRA tries to improve register allocation by propagating register
+usage information through the program call graph as each procedure is
+compiled.
+
+The procedures are compiled in a bottom-up order of the call graph, beginning
+with the leaf nodes, so register usage information can be propagated from
+callee to caller. LLVM's intraprocedural register allocators can use this
+information at each call site to avoid assigning registers already used in the
+callee functions, which minimizes spill code. In general this results in an
+improvement, but it may be less effective for programs whose execution is
+concentrated in the upper regions of the call graph.
+
+Implementation
+==============
+We have added an option **-enable-ipra** to control this optimization. It can
+be used as ``-enable-ipra`` with ``llc``, ``-mllvm -enable-ipra`` with
+``clang``, and ``-Wl,-mllvm,-enable-ipra`` when LTO is enabled. To print
+clobbered register information, use **-print-regusage**.
+
+The code generation phase in LLVM operates on MachineInstr (MI), an
+intermediate representation which is very close to machine assembly. In MI, a
+call instruction carries a register mask (regmask) as one of its operands.
+
+For example, in MI a call instruction on the X86 architecture looks like the
+following:
+
+::
+
+  CALL64pcrel32 <ga:@decode>, <regmask %RBX %RBP %R12 %R13 %R14 %R15 ...>, %RSP<imp-use>, %EDI<imp-use>, %RSP<imp-def>
+
+The regmask is generated based on the calling convention (CC) for the given
+function. The calling convention describes the contract between callee and
+caller functions so that both of them can use registers in a way that does not
+break on function call and return, so the regmask describes the registers
+which are guaranteed to be preserved across the call. From the above example
+the register allocator can deduce that the ``decode`` function will preserve
+the ``RBX``, ``R13``, etc. registers. However, this information is based on a
+static decision made by following a specific CC; it may be the case that
+``decode`` does not actually use other registers either, for example ``R8``,
+``R9``, etc. Having that information in the register mask gives the register
+allocator more free registers to use, and this is exactly what the
+interprocedural register allocator tries to do.
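+
+The way a regmask operand is interpreted can be illustrated with a minimal,
+self-contained sketch (plain C++, independent of LLVM; inside LLVM the
+analogous query is ``MachineOperand::clobbersPhysReg``):
+
+.. code-block:: c++
+
+  #include <cstdint>
+
+  // Each bit of a regmask corresponds to one physical register number: a
+  // set bit means the register is preserved across the call, a cleared bit
+  // means the callee may clobber it. Registers are packed 32 per 32-bit
+  // word, so we first locate the word and then the bit within it.
+  static bool isRegPreserved(const uint32_t *RegMask, unsigned PReg) {
+    return RegMask[PReg / 32] & (1u << (PReg % 32));
+  }
+
+When ``isRegPreserved`` returns false for a register that is live across a
+call, the register allocator has to spill or shuffle that value; every
+additional preserved bit is therefore one more register it can keep live
+through the call site.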
+
+To determine actual register usage, the procedure must already have been code
+generated and register allocation must have been executed. This also implies
+that every callee must be code generated before its callers, which is why IPRA
+requires code generation to follow a bottom-up order of the call graph. In
+total, four passes are added to LLVM:
+
+#. DummyCGSCCPass - To change the code generation order
+#. PhysicalRegisterUsageInfo - An immutable pass to store register usage details
+#. RegUsageInfoCollector - To collect actual register usage post register allocation
+#. RegUsageInfoPropagationPass - To propagate register usage info to call sites
+
+Each of these passes is explained in detail below.
+
+Change CodeGen Order
+--------------------
+Other interprocedural optimizations also require the code generation order to
+be bottom-up on the call graph, and LLVM provides the CallGraphSCC (strongly
+connected component) pass infrastructure for this purpose. We added a
+DummyCGSCCPass which does not actually do anything, but causes the
+FunctionPassManager to be wrapped inside a CallGraphSCCPassManager.
+TargetPassConfig configures the target-independent passes to be used by the
+LLVM backend; when the EnableIPRA target option is present, we add the
+DummyCGSCCPass to the pass manager.
+
+Store Register Usage Details
+----------------------------
+The scope of the current implementation is limited to a module. To be able to
+propagate register usage info from callee to caller, we need to retain the
+details until all functions defined in the module have been compiled. To do
+this we use an immutable pass; it does not need to implement a run method
+because an immutable pass is never run, yet it is not destroyed until the
+whole module has been compiled. This pass holds a map from ``Function *`` to
+regmask. The collection pass puts entries into the map, and the propagation
+pass queries the result at call sites. A mask bit value of 0 means the
+register will be clobbered by the function, and 1 means the register will be
+preserved across the function call.
+
+Register Usage Collection
+-------------------------
+The RegUsageInfoCollector pass runs post register allocation; it builds a
+regmask for the given function and stores it in PhysicalRegisterUsageInfo.
+This implies that PhysicalRegisterUsageInfo must be initialized before the
+collection pass runs. A few important points in the code are worth mentioning.
+
+The regmask is initialized with all bits set, and each bit represents a
+physical register on the given target.
+
+.. code-block:: c++
+
+  std::vector<uint32_t> RegMask;
+  // Compute the size of the bit vector to represent all the registers.
+  // The bit vector is broken into 32-bit chunks, thus takes the ceil of
+  // the number of registers divided by 32 for the size.
+  unsigned RegMaskSize = (TRI->getNumRegs() + 31) / 32;
+  RegMask.resize(RegMaskSize, 0xFFFFFFFF);
+
+The following loop is the crux of the register collection pass. It iterates
+through all physical registers on the given target and checks whether each
+register is really used by the function. If a register is used, it is marked
+as clobbered.
+
+.. code-block:: c++
+
+  for (unsigned PReg = 1, PRegE = TRI->getNumRegs(); PReg < PRegE; ++PReg)
+    if (MRI->isPhysRegModified(PReg, true))
+      RegMask[PReg / 32] &= ~(1u << PReg % 32);
+
+The time complexity of the above loop is O(number of registers on the
+target), including alias registers on X86-like architectures.
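+
+The packing and clearing arithmetic can be checked with a small standalone
+program (plain C++, independent of LLVM; the register numbers below are made
+up for the example):
+
+.. code-block:: c++
+
+  #include <cstdint>
+  #include <cstdio>
+  #include <initializer_list>
+  #include <vector>
+
+  int main() {
+    const unsigned NumRegs = 64; // hypothetical target
+    // All bits set: every register starts out as "preserved".
+    std::vector<uint32_t> RegMask((NumRegs + 31) / 32, 0xFFFFFFFF);
+
+    // Pretend the function modified registers 3 and 40: clear their bits.
+    for (unsigned PReg : {3u, 40u})
+      RegMask[PReg / 32] &= ~(1u << (PReg % 32));
+
+    // Read the mask back: a cleared bit means "clobbered".
+    for (unsigned PReg : {3u, 4u, 40u})
+      std::printf("reg %u: %s\n", PReg,
+                  (RegMask[PReg / 32] & (1u << (PReg % 32))) ? "preserved"
+                                                             : "clobbered");
+  }
+
+Running this prints ``reg 3: clobbered``, ``reg 4: preserved``, and
+``reg 40: clobbered``, matching the mask semantics described above.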
+
+However, this loop does not consider the calling convention of the function,
+so we need to set the bits of the registers which are preserved due to the
+calling convention, as shown in the code below. Finally, the register mask is
+stored in PhysicalRegisterUsageInfo.
+
+.. code-block:: c++
+
+  const uint32_t *CallPreservedMask =
+      TRI->getCallPreservedMask(MF, F->getCallingConv());
+  // Set callee saved register as preserved.
+  for (unsigned i = 0; i < RegMaskSize; ++i)
+    RegMask[i] = RegMask[i] | CallPreservedMask[i];
+
+This pass is scheduled at the very end of the target-independent code
+generation pipeline.
+
+Register Usage Propagation
+--------------------------
+The register usage propagation pass iterates through the MachineInstrs in a
+given MachineFunction and, at each call site, queries PhysicalRegisterUsageInfo
+for the regmask of the callee function; if a regmask is available, it updates
+the call instruction. This updated regmask will be used by the intraprocedural
+register allocator. This pass must be scheduled before register allocation.
+
+More Optimization
+=================
+With IPRA, register allocation across a call site is expected not to use the
+callee's clobbered registers in the caller; however, there can be exceptions
+in which not enough registers are available, so the caller may still use the
+same registers as the callee. Optimistically, though, when caller and callee
+use different sets of registers, the callee can be optimized by not
+saving/restoring registers as the calling convention would require. If the
+caller is going to use a callee-saved register, it becomes the caller's
+responsibility to save/restore the contents of that register around the call
+site. To get this optimization we do not set the preserved bits according to
+the calling convention while calculating register usage, and we return null
+for the callee-saved registers (TargetFrameLoweringImpl.cpp) when queried by
+the prologue/epilogue inserter pass. The following code snippet shows the
+implementation:
+
+.. code-block:: c++
+
+  if (!TargetFrameLowering::isSafeForNoCSROpt(F)) {
+    const uint32_t *CallPreservedMask =
+        TRI->getCallPreservedMask(MF, F->getCallingConv());
+    // Set callee saved register as preserved.
+    for (unsigned i = 0; i < RegMaskSize; ++i)
+      RegMask[i] = RegMask[i] | CallPreservedMask[i];
+  } else {
+    ++NumCSROpt;
+    DEBUG(dbgs() << MF.getName()
+                 << " function optimized for not having CSR.\n");
+  }
+
+This optimization is applied only to functions with the following properties:
+internal linkage, not recursive, no indirect calls, and not optimized as a
+tail call. For functions which are recursive, indirectly called, or tail-call
+optimized, setting the callee-saved register set to null would produce
+incorrect results or runtime failures, because parameters and return values
+could not be passed correctly. The optimization also breaks the normal calling
+convention, and since its scope is limited to a compilation unit, callers in
+other compilation units would not save/restore registers around their call
+sites. This is why we only apply the optimization to functions which have
+internal linkage (i.e., not visible outside the current compilation unit).
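+
+For exposition, the eligibility test can be paraphrased as the following
+sketch (a paraphrase of the rules above, not the exact LLVM source; see
+TargetFrameLowering::isSafeForNoCSROpt for the real check):
+
+.. code-block:: c++
+
+  #include "llvm/IR/Function.h"
+  #include "llvm/IR/Instructions.h"
+  using namespace llvm;
+
+  // The no-CSR optimization is safe only for functions with local linkage
+  // whose address is never taken (so no indirect calls), which are known
+  // to be non-recursive, and which are never tail-called.
+  static bool isSafeForNoCSROptSketch(const Function &F) {
+    if (!F.hasLocalLinkage() || F.hasAddressTaken() ||
+        !F.hasFnAttribute(Attribute::NoRecurse))
+      return false;
+    // Conservatively reject the function if any user is a tail call.
+    for (const User *U : F.users())
+      if (const auto *CI = dyn_cast<CallInst>(U))
+        if (CI->isTailCall())
+          return false;
+    return true;
+  }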
+
+Example
+=======
+To explain the effect of this optimization, we use the adpcm.c program from
+the SNU Real Time Benchmarks. In the source code of ``adpcm.c`` there is a
+function ``quantl(int el, int detl)`` which calls the function ``my_abs(int
+n)``. Consider the following X86 assembly code, generated with and without
+IPRA. In the no-IPRA case the register allocator uses the regmask dictated by
+the calling convention, so the value of ``%edi`` is saved in ``%r14d``
+(``r14`` is callee-saved per the X86 calling convention) so that the value is
+preserved across the call to ``_abs``. However, ``_abs`` does not really use
+the ``R8`` register; IPRA collects this register usage and propagates it to
+the call site of ``_abs``, so with IPRA the register allocator is able to use
+``%r8d`` instead of ``%r14d``, and since ``r8`` is not callee-saved under the
+normal calling convention, ``_quantl`` does not need to preserve it.
+Similarly, ``%rsi`` is not callee-saved under the normal CC, so without IPRA
+its value is saved in ``%rbx`` (callee-saved), whereas with IPRA the register
+allocator knows that ``%esi`` is not clobbered by ``_abs`` and does not need
+to save/restore it. In total, IPRA saves two save/restore pairs for
+``_quantl``.
+
+.. literalinclude:: exampleipra.s
+   :diff: examplenoipra.s
+
+Results
+=======
+We have applied IPRA along with -O3, and with LTO for the multi-source
+applications, and measured the compile-time, execution-time, and code size
+changes.
+
+=========== =================== ===================== ======================
+Benchmark   Compile Time Impact Execution Time Impact Executable Size Impact
+=========== =================== ===================== ======================
+SPASS       0.94%               -0.35%                -0.0039%
+D (Parser)  0.36%               -2.94%                -0.063%
+oggenc      0.007%              -1.32%                -0.013%
+sphereflake 0.93%               -0.85%                -0.23%
+=========== =================== ===================== ======================
+
+============ ======================
+Application  Executable Size Impact
+============ ======================
+git          -0.15%
+LLVM + Clang -0.027%
+============ ======================
+
+Code
+====
+`Patches`_
+
+`Bug Reports`_
+
+`Discussion`_
+
+`Related Mails`_
+
+.. _`Patches`: https://reviews.llvm.org/differential/query/VPqsLAiTnbET/
+
+.. _`Bug Reports`: https://goo.gl/vJ3m5F
+
+.. _`Discussion`: https://goo.gl/k9iJdG
+
+.. _`Related Mails`: https://goo.gl/i4WjXJ
+
+Conclusions
+===========
+A very basic module-level interprocedural register allocation has been
+implemented. This optimization can help reduce code size and improve execution
+time, and with LTO it may have an even better effect. However, we have also
+observed performance regressions where libraries are heavily used, so the best
+way to decide whether this helps is to try out the optimization. We are also
+looking to improve some heuristics for spill code motion in the near future.
+There are other methods, such as `Wall 86`_ and `Kurlander and Fischer 96`_,
+that implement the same optimization with a different scope. It would be
+desirable to implement some of them in LLVM and compare the performance gains.
+
+.. _`Wall 86`: http://dl.acm.org/citation.cfm?id=989415
+
+.. _`Kurlander and Fischer 96`: http://dl.acm.org/citation.cfm?id=237780
+
+Acknowledgements
+================
+We would like to thank all the LLVM community members for their immense help
+during this development. We would also like to thank Google Summer of Code
+for funding this work.
+
+References
+==========
+`Chow 88`_
+
+.. _`Chow 88`: http://dl.acm.org/citation.cfm?id=53999
+
+`Steen 87`_
+
+.. _`Steen 87`: http://dl.acm.org/citation.cfm?id=59289
Index: docs/exampleipra.s
===================================================================
--- /dev/null
+++ docs/exampleipra.s
@@ -0,0 +1,42 @@
+_quantl:
+        .cfi_startproc
+        pushq %rbp
+Ltmp18:
+        .cfi_def_cfa_offset 16
+Ltmp19:
+        .cfi_offset %rbp, -16
+        movq %rsp, %rbp
+Ltmp20:
+        .cfi_def_cfa_register %rbp
+        movl %edi, %r8d
+        callq _abs
+        movslq %eax, %rdx
+        xorl %eax, %eax
+        leaq _decis_levl(%rip), %r9
+        movslq %esi, %rsi
+        xorl %ecx, %ecx
+        jmp LBB6_1
+        .p2align 4, 0x90
+LBB6_3:
+        incl %ecx
+        addq $4, %rax
+LBB6_1:
+        cmpl $29, %ecx
+        jg LBB6_4
+        movslq (%rax,%r9), %rdi
+        imulq %rsi, %rdi
+        sarq $15, %rdi
+        cmpq %rdi, %rdx
+        jg LBB6_3
+LBB6_4:
+        testl %r8d, %r8d
+        js LBB6_6
+        leaq _quant26bt_pos(%rip), %rcx
+        jmp LBB6_7
+LBB6_6:
+        leaq _quant26bt_neg(%rip), %rcx
+LBB6_7:
+        movl (%rax,%rcx), %eax
+        popq %rbp
+        retq
+        .cfi_endproc
Index: docs/examplenoipra.s
===================================================================
--- /dev/null
+++ docs/examplenoipra.s
@@ -0,0 +1,51 @@
+_quantl:
+        .cfi_startproc
+        pushq %rbp
+Ltmp18:
+        .cfi_def_cfa_offset 16
+Ltmp19:
+        .cfi_offset %rbp, -16
+        movq %rsp, %rbp
+Ltmp20:
+        .cfi_def_cfa_register %rbp
+        pushq %r14
+        pushq %rbx
+Ltmp22:
+        .cfi_offset %rbx, -32
+Ltmp23:
+        .cfi_offset %r14, -24
+        movl %esi, %ebx
+        movl %edi, %r14d
+        callq _abs
+        movslq %eax, %rcx
+        xorl %eax, %eax
+        leaq _decis_levl(%rip), %rdx
+        movslq %ebx, %rsi
+        xorl %edi, %edi
+        jmp LBB6_1
+        .p2align 4, 0x90
+LBB6_3:
+        incl %edi
+        addq $4, %rax
+LBB6_1:
+        cmpl $29, %edi
+        jg LBB6_4
+        movslq (%rax,%rdx), %rbx
+        imulq %rsi, %rbx
+        sarq $15, %rbx
+        cmpq %rbx, %rcx
+        jg LBB6_3
+LBB6_4:
+        testl %r14d, %r14d
+        js LBB6_6
+        leaq _quant26bt_pos(%rip), %rcx
+        jmp LBB6_7
+LBB6_6:
+        leaq _quant26bt_neg(%rip), %rcx
+LBB6_7:
+        movl (%rax,%rcx), %eax
+        popq %rbx
+        popq %r14
+        popq %rbp
+        retq
+        .cfi_endproc
Index: docs/index.rst
===================================================================
--- docs/index.rst
+++ docs/index.rst
@@ -241,6 +241,7 @@
    Bugpoint
    CodeGenerator
    ExceptionHandling
+   IPRA
    LinkTimeOptimization
    SegmentedStacks
    TableGenFundamentals
@@ -306,6 +307,10 @@
    This document describes the design and implementation of exception
    handling in LLVM.
 
+:doc:`IPRA`
+   This document describes the implementation of interprocedural register
+   allocation in LLVM.
+
 :doc:`Bugpoint`
    Automatic bug finder and test-case reducer description and usage
    information.