This is an archive of the discontinued LLVM Phabricator instance.

[MultiTailCallElimination]: Pass to eliminate multiple tail calls
Needs Review · Public

Authored by marels on Oct 25 2018, 8:26 AM.

Details

Summary

This pass converts multiple tail-recursive calls into a loop by modeling the
calls explicitly on the stack as a singly linked worklist.

void f(a, b, const c) {
  ...
  if (...)
    return

  a1, b1, b2, b3, x2, x3 = ...

  f(a1, b1, c) // 1st recursion
  a2 = h(x2)
  f(a2, b2, c) // 2nd recursion
  a3 = h(x3)
  f(a3, b3, c) // 3rd recursion
}

transforms to

void f(a, b, const c) {
  struct { x, b, next } *worklist = null

loop:
  ...
  if (...)
    goto check_return

  a1, b1, b2, b3, x2, x3 = ...

  /* Assign arguments for the first call */

  a = a1
  b = b1

  /* Put arguments of the remaining calls into the work list */

  queue = alloca(2 * sizeof(*worklist))
  queue[0] = {x2, b2, &queue[1]}
  queue[1] = {x3, b3, worklist}
  worklist = queue

  goto loop

check_return:
  if (!worklist)
    return
  a = h(worklist->x)
  b = worklist->b
  worklist = worklist->next

  goto loop
}

Such patterns occur, for example, when an application traverses full k-ary trees.

The benefit of this transformation is that neither the frame nor the link address
has to be stored on the stack. Also, pass-through arguments, like 'const c'
in the example above, are less likely to be saved onto the stack, and if so,
less often (only once in the entry block).

The downsides are:

  • a) An additional store for the worklist is required.
  • b) The worst-case stack memory required after this transformation depends on the number of recursive calls, instead of the recursion depth.
  • c) Additional conditional branches are introduced.
  • ad a) This cost is compensated by avoiding storing the link address in each call.
  • ad b) The pass can additionally add (and does so by default) code to manage unused blocks of allocas in a freelist. Because allocations are done in blocks, the freelist management can also be done in blocks. This requires the last item in each block to be marked in a way detectable by the return code; several mark algorithms are implemented (see below). A further variable containing the current freelist is stored on the stack. On the other hand, by not executing a recursion, the frame pointer loads and stores are omitted.
  • ad c) The pass adds 1 conditional branch into the return path and 2 additional branches for freelist management (see (b) above). Depending on the target, machine branch prediction can alleviate this.

Algorithm outline:

Analysis Phase:

  1. Analyze the function and gather the returning basic blocks (BB) (and BBs branching to return-only BBs) that contain recursive calls (RC).
  2. If more than one such BB, or no RC, is found, abandon the transformation.
  3. If the remaining BB has only one RC, abandon the transformation. Otherwise let N be the number of RCs in this BB.
  4. Analyze the instructions in this BB from the first RC until the terminator instruction and classify each instruction as movable or static. A movable instruction is an instruction that can safely be moved before the first RC. All other instructions are classified as static.
  5. Assign each static instruction to the following RC instruction. If static instructions are left after the last RC, abandon the transformation.
  6. Build the function H with all its arguments for each RC. By including the call itself in function H and enforcing H to return void, it is ensured that there are no escaping values used after the recursion. Note (6): By the way step 4 is executed, it is guaranteed that function H for the first RC consists of a single instruction: the call itself. The first call candidate is handled specially (the same way as in Tail Recursion Elimination (TRE)). Note (5, 6): The information gathered on each RC is collected in the structure RecursiveCall.
  7. Compare the second RC's function H with all later ones. The behavior must match, otherwise abandon the transformation.
  8. Note: As the first RC's function H is basically a TRE, it can be ignored in this step.
  9. Simplify the argument list by removing constants and pass-through arguments.
  10. Decide whether it is profitable to apply the transformation.

Transformation Phase:

  1. Adjust the entry block and split off the loop entry.
  2. Eliminate the first RC (similar to TRE).
  3. Eliminate the remaining RCs by allocating a new block of N-1 struct items (or picking one from the freelist) and filling it. This array is put at the front of the list. Pulling items from the list is added in (4). The execution of function H is ensured in (5).
  4. Create a new return block which pulls items from the worklist. If an end-of-block marker is reached, the block is put into the freelist.
  5. Add the instructions from function H into the return block and create the loop.
  6. Redirect each returning block into the new return block created in (4).
  7. Drop the constant STACKGROWTHDIRECION. It is mainly used as a proof of concept for AArch64.

Open issues and known TODOs - It would be great if reviewer could comment on
those as well:

  1. Pipeline integration: Currently the pass is put before TRE, together with some supporting passes for cleanup and preparation. This cannot be left as is. The preferred way could be to adjust "AArch64TargetMachine::adjustPassManager" and only use it at O3. AArch64 is selected because it is the architecture for which the author has suitable benchmarking and test setups available. This was tried once (in a similar way as in AMDGPUTargetMachine::adjustPassManager), however the result was that many LLVM projects did not compile anymore because of linker problems (the Passes library was missing). Do you have any advice here?
  2. The way to test whether it is profitable to run the pass needs adjustment. I think that functions that spill more registers have an increased chance to profit, while functions that spill less have a lower chance.
  3. Thinking of a configurable way (maybe a separate marker class) to adjust the way markers are implemented. E.g.: putting the marker bit into the pointer on AArch64 shows a significant performance boost (tested in a C implementation).
  4. Is it safe to temporarily create a function and return no-changes after deleting it? If not, is there a better way than using the FunctionComparator?
  5. GlobalsAA needs to be preserved. Not sure about this in this context; loads and stores are added here.

Diff Detail

Repository
rL LLVM

Event Timeline

marels created this revision. Oct 25 2018, 8:26 AM
marels edited the summary of this revision. Oct 25 2018, 8:40 AM

Revised broken formatting in summary

Very interesting work!

john.brawn added a comment (edited). Nov 13 2018, 6:00 AM

A few general comments:

For how to integrate this into the pass pipeline, I think it probably makes sense to put this just after inlining as this is kind of like inlining - we're seeing some function calls and eliminating them by putting extra stuff into this function (I experimented with this and it seemed to work). Instead of having the target insert the pass I think it makes more sense to have a heuristic to decide when to do this transformation that depends on the target, with a default implementation of "don't do this" (see next point).

When we do this transformation we are:

  • Saving the cost of the call instruction
  • Saving the cost of any saving and restoring of registers in the recursed function
  • Saving the cost of having to save the unchanging arguments across several calls
  • Incurring the cost of managing the call list and free list
  • Have the opportunity to CSE/LICM subexpressions using the unchanging arguments

so in terms of heuristics we want to do it if (saved_cost + expected_opportunity) > incurred_cost, and the saved cost is dependent mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64). So it should involve some kind of calls to cost functions in the target.

The current implementation does the call list as one entry = one recursive call, so the cost of managing the call list is proportional to the number of recursive calls. We could instead have a 'chunk' of calls equal to the number of recursive calls, e.g.

struct tree {
  double val;
  struct tree *children[4];
};

void function_to_optimise(struct tree *p, const double a, const double b) {
  if (!p)
    return;

  p->val += sin(a) + sin(b);
  function_to_optimise(p->children[0], a, b);
  function_to_optimise(p->children[1], a, b);
  function_to_optimise(p->children[2], a, b);
  function_to_optimise(p->children[3], a, b);
}

struct chunk {
  struct chunk *next;
  long int idx;
  struct tree *vals[4];
};
void function_optimised(struct tree *p, const double a, const double b) {
  struct chunk *list = 0;
  struct chunk *freelist = 0;
  struct tree *current_p = p;
  goto first;

  while(list) {
    // move to next chunk if at end
    while (list->idx >= 4) {
      struct chunk *tmp = list;
      list = list->next;
      tmp->next = freelist;
      freelist = tmp;
      if (!list)
        return;
    }
    current_p = list->vals[list->idx++];

  first:
    // early exit
    if (!current_p)
      continue;

    // do the operation
    current_p->val += sin(a) + sin(b);

    // add recursive calls to list
    struct chunk *tmp;
    if (freelist) {
      tmp = freelist;
      freelist = freelist->next;
    } else {
      tmp = alloca(sizeof(struct chunk));
    }
    tmp->idx = 0;
    memcpy(tmp->vals, current_p->children, 4 * sizeof(struct tree *));
    tmp->next = list;
    list = tmp;
  };
}

We now only have one allocation / list manipulation instead of one per recursive call, though at the loop head we have some extra complexity.

Also I think the current freelist handling isn't quite right - looking at the generated code it's checking on the _first_ time it adds to the worklist if there's an element in the freelist it can use so e.g. if we have 4 recursive calls to add and 2 freelist entries it will only use the first freelist entry and do 3 allocations, but it should do 2 allocations and use the 2 freelist entries. (Using the chunked approach would avoid this as only one chunk is ever added at once.)

Also, we should be disabling this transformation when optimising for size.

Hi & Thank you for the input:

A few general comments:

[For how to integrate this into the pass pipeline ...]:

I will take a look into this.

When we do this transformation we are:

  • Saving the cost of the call instruction
  • Saving the cost of any saving and restoring of registers in the recursed function
  • Saving the cost of having to save the unchanging arguments across several calls
  • Incurring the cost of managing the call list and free list
  • Have the opportunity to CSE/LICM subexpressions using the unchanging arguments

so in terms of heuristics we want to do it if (saved_cost + expected_opportunity) > incurred_cost, and the saved cost is dependent mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64). So it should involve some kind of calls to cost functions in the target.

This is an interesting idea I will take into account for my next update.

Also I think the current freelist handling isn't quite right - looking at the generated code it's checking on the _first_ time it adds to the worklist if there's an element in the freelist it can use so e.g. if we have 4 recursive calls to add and 2 freelist entries it will only use the first freelist entry and do 3 allocations, but it should do 2 allocations and use the 2 freelist entries. (Using the chunked approach would avoid this as only one chunk is ever added at once.)

I am not sure what you mean by "4 recursive calls". Do you mean depth or width? The number of allocas executed depends on the recursion depth and not on the width. Compared to your approach (see next point), the items are currently already allocated in chunks, where the "next" pointers are assigned after allocation.

Could you please provide more details on this? Sorry, I cannot follow your point.

[chunked approach]

I had a similar idea while implementing this but did not implement it because I thought it required non-uniform code generation for the first and the remaining recursive calls. Note that in case of 4 recursive calls, like in your example, only three items are maintained in the worklist; the first recursion is never put into the worklist because it can be handled just like tail recursion. However, I think your example can be adapted to do this as well.

Thinking of reducing the overhead for freelist management, I think there is a simple way to adapt the current implementation as well:

Currently the exit path looks like this:

loopentry:
  ...

return_path:
  if (worklist == nullptr)
    return;

  mark = marker.isLastItem();
  if (mark) {
    // The values of FIRST_ITEM and LAST_ITEM are computed at compile time (they are Constants)
    currentlist[LAST_ITEM].next = freelist;
    freelist = currentlist[FIRST_ITEM];
  }

  current_params = worklist.params;
  worklist = worklist.next;
  goto loopentry;

This implementation requires a check for adding to the freelist and a check for termination in each loop iteration. However it is an invariant that:

(mark == true) -> (listitem->next != nullptr)

Using this information the exitpath can be changed to

loopentry:
  ...

return_path:
  mark = marker.isLastItem();
  if (mark) {
    currentlist[LAST_ITEM].next = freelist;
    freelist = currentlist[FIRST_ITEM];
    if (worklist == nullptr)
      return;
  }

  current_params = worklist.params;
  worklist = worklist.next;
  goto loopentry;

In this case it is also safe that worklist is not null when fetching the parameters for the next iteration from it. Note that worklist always maintains the information for the next iteration.

Comparing this to your proposal, I think it should behave similarly regarding performance. (One compare for inner calls, 2 compares for the last call, one compare to put items into the worklist.)
Note that currently the data is allocated in chunks of (n-1) items, where n is the width of the recursion. The main difference is the way the last item in a chunk is marked. You use an extra index per chunk; the current implementation uses information stored in each item within a chunk (a bit in the pointer (I can push a prototype implementation for this), an extra field in the item, or using address and stack properties and comparing worklist < worklist->next).

Summary:

Alloca Overhead:

  • Currently: On alloca, the inner structure of the allocated chunk needs to be initialised, which creates an overhead (item[0].next = &item[1], ...). These steps are not executed when using items from the freelist.
  • Your Approach: On alloca, no overhead is created.

Insert To Worklist Overhead:

  • Current: Pointer manipulation
  • Your Approach: Pointer Manipulation and 1 store to initialise the index

Exit Path:

  • Currently: Check Return, Handle Freelist, Pointer Manipulation
  • Currently: [using the modifications above]: Check Freelist, Check Return [iff last item reached], Pointer Manipulation
  • Your Approach: Check Index, Increment Index, Check Return [iff last item reached], Pointer Manipulation [iff last item reached]

Also, in your approach the index can be omitted if the recursion width is 2, but similar things are done in the current approach.

I think both approaches are similar regarding performance. I will try your approach and do some measurements, unless you (@john.brawn) already did some performance tests and can provide results.
If your proposal is beneficial or equal in performance, I will adopt it, at least for the following reasons:

  1. We can get rid of the marker code, which complicates the pass and may also be target dependent (growing or shrinking stack; or stack alignment to tag pointers).
  2. Slightly less memory is used.
  3. By changing your increasing pointer to a decreasing one, the pass becomes more flexible and can easily be adapted to rarer cases like:
void fn(....) {
  //execute body

  if (...) {
    // three recursions
    fn(...)
    fn(h(...), ...)
    fn(h(...), ...)
  }
  if (...) {
    // five recursions
    fn(...)
    fn(h(...), ...)
    fn(h(...), ...)
    fn(h(...), ...)
    fn(h(...), ...)
  }
}
marels added a comment (edited). Nov 20 2018, 8:36 AM

I think there was some confusion in how the lists are managed.

From code analysis the pass knows how wide the recursion is. Wide means the number of recursive calls within the function. For example, qsort would have 2; traversing an octree calls itself 8 times.

Work-list

When the code reaches the first of those n calls, all information is available and the call can be 'emulated' by adjusting some PHINodes and
looping to the start of the function. This is basically the same as for Tail Call Elimination. However, before doing so, information about the remaining n-1 calls needs to be queued in the work-list.

While the allocation of the work-list items is done in chunks (arrays) of n-1 consecutive items, they are internally linked in a singly linked
list. Items are always added in chunks of n-1 at the front, and removed one by one from the front.

To clarify: the head pointer of the work-list always points to the next item to be processed and NOT to the current item. The information about the current arguments is extracted before branching to the loop-entry.

Because the allocation is done in chunks, when processing the last item, the address of the first item can be computed by a subtraction (in our case this is done by a GEP instruction).

This is also the point where free-list management comes into play. To reduce the amount of stack required, unused chunks are stored in a free-list.

Free-List

The free-list is also a linked list, but in contrast to the work-list it logically links chunks. The last next pointer of the array is used as the link pointer.

Whenever a new chunk is needed, it is first taken from the free-list if one is available. If the free-list is empty, a new chunk is allocated by executing an alloca instruction.

List-Items

To execute the algorithm each item stores the following information.

  • Arguments: These are the function arguments that are necessary to execute the function.
  • Next-Pointer: This is the link pointer to maintain the worklist.
  • Marker: The marker used to mark the last item in a chunk. Whenever a marked item is removed from the work-list, the complete chunk has to be put into the free-list.

Example Work-List:

After a couple of steps the work-list might look as denoted in Fig. 1 while executing Step 3.3 for a recursion with width 4 (e.g. quad-tree traversal). Note that Steps 1 and 3.1 are omitted because they are never allocated within the work-list.

The first column denotes the Arguments; the second the Marker, and the third the next pointer.

Fig. 1: Work-List and free-List while executing Step 3.3. Note that
work-list already points to Step 3.4 even when currently executing
Step 3.3.

+----------+---+-----+
| Step 2   |   | *   |
+----------+---+ | --+       +----------+---+-----+
| Step 3   |   | v * |       | Step 3.2 |   | *   |
+----------+---+-- | |       +----------+---+ | --+
| Step 4   | M | * v | <-\   | Step 3.3 |   | v * |
+----------+---+-|---+   |   +----------+---+-- | +
		 v	 |   | Step 3.4 | M | * v | <- [work-list]
	      nullptr	 |   +----------+---+ | --+
                         |                    |
                         \--------------------/

                             +----------+---+-----+
			     |          |   | *   | <- [free-list]
			     +----------+---+ | --+
                             |          |   | v * |
			     +----------+---+-- | +
			     |          | M | * v |
			     +----------+---+ | --+
			                      v
					   nullptr

Fig. 2 show the state of the maintained structures while executing Step 3.4. The changes are made just before branching to the loop entry.

Fig. 2: Work-list and free-list while executing Step 3.4.

+----------+---+-----+
| Step 2   |   | *   |
+----------+---+ | --+       +----------+---+-----+
| Step 3   |   | v * |       |          |   | *   | <- [free-list]
+----------+---+-- | |       +----------+---+ | --+
| Step 4   | M | * v | <-\   |          |   | v * |
+----------+---+-|---+   |   +----------+---+-- | +
		 v	 |   |          | M | * v |
	      nullptr	 |   +----------+---+ | --+
                         |                    |
                         \--- [work-list]     \-\
                                                |
                             +----------+---+-- | +
			     |          |   | * v |
			     +----------+---+ | --+
                             |          |   | v * |
			     +----------+---+-- | +
			     |          | M | * v |
			     +----------+---+ | --+
			                      v
					   nullptr

From Fig. 1 and Fig. 2 one can see the following:

  1. The next-pointers of unmarked elements are constant. They are only assigned once, when allocating a new chunk. Chunks within the free-list already contain the correct information.
  2. Because of (1), the next pointer of unmarked items always points to a valid item. Thus a return check can be omitted for unmarked items.
  3. The free-list management is only necessary if a marked item is removed from the work-list.

Markers

In order to check whether free-list management is necessary, a marker algorithm is executed to determine the M field.

The next sections list the algorithms that have been investigated so far:

1) Field Marker: Field markers maintain an explicit bit that stores the M flag in a separate field. An item is marked if the bit is set. The chunks look like this:

struct chunk {
  struct item {
    struct { ... } Arguments;
    struct item *Next;
    bool Marker;
  } items[N-1];
};

2) Chunked Marker (by @john.brawn): Chunked Markers omit the next pointers and replace them by an index storing a reference to the next item (+1). The worklist always points to the next chunk in the queue. A marked item is reached iff the index becomes 0 when decremented. A chunk looks like this:

struct chunk {
  struct item {
    struct { ... } Arguments;
  } items[N-1];
  unsigned Index;
  struct chunk *Next;
};

3) Compare Marker: The Compare Marker makes use of the order in which chunks are allocated. Depending on the stack growth direction, the mark can be determined by evaluating (item < item->next) or (item > item->next).

I do not go into the details, but this works as long as the following condition holds when executing 2 allocas in temporal order.

Assume:

a = alloca(X)
b = alloca(Y)

If X > 0 and Y > 0 then either (a < b) or (a > b) must hold.

However, I think LLVM's alloca semantics (theoretically) might break this requirement.

A chunk looks like this:

struct chunk {
  struct item {
    struct { ... } Arguments;
    struct item *Next;
  } items[N-1];
};

4) Tagged Marker A: Tagged Markers use the same chunk layout as Compare Markers. The difference is that the mark is encoded within Bit 0 of pointers that point to the item. An item is marked if Bit 0 in the pointer is cleared.

This works as long as the alignment of each list item is 2 or a multiple of 2, and as long as Bit 0 is masked before dereferencing an item pointer.

The return code for markers 1,2,3 and 4

Each return in the function is replaced by a branch to the following
return code.

returnpath:
  if (!worklist)
    return;

  tie(marked, item, oldchunk) = marker->execute_and_advance(worklist);
  if (marked)
     marker->add_to_freelist(oldchunk);

  next_arguments = marker->advance_and_return_next_item(worklist);

  // execute H(...) here
  goto loop_entry;

5) Tagged Marker B: Every marker except the Tagged Marker must dereference the work-list, so the return check must be executed first and for each element. Because Tagged Markers only need the pointer value itself, the return code for the Tagged Marker can be further optimized.

returnpath_tagged:
  tie(marked, item, oldchunk) = marker->execute_and_advance(worklist);

  if (marked) {
    if (!worklist)
      return;

    marker->add_to_freelist(oldchunk);
  }

  next_arguments = marker->advance_and_return_next_item(worklist);

  // execute H(...) here
  goto loop_entry;

6) Always True Marker: This is a trivial marker which is always (and implicitly) true. This marker applies to functions with width 2 only. It can be used with the tagged return path.

7) Always False Marker: This trivial marker disables free-lists entirely. It must be used with the untagged return path.

marels added a comment (edited). Nov 20 2018, 9:01 AM

I also did some measurements on a use-case to compare the marker algorithms (AArch64): a physics simulation whose hot loop traverses a full octree.

The resulting performance was: PassDisabled < Chunked < Field < Compare = Tagged A < Tagged B.
Note: I do not have an LLVM implementation for the Chunked Marker, so I tested it on modified C code.

Because of the Tagged B marker results, I am tending to redo this part of the code generation and drop all other markers besides the Always True and maybe the Always False marker algorithm.

What do you think about it?

marels updated this revision to Diff 176633. Dec 4 2018, 7:19 AM

Changes

  • Renamed pass to MultiTailCallElimination as it seems more fitting
  • Removed unnecessary marker implementations
  • Changed code generation quite a bit
  • Added a heuristic allowing the transformation to be skipped if not profitable
  • Added test cases
  • Enabled the pass only for O3

I am still not happy with the pipeline integration. I can only reasonably test this for AArch64 and want to enable it only for this target. Other machines might need different heuristics.

Any Ideas on this?

lebedev.ri added inline comments.
lib/Transforms/Scalar/MultiTailCallElimination.cpp
410 ↗(On Diff #176633)

Please apply clang-format to this patch.
80 col width.

test/Transforms/MultiTailCallElimination/lit.local.cfg
1–2 ↗(On Diff #176633)

This looks wrong.
This should then be in test/Transforms/MultiTailCallElimination/AArch64/lit.local.cfg
The top-level test/Transforms/MultiTailCallElimination/lit.local.cfg should not be target-specific,
or at least not specific to some one target.

marels marked 2 inline comments as done. Dec 4 2018, 7:39 AM
marels added inline comments.
lib/Transforms/Scalar/MultiTailCallElimination.cpp
410 ↗(On Diff #176633)

I will go through the source again and fix these kinds of things in the complete file. Thanks for the reminder.

test/Transforms/MultiTailCallElimination/lit.local.cfg
1–2 ↗(On Diff #176633)

I feel the same. I will fix this with the next commit.
However, this is currently related to the pipeline integration issue, which I currently do not like at all. Do you have an idea how to enable this pass specifically for AArch64? Clearly the pass has to run before TailCallElimination, as TCE will pick off the last call and disable this pass completely.

lebedev.ri added inline comments. Dec 4 2018, 7:49 AM
test/Transforms/MultiTailCallElimination/lit.local.cfg
1–2 ↗(On Diff #176633)

Probably based on a target triple specified in the IR.
Though I can't tell how to reach that info from this pass..

But well, this is the middle end; these passes generally shouldn't be too target-arch specific.
Is this likely fundamentally wrong for non-AArch64?

marels added a comment. Dec 4 2018, 8:08 AM

I implemented it in a way such that it should work correctly on all architectures. I just use the TargetTransformInfo for the heuristics; everything else is completely target independent.
So, I think it is unlikely to be fundamentally wrong. Whether other architectures profit from converting recursion into loops by explicitly modelling the state on the stack, I cannot tell (especially regarding performance).
As I can only benchmark on AArch64 in a suitable way, I prefer to enable this patch only for AArch64 and leave other architectures out for this initial version.

marels retitled this revision from [RecursionStackElimination]: Pass to eliminate recursions to [MultiTailCallElimination]: Pass to eliminate multiple tail calls. Dec 4 2018, 8:10 AM
john.brawn added inline comments. Dec 7 2018, 5:22 AM
lib/Transforms/Scalar/MultiTailCallElimination.cpp
1507–1513 ↗(On Diff #176633)

This doesn't even compile. If I turn the condition into ((&*BBI == E) || (&*BBI == TI)) or one of those two on their own I get an assertion failure on ++CandI because CandI is Candidates.end().

john.brawn added inline comments. Dec 7 2018, 7:54 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
550–556

I would think that just before the TailCallEliminationPass would be the place to run this pass, and from experimentation doing it there we don't need to add any other passes around it.

lib/Transforms/Scalar/MultiTailCallElimination.cpp
405–407 ↗(On Diff #176633)

It would be simpler to do RecursiveCall(const RecursiveCall &O) = delete.

1507–1513 ↗(On Diff #176633)

It looks like maybe the while(true) should be while(CandI != Candidates.end()). With that the tests pass at least.

1626–1627 ↗(On Diff #176633)

Should be 'else if' instead of nested 'if'. Also the curly braces are unnecessary here.

marels added inline comments. Dec 10 2018, 3:05 AM
lib/Transforms/Scalar/MultiTailCallElimination.cpp
1507–1513 ↗(On Diff #176633)

I have no idea how I messed this up, but I did.

The issue is that after my final tests I did some cosmetic edits, and somehow a couple of lines got deleted. I found them in my local history. Big oops.

Although I am currently replying to the comments and will include the fix in the next update, please also find it below:

diff --git a/lib/Transforms/Scalar/MultiTailCallElimination.cpp b/lib/Transforms/Scalar/MultiTailCallElimination.cpp
index ab4bff0..2dafdf1 100644
--- a/lib/Transforms/Scalar/MultiTailCallElimination.cpp
+++ b/lib/Transforms/Scalar/MultiTailCallElimination.cpp
@@ -1505,7 +1505,10 @@ private:

       // If we reached the terminator we are done.
       if (&*BBI == TI)
-        (&*BBI == E) {
+        break;
+
+      // The current search intervals end candidate is reached?
+      if (&*BBI == E) {
         StaticInst.push_back(&*BBI);
         // Advance the current iteration interval to next candidate.
         ++CandI;
john.brawn resigned from this revision. May 12 2020, 6:46 AM