Page MenuHomePhabricator

[MultiTailCallElimination]: Pass to eliminate multiple tail calls
Needs ReviewPublic

Authored by marels on Oct 25 2018, 8:26 AM.

Details

Summary

This pass converts multiple tail recursive calls into a loop, by modeling the
calls as a single linked worklist explicitly on the stack.

void f(a, b, const c) {
  ...
  if (...)
    return

  a1, b1, b2, b3, x2, x3 = ...

  f(a1, b1, c) // 1st recursion
  a2 = h(x2)
  f(a2, b2, c) // 2nd recursion
  a3 = h(x3)
  f(a3, b3, c) // 3rd recursion
}

transforms to

void f(a, b, const c) {
  struct { x, b, next } *worklist = null

loop:
  ...
  if (...)
    goto check_return

  a1, b1, b2, b3, x2, x3 = ...

  /* Assign arguments for the first call */

  a = a1
  b = b1

  /* Put arguments of the remaining calls into the work list */

  queue = alloca(2 * sizeof(*worklist))
  queue[0] = {x2, b2, &queue[1]}
  queue[1] = {x3, b3, worklist}
  worklist = queue

  goto loop

check_return:
  if (!worklist)
    return
  a = h(worklist->x)
  b = worklist->b
  worklist = worklist->next

  goto loop
}

Such patterns occur for example when an application traverses k-ary full trees.

The benefits of this transformation is, that neither frame nor link address
have to be stored on the stack. Also pass through arguments, like 'const c'
in example above, are less likely required to saved onto the stack, and if so
less often (only once in the entry block).

The downsides are:

  • a) An additional store for the worklist is required.
  • b) The worst case required stack memory after this transformation depends on the number of recursive calls, instead of the recursion depth.
  • c) Additional conditional branches
  • ad a) This point is compensated by avoiding storing the link address in each call.
  • ad b) This pass additionally can add (and does it by default) code to manage unused blocks of allocas in a freelist. Because allocations are done in blocks the freelist management, can also be done in blocks. This requires the last item in each block to be marked, detectable by the code within return. Currently the following mark algorithms are implemented: A further variable is stored on the stack containing the current freelist. On the other hand by not executing a recursion the frame pointer loads and stores are omitted.
  • ad c) The pass adds 1 conditional branch into the return path and 2 additional branches for freelist management (see (b) above). Depending on the target, machine branch prediction can elevate this.

Algorithm outline:

Analysis Phase:

  1. Analyze the function and gather the returning basic blocks (BB) (and BB branching to return-only BB) with recursive calls (RC).
  2. If more the one BB or no RC is found abandon the transformation.
  3. If the remaining BB has only one RC abandon the transformation. Otherwise let N be number of RC in this BB.
  4. Analyze the instructions in this BB and from the first RC until the terminator instruction and classify each instruction as movable and static. A movable instruction is and instruction that can be safely moved before the first RC. All other instructions are classified static.
  5. Assign each static instruction to the following RC instruction. If static instructions are left after the last RC abandon the transformation.
  6. Build the function H with all its arguments for each RC. By including the call itself in function H it is ensured that this function and enforcing this function to return void. It is ensured that there are no escaping values uses after the recursion. Note (6): By the way step 4 is executed it is guaranteed that function H for the first RC consists of a single instruction; the call itself. The first call candidate is handled special (the same way as in Tail Recursion Elimination (TRE)). Note (5,6): The information collected on each RC is collected in the structure RecursiveCall.
  7. Compare the second's function H with all later ones. The behavior must match, otherwise abandon the transformation.
  8. Note: As the first RCs function H basically a TRE it can be ignored in this step.
  9. Simplify the argument list by removing constants and pass through arguments.
  10. Decide whether it is profitable to use the transformation.

Transformation Phase:

  1. Adjust entry block and split of the loop entry.
  2. Eliminate the first RC (similar to TRE).
  3. Eliminate the remaining RC by allocating and filling an array new (or pick it from the freelist) block of N-1 struct items. This array is put in the front of the list. Pulling the list is added in (4). The execution of function H is ensured in (5).
  4. Create a new return block which pulls items from the worklist. If an and of block marker is reached. The block is put into the freelist.
  5. Add the instruction from function H into the return block and create the loop.
  6. Redirect each returning block into the new return block created in (4).
  7. Drop the constant STACKGROWTHDIRECION. It is manly uses as a proof of concept for Aarch64.

Open issues and known TODOs - It would be great if reviewer could comment on
those as well:

  1. Pipeline integration: Currently the pass is put before TRE. This includes some supporting passes, for cleanup and preparation. This cannot be left as is. The preferred way could be by adjusting "AArch64TargetMachine::adjustPassManager" and only use it with Opt3 . AAarch64 is selected, because this is the architecture the author has suitable benchmarking and test setups available. This was tried once (in a similar way as in AMDGPUTargetMachine::adjustPassManager), however the result was that may LLVM projects did not compile anymore because of linker problems (Passes library was missing). Do you have any advise here?
  2. The way to test if it profitable to use the pass needs adjustment. I think that functions, that spill more registers have an increased chance to profit, while functions that spill less, have lower chance for profit.
  3. Thinking of a configurable way (maybe a separate marker class) to adjust the way markers implemented. E.g.: putting the marker bit into the pointer on Aarch64 show a significant performance boost (test implemented in C).
  4. Is it safe to temporary create function and return no-changes after deleting it? If not is there a better way to than calling using the FunctionComparator?
  5. GlobalsAA needs to preserved. Not sure about this in this context. Loads and Stores are added here.

Diff Detail

Event Timeline

marels created this revision.Oct 25 2018, 8:26 AM
marels edited the summary of this revision. (Show Details)Oct 25 2018, 8:40 AM

Revised broken formatting in summary

Very interesting work!

john.brawn added a comment.EditedNov 13 2018, 6:00 AM

A few general comments:

For how to integrate this into the pass pipeline, I think it probably makes sense to put this just after inlining as this is kind of like inlining - we're seeing some function calls and eliminating them by putting extra stuff into this function (I experimented with this and it seemed to work). Instead of having the target insert the pass I think it makes more sense to have a heuristic to decide when to do this transformation that depends on the target, with a default implementation of "don't do this" (see next point).

When we do this transformation we are:

  • Saving the cost of the call instruction
  • Saving the cost of any saving and restoring of registers in the recursed function
  • Saving the cost of having to save the unchanging arguments across several calls
  • Incurring the cost of managing the call list and free list
  • Have the opportunity to CSE/LICM subexpressions using the unchanging arguments

so in terms of heuristics we want to do it if (saved_cost + expected_opportunity) > incurred_cost, and the saved cost is dependant mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64). So it should involve some kind of calls to cost functions in the target.

The current implementation does the call list as one entry = one recursive call, so the cost of managing the call list is proportional to the number of recursive calls. We could instead have a 'chunk' of calls equal to the number of recursive calls, e.g.

struct tree {
  double val;
  struct tree *children[4];
};

void function_to_optimise(struct tree *p, const double a, const double b) {
  if (!p)
    return;

  p->val += sin(a) + sin(b);
  f(p->children[0], a, b);
  f(p->children[1], a, b);
  f(p->children[2], a, b);
  f(p->children[3], a, b);
}

struct chunk {
  struct chunk *next;
  long int idx;
  struct tree *vals[4];
};
void function_optimised(struct tree *p, const double a, const double b) {
  struct chunk *list = 0;
  struct chunk *freelist = 0;
  struct tree *current_p = p;
  goto first;

  while(list) {
    // move to next chunk if at end
    while (list->idx >= 4) {
      struct chunk *tmp = list;
      list = list->next;
      tmp->next = freelist;
      freelist = tmp;
      if (!list)
        return;
    }
    current_p = list->vals[list->idx++];

  first:
    // early exit
    if (!current_p)
      continue;

    // do the operation
    current_p->val += sin(a) + sin(b);

    // add recursive calls to list
    struct chunk *tmp;
    if (freelist) {
      tmp = freelist;
      freelist = freelist->next;
    } else {
      tmp = alloca(sizeof(struct chunk));
    }
    tmp->idx = 0;
    memcpy(tmp->vals, current_p->children, 4 * sizeof(struct tree *));
    tmp->next = list;
    list = tmp;
  };
}

We now only have one allocation / list manipulation instead of one per recursive call, though at the loop head we have some extra complexity.

Also I think the current freelist handling isn't quite right - looking at the generated code it's checking on the _first_ time it adds to the worklist if there's an element in the freelist it can use so e.g. if we have 4 recursive calls to add and 2 freelist entries it will only use the first freelist entry and do 3 allocations, but it should do 2 allocations and use the 2 freelist entries. (Using the chunked approach would avoid this as only one chunk is ever added at once.)

Also, we should be disabling this transformation when optimising for size.

Hi & Thank you for the input:

A few general comments:

[For how to integrate this into the pass pipeline ...]:

I will take a look into this.

When we do this transformation we are:

  • Saving the cost of the call instruction
  • Saving the cost of any saving and restoring of registers in the recursed function
  • Saving the cost of having to save the unchanging arguments across several calls
  • Incurring the cost of managing the call list and free list
  • Have the opportunity to CSE/LICM subexpressions using the unchanging arguments so in terms of heuristics we want to do it if (saved_cost + expected_opportunity) > incurred_cost, and the saved cost is dependant mainly on the cost of saving/restoring callee-saved registers (or at least that's what it looks like on aarch64). So it should involve some kind of calls to cost functions in the target.

This is an interesting idea I will take into account for my next update.

Also I think the current freelist handling isn't quite right - looking at the generated code it's checking on the _first_ time it adds to the worklist if there's an element in the freelist it can use so e.g. if we have 4 recursive calls to add and 2 freelist entries it will only use the first freelist entry and do 3 allocations, but it should do 2 allocations and use the 2 freelist entries. (Using the chunked approach would avoid this as only one chunk is ever added at once.)

I am not sure what you mean with "4 recursive calls". Do you mean depth or width. The number of alloca's executed depends on the recursion depth and not on the width. Compared to you approach (see next point) currently the items are already alloced in chunks where the "next" pointers are assigned after allocation.

Could you please provide a more details on this. Sorry, I cannot follow your point.

[chunked approach]

I head a similar idea while implementing this but I did not implement because I thought it required non uniform code generation for the first and the remaining recursive calls. Note in case of 4 recursive calls, like in your example, only three items are maintained in the worklist; the first recursion is never put in the worklist because it can be handled just like tail recursion. However I think you example can be adopted to do this as well.

Thinking of the reducing the overhead for freelist management, I think there is a simple way to adopt the current implementation as well:

Currently the exit path looks like this:

loopentry:
  ...

return_path:
  if (worklist == nullptr)
    return;

  mark = marker.isLastItem();
  if (mark) {
    // The values of FIRST_ITEM and  LAST_TEM is computed at compile time (it is a Constant)
    currentlist[LAST_ITEM].next = freelist;
    freelist = currentlist[FIRST_ITEM];
  }

  current_params = worklist.params;
  worklist = worklist.next;
  goto loopentry;

This implementation requires a check for adding to the freelist and a check for termination in each loop iteration. However it is an invariant that:

(mark == true) -> (listitem->next != nullptr)

Using this information the exitpath can be changed to

loopentry:
  ...

return_path:
  mark = marker.isLastItem();
  if (mark) {
    currentlist[LAST_ITEM].next = freelist;
    freelist = currentlist[FIRST_ITEM];
    if (worklist == nullptr)
      return;
  }

  current_params = worklist.params;
  worklist = worklist.next;
  goto loopentry;

In this case it is also save that worklist is not null when fetching the parameters for the next iteration from it. Note that worklist always maintains the information for the next iteration.

Comparing this to the your proposal I think it should behave similarly regarding performance. (In compare for inner callse, 2 compares for the last call, one compare to put items into the worklist)
Note that currently the data is allocated in chunks of (n-1) items, where n is the witdh of the recursion. The main difference is the way the list item in the chunk is marked. You use an extra index per chunk, the current implementation some uses information stored in each item within a chunk (Bit in pointer (I can push a prototype implementation for this), extra field in item, using address and stack properties and comparing worklist < worklist->next).

Summary:

Alloca Overhead:

  • Currently: On alloca the inner structure of the allocated chunk needs to be initialised and creates an overhead (item[0].next = &item[1], ...); This steps are not executed when using items within the freelist.
  • Your Approach: On alloca no overhead is created

Insert To Worklist Overhead:

  • Current: Pointer manipulation
  • Your Approach: Pointer Manipulation and 1 store to initialise the index

Exit Path:

  • Currently: Check Return, Handle Freelist, Pointer Manipulation
  • Currently: [using the modifications above]: Check Freelist, Check Return [iff last item reached], Pointer Manipulation
  • Your Approach: Check Index, Increment Index, Check Return [iff last item reached], Pointer Manipulation [iff last item reached]

Also, in your approach the index can be omitted if the recursion witdh is 2, but similar things are done in the current approach.

I think both approaches are similar regarding performance. I will try this approach and do some measurements, unless you (@john.brawn) already dis some performance tests and provide results.
If you proposal is beneficial or equal in performance I I will adopt it et least for the following reasons:

  1. We can get rid of the marker code which complicates the pass and also may be target dependent (Growing, Shrinking Stack; or Stack alignment to tag pointer).
  2. Slightly less memory used
  3. By changing your increasing to a decreasing pointer, the pass becomes more flexible and can easily adopted to more rare cases like:
void fn(....) {
  //execute body

  if (...) {
    // three recursions
    fn(...)
    fn(h(...), ...)
    fn(h(...), ...)
  }
  if (...) {
    // five recursions
    fn(...)
    fn(h(...), ...)
    fn(h(...), ...)
    fn(h(...), ...)
    fn(h(...), ...)
  }
}
marels added a comment.EditedNov 20 2018, 8:36 AM

I think there was some confusion in how the lists are managed.

From code analyses the pass knows how wide the recursion is. Wide means the number of recursive calls within a the function. For example qsort would have 2. Traversing an Octtree calls itself 8 times.

Work-list

When the code reaches the first of those n calls, all information is available and the call can be 'emulated' be adjusting some PHINode and
looping to the start of the function. This is basically the same as for Tail Call Elimination. However, before doing so information about the remaining n-1 calls need to be queued in the work-list.

While the allocation of the work-list items is done in chunks (array) of n-1 consecutive items, they are internally linked in single linked
list. Items are always added in chunk of n-1 at the front, and removed one by one from the front.

To clarify head pointer of the work-list always points to the next item to be processed and NOT to the current item. The information of the current arguments is extracted before branching to the loop-entry.

Because the allocation is done in chunks, when processing the last item, the address of the first item can be computed by an subtraction (in this our case this is done by a GEP instruction).

This is also the point where free list management comes into play. To reduce the amount of stack required unused chunks are stored within a free-list.

Free-List

The free-list is also a linked list, but in contrast to the work-list it logical links chunks. As link pointer the last next pointer the array is used.

Whenever a new chunk is allocated it is first taken from the free list if there is one available. If the free-list is empty a new chunk is allocated by executing an alloca instruction.

List-Items

To execute the algorithm each item stores the following information.

  • Arguments: These are the function arguments that are necessary to execute the function.
  • Next-Pointer: This is the link pointer to maintain the worklist.
  • Marker: The marker used to mark the last item in a chunk. Whenever a marked item is removed from the work-list. The complete chunk has to be put into the free-list.

Example Work-List:

After a couple of steps the work-list might log like denoted in Fig. 1 while executing Step 3.3 for a recursion that with wideness 4 (e.g. Quad-Tree traversing). Note the Step 1 and 3.1 are omitted because the are never allocated within the work-list.

The first column denotes the Arguments; the second the Marker, and the third the next pointer.

Fig. 1: Work-List and free-List while executing Step 3.3. Note that
work-list already points to Step 3.4 even when currently executing
Step 3.3.

+----------+---+-----+
| Step 2   |   | *   |
+----------+---+ | --+       +----------+---+-----+
| Step 3   |   | v * |       | Step 3.2 |   | *   |
+----------+---+-- | |       +----------+---+ | --+
| Step 4   | M | * v | <-\   | Step 3.3 |   | v * |
+----------+---+-|---+   |   +----------+---+-- | +
		 v	 |   | Step 3.4 | M | * v | <- [work-list]
	      nullptr	 |   +----------+---+ | --+
                         |                    |
                         \--------------------/

                             +----------+---+-----+
			     |          |   | *   | <- [free-list]
			     +----------+---+ | --+
                             |          |   | v * |
			     +----------+---+-- | +
			     |          | M | * v |
			     +----------+---+ | --+
			                      v
					   nullptr

Fig. 2 show the state of the maintained structures while executing Step 3.4. The changes are made just before branching to the loop entry.

Fig. 2: Work-list and free-list while executing Step 3.4.

+----------+---+-----+
| Step 2   |   | *   |
+----------+---+ | --+       +----------+---+-----+
| Step 3   |   | v * |       |          |   | *   | <- [free-list]
+----------+---+-- | |       +----------+---+ | --+
| Step 4   | M | * v | <-\   |          |   | v * |
+----------+---+-|---+   |   +----------+---+-- | +
		 v	 |   |          | M | * v |
	      nullptr	 |   +----------+---+ | --+
                         |                    |
                         \--- [work-list]     \-\
                                                |
                             +----------+---+-- | +
			     |          |   | * v |
			     +----------+---+ | --+
                             |          |   | v * |
			     +----------+---+-- | +
			     |          | M | * v |
			     +----------+---+ | --+
			                      v
					   nullptr

From Fig. 1 and Fig. 2 one see the following:

  1. The next-pointer of unmarked elements are constant. They are only assigned once when allocating a new chunk. Chunks within the free-list already contain the correct information.
  1. Because of (1) the next pointer of unmarked items always point the a valid item. Thus a return check can be omitted for unmarked items.
  1. The free list management is only necessary if a marked item is removed from the work-list.

Markers

In order to check if free-list management is necessary a marker algorithm is executed. To determine the M field.

The next section list the algorithms that have been investigated so far:

1) Field Marker: Field markers maintain an explicit bit that stores the M flag in a separate field. An item is marked if the bit is set. The chunks look like this:

struct chunk {
  struct item {
    struct { ... } Arguments;
    struct item *Next;
    bool Marker;
  } items[N-1];
};

2) Chunked Marker (by @john.brawn): Chunked Markers omit the next pointer and replace them by an index storing a reference to the next item (+1). The worklist always points to the next chunk in the queue. An marked item is reached iff when decrementing the index it becomes 0. A chunk looks like this.

struct chunk {
  struct item {
    struct { ... } Arguments;
  } items[N-1];
  unsigned Index;
  struct chunk *Next.
};

3) Compare Marker: The Compare Marker makes use of the order in which chunks are allocated. Depending on the stack growth direction the marking can be determined by executing (item < item->next) or (item > item->next).

I don not go into the details but this works as long as the following condition holds when executing 2 allocas in a temporal order.

Assume:

a = alloca(X)
b = alloca(Y)

If X > 0 and Y > 0 then either (a < b) or (a > b) must hold.

However I think LLVMs alloca semantics (theoretically) might break this requirement.

A chunk looks like this:

struct chunk {
  struct item {
    struct { ... } Arguments;
    struct item *Next;
  } items[N-1];
};

4) Tagged Marker A: Tagged Marker use the same chunk layout as Compare Markers. The difference is that the marking is encoded within Bit 0 of pointers that point to the item. An item is marked it Bit 0 in the pointer is cleared.

This works as long as all the alignment of each list item is 2 or a multiple of 2. And that if before dereferencing an item pointer Bit 0 is masked.

The return code for markers 1,2,3 and 4

Each return in the function is replaced by a branch to the following
return code.

returnpath:
  if (!worklist)
    return;

  tie(marked, item, oldchunk) = marker->execute_and_advance(worklist);
  if (marked)
     marker->add_to_freelist(oldchunk);

  next_arguments = marker->advance_and_return_next_item(worklist);

  // execute H(...) here
  goto loop_entry;

5) Tagged Marker B: Beside the tagged marker each marker must dereference the work-list thus the return check must be executed first and for each element. Because tagged markers only need the pointer value itself, the return code for that tagged marker can be further optimized.

returnpath_tagged:
  tie(marked, item, oldchunk) = marker->execute_and_advance(worklist);

  if (marked) {
    if (!worklist)
      return;

    marker->add_to_freelist(oldchunk);
  }

  next_arguments = marker->advance_and_return_next_item(worklist);

  // execute H(...) here
  goto loop_entry;

6) Always True Marker: This is a trivial marker which is always (and implicitly true). This marker applies to function with wideness 2 only. It can be used with the tagged return path.

7) Always False Marker: This trivial marker disables free-lists at all. It must be with the untagged return path.

marels added a comment.EditedNov 20 2018, 9:01 AM

I also did some measurements on a use-case to compare the marker algorithm (AArch64): Physics-Simulation in the hot loop the application traverses a full octree.

The resulting performance was: PassDisabled < Chunked < Field < Compare = Tagged A < Tagged B.
Note: I do not have an llvm implementation for the Chunked Marker do I tested it on modified C code.

Because the Tagged B marker I am tending to redo this part of the code generation and drop all other markers besides the always true and maybe the always false marker algorithm.

What do you think about it?

marels updated this revision to Diff 176633.Dec 4 2018, 7:19 AM

Changes

  • Renamed Pass to MultiTailCallElimination as it seems more fitting
  • Removed Unnecessary Marker Implementation
  • Changed Code Generation quite a bit
  • Add heuristic allowing to skip transformation if not profitable
  • Add Test Cases
  • Enable pass only for O3

I am still not happy with the pipeline integration. A can only reasonable test this for AArch64 and want to enable it only for this target. Other machines might need different heuristics.

Any Ideas on this?

lebedev.ri added inline comments.
lib/Transforms/Scalar/MultiTailCallElimination.cpp
410

Please apply clang-format to this patch.
80 col width.

test/Transforms/MultiTailCallElimination/lit.local.cfg
1–2

This looks wrong.
This should then be in test/Transforms/MultiTailCallElimination/AArch64/lit.local.cfg
The top-level test/Transforms/MultiTailCallElimination/lit.local.cfg should not be target-specific,
or at least not specific to some one target.

marels marked 2 inline comments as done.Dec 4 2018, 7:39 AM
marels added inline comments.
lib/Transforms/Scalar/MultiTailCallElimination.cpp
410

I will go through the source again fix these kind of things in the complete file. Thanks for the remainder.

test/Transforms/MultiTailCallElimination/lit.local.cfg
1–2

I feel the same. I will fix this with the next commit.
However this is currently related to the Pipeline Integration issue. Which I currently do not like at all. Do you have an idea how to enable this pass specifically for AArch64. Clearly the pass has to run before TailCallElimination as TCE will pick of the last call and disable this pass completely.

lebedev.ri added inline comments.Dec 4 2018, 7:49 AM
test/Transforms/MultiTailCallElimination/lit.local.cfg
1–2

Probably based on a target triple specified in the IR.
Though i can't tell how to reach that info from this pass..

But well, this is middle end, these passes generally shouldn't be too target arch specific.
Is this likely fundamentally wrong for non-aarch64?

marels added a comment.Dec 4 2018, 8:08 AM

I implemented in a way such that it should be correct work on all architectures. I just use the TargetTransformInfo for the heuristics. Everything else completely target independent.
So, fundamentally wrong I think it is unlikely. If they profit from converting recursion into loops by explicitly modelling the state on the stack, I cannot tell (especially regarding performance).
As I can benchmark only on AArch64 in a suitable way, I prefer to enable this patch only for AArch64 and leave other architectures out for this initial version.

marels retitled this revision from [RecursionStackElimination]: Pass to eliminate recursions to [MultiTailCallElimination]: Pass to eliminate multiple tail calls.Dec 4 2018, 8:10 AM
john.brawn added inline comments.Dec 7 2018, 5:22 AM
lib/Transforms/Scalar/MultiTailCallElimination.cpp
1507–1513

This doesn't even compile. If I turn the condition into ((&*BBI == E) || (&*BBI == TI)) or one of those two on their own I get an assertion failure on ++CandI because CandI is Candidates.end().

john.brawn added inline comments.Dec 7 2018, 7:54 AM
lib/Transforms/IPO/PassManagerBuilder.cpp
547–553

I would think that just before the TailCallEliminationPass would be the place to run this pass, and from experimentation doing it there we don't need to add any other passes around it.

lib/Transforms/Scalar/MultiTailCallElimination.cpp
405–407

It would be simpler to do RecursiveCall(const RecursiveCall &O) = delete.

1507–1513

I looks like maybe the while(true) should be while(CandI != Candidates.end()). With that the tests pass at least.

1626–1627

Should be 'else if' instead of nested 'if'. Also the curly braces are unnecessary here.

marels added inline comments.Dec 10 2018, 3:05 AM
lib/Transforms/Scalar/MultiTailCallElimination.cpp
1507–1513

I have no idea how I messed this up, but I did.

The issue is after my final tests, I did some cosmetics and added. Somehow a couple of lines got deleted. I found them in my local history. Big ups.

Although I am currently onto replying to the comments and will be included the fix the next update, please also find it below

diff --git a/lib/Transforms/Scalar/MultiTailCallElimination.cpp b/lib/Transforms/Scalar/MultiTailCallElimination.cpp
index ab4bff0..2dafdf1 100644
--- a/lib/Transforms/Scalar/MultiTailCallElimination.cpp
+++ b/lib/Transforms/Scalar/MultiTailCallElimination.cpp
@@ -1505,7 +1505,10 @@ private:

       // If we reached the terminator we are done.
       if (&*BBI == TI)
-        (&*BBI == E) {
+        break;
+
+      // The current search intervals end candidate is reached?
+      if (&*BBI == E) {
         StaticInst.push_back(&*BBI);
         // Advance the current iteration interval to next candidate.
         ++CandI;