This pass converts multiple tail recursive calls into a loop, by modeling the
calls as a single linked worklist explicitly on the stack.
void f(a, b, const c) { ... if (...) return a1, b1, b2, b3, x2, x3 = ... f(a1, b1, c) // 1st recursion a2 = h(x2) f(a2, b2, c) // 2nd recursion a3 = h(x3) f(a3, b3, c) // 3rd recursion }
transforms to
void f(a, b, const c) { struct { x, b, next } *worklist = null loop: ... if (...) goto check_return a1, b1, b2, b3, x2, x3 = ... /* Assign arguments for the first call */ a = a1 b = b1 /* Put arguments of the remaining calls into the work list */ queue = alloca(2 * sizeof(*worklist)) queue[0] = {x2, b2, &queue[1]} queue[1] = {x3, b3, worklist} worklist = queue goto loop check_return: if (!worklist) return a = h(worklist->x) b = worklist->b worklist = worklist->next goto loop }
Such patterns occur for example when an application traverses k-ary full trees.
The benefits of this transformation is, that neither frame nor link address
have to be stored on the stack. Also pass through arguments, like 'const c'
in example above, are less likely required to saved onto the stack, and if so
less often (only once in the entry block).
The downsides are:
- a) An additional store for the worklist is required.
- b) The worst case required stack memory after this transformation depends on the number of recursive calls, instead of the recursion depth.
- c) Additional conditional branches
- ad a) This point is compensated by avoiding storing the link address in each call.
- ad b) This pass additionally can add (and does it by default) code to manage unused blocks of allocas in a freelist. Because allocations are done in blocks the freelist management, can also be done in blocks. This requires the last item in each block to be marked, detectable by the code within return. Currently the following mark algorithms are implemented: A further variable is stored on the stack containing the current freelist. On the other hand by not executing a recursion the frame pointer loads and stores are omitted.
- ad c) The pass adds 1 conditional branch into the return path and 2 additional branches for freelist management (see (b) above). Depending on the target, machine branch prediction can elevate this.
Algorithm outline:
Analysis Phase:
- Analyze the function and gather the returning basic blocks (BB) (and BB branching to return-only BB) with recursive calls (RC).
- If more the one BB or no RC is found abandon the transformation.
- If the remaining BB has only one RC abandon the transformation. Otherwise let N be number of RC in this BB.
- Analyze the instructions in this BB and from the first RC until the terminator instruction and classify each instruction as movable and static. A movable instruction is and instruction that can be safely moved before the first RC. All other instructions are classified static.
- Assign each static instruction to the following RC instruction. If static instructions are left after the last RC abandon the transformation.
- Build the function H with all its arguments for each RC. By including the call itself in function H it is ensured that this function and enforcing this function to return void. It is ensured that there are no escaping values uses after the recursion. Note (6): By the way step 4 is executed it is guaranteed that function H for the first RC consists of a single instruction; the call itself. The first call candidate is handled special (the same way as in Tail Recursion Elimination (TRE)). Note (5,6): The information collected on each RC is collected in the structure RecursiveCall.
- Compare the second's function H with all later ones. The behavior must match, otherwise abandon the transformation.
- Note: As the first RCs function H basically a TRE it can be ignored in this step.
- Simplify the argument list by removing constants and pass through arguments.
- Decide whether it is profitable to use the transformation.
Transformation Phase:
- Adjust entry block and split of the loop entry.
- Eliminate the first RC (similar to TRE).
- Eliminate the remaining RC by allocating and filling an array new (or pick it from the freelist) block of N-1 struct items. This array is put in the front of the list. Pulling the list is added in (4). The execution of function H is ensured in (5).
- Create a new return block which pulls items from the worklist. If an and of block marker is reached. The block is put into the freelist.
- Add the instruction from function H into the return block and create the loop.
- Redirect each returning block into the new return block created in (4).
- Drop the constant STACKGROWTHDIRECION. It is manly uses as a proof of concept for Aarch64.
Open issues and known TODOs - It would be great if reviewer could comment on
those as well:
- Pipeline integration: Currently the pass is put before TRE. This includes some supporting passes, for cleanup and preparation. This cannot be left as is. The preferred way could be by adjusting "AArch64TargetMachine::adjustPassManager" and only use it with Opt3 . AAarch64 is selected, because this is the architecture the author has suitable benchmarking and test setups available. This was tried once (in a similar way as in AMDGPUTargetMachine::adjustPassManager), however the result was that may LLVM projects did not compile anymore because of linker problems (Passes library was missing). Do you have any advise here?
- The way to test if it profitable to use the pass needs adjustment. I think that functions, that spill more registers have an increased chance to profit, while functions that spill less, have lower chance for profit.
- Thinking of a configurable way (maybe a separate marker class) to adjust the way markers implemented. E.g.: putting the marker bit into the pointer on Aarch64 show a significant performance boost (test implemented in C).
- Is it safe to temporary create function and return no-changes after deleting it? If not is there a better way to than calling using the FunctionComparator?
- GlobalsAA needs to preserved. Not sure about this in this context. Loads and Stores are added here.
I would think that just before the TailCallEliminationPass would be the place to run this pass, and from experimentation doing it there we don't need to add any other passes around it.