Page MenuHomePhabricator

[AArch64] Homogeneous Prolog and Epilog for Size Optimization
Needs ReviewPublic

Authored by kyulee on Mar 22 2020, 11:31 AM.



Prolog and epilog to handle callee-save registers tend to be irregular with different immediate offsets, which are not often being outlined (by machine outliner) when optimizing for size. From D18619, combining stack operations stretched irregularity further.
This patch tries to emit homogeneous stores and loads with the same offset for prolog and epilog respectively. We have observed that this homogeneous prolog and epilog significantly increased the chance of outlining, resulting in a code size reduction. However, it was still far from the minimum size code due to requiring the special handling of the return register, x30.
This patch custom-outlines prolog and epilog in place:

  • Injects HOM_Prolog and HOM_Epilog psuedo instructions in Prolog and Epilog Injection Pass
  • Lower and optimize them in AArchLowerHomogneousPrologEpilog Pass
  • Outlined helpers are created on demand. Identical helpers are merged by the linker.
  • An opt-in flag is introduced to enable this feature. Another threshold flag is also introduced to control the aggressiveness of outlining for application's need.

This reduced an average of 4% of code size on LLVM-TestSuite/CTMark targeting arm64/-Oz.

Diff Detail

Unit TestsFailed

380 mslinux > HWAddressSanitizer-x86_64.TestCases::sizes.cpp
Script: -- : 'RUN: at line 3'; /mnt/disks/ssd0/agent/llvm-project/build/./bin/clang --driver-mode=g++ -m64 -gline-tables-only -fsanitize=hwaddress -fuse-ld=lld -mcmodel=large -mllvm -hwasan-globals -mllvm -hwasan-use-short-granules -mllvm -hwasan-instrument-landing-pads=0 -mllvm -hwasan-instrument-personality-functions /mnt/disks/ssd0/agent/llvm-project/compiler-rt/test/hwasan/TestCases/sizes.cpp -nostdlib++ -lstdc++ -o /mnt/disks/ssd0/agent/llvm-project/build/projects/compiler-rt/test/hwasan/X86_64/TestCases/Output/sizes.cpp.tmp

Event Timeline

kyulee created this revision.Mar 22 2020, 11:31 AM

Hello. I like the idea. It's something we thought about internally but no-one has ever worked on enough to see how much of an improvement it gives in general.


Can you explain what CompactUnwindFrame is, and why it's needed for this to work? I'm not really an expert on Frame Lowering, but is there a way to get this to work without that restriction?

kyulee added inline comments.Mar 30 2020, 10:22 AM

This guarantees a pair register use and the ordering of CSR save locations are fixed, which simplifies the current logic for correctness. An immediate remedy to this is to support the non-pair register case, but I'm not either an expert for all other platforms and calling conventions, and I'm not sure how I test and validate such a relaxation. I think probably it's better leaving this extension to folks who know the details for their platforms.

kyulee updated this revision to Diff 276315.Jul 7 2020, 11:28 PM

Updating for Linux support.

plotfi added inline comments.Jul 8 2020, 7:40 PM

Is the bit that was removed a non-functional change here? If so, can this be a separate NFC commit?

plotfi added inline comments.Jul 8 2020, 10:18 PM

This is so small that I feel it would be more descriptive at the call site of SavedRegs.set/test to have:

/// true if CSRs should be paired
const bool producePairRegisters = produceCompactUnwindFrame(MF) || homogeneousPrologEpilog(MF);

With some additional comments on the register paring in the context of homogenous-prolog-epilog.

kyulee updated this revision to Diff 276640.Jul 9 2020, 12:18 AM

Refactor getArgumentPopSize() to D83456
Add comment on producePairRegisters()

kyulee updated this revision to Diff 276644.Jul 9 2020, 12:39 AM
kyulee marked 2 inline comments as done.

Fix for merge

kyulee marked an inline comment as done.Jul 9 2020, 12:41 AM
kyulee added inline comments.

Updated the comment.


It's refactored to D83456.

@kyulee Update please (since the NFC has landed), so that Harbormaster can run again without conflict.

kyulee updated this revision to Diff 276806.Jul 9 2020, 12:32 PM

Rebase after D83456

plotfi added inline comments.Jul 14 2020, 9:05 PM

I think this one could potentially also be a separate NFC as well.

@t.p.northover Hi Tim, would you have some bandwidth to take a look at this patch for homogeneous prolog/epilog callee-save registers opti?

kyulee updated this revision to Diff 278392.Jul 16 2020, 1:19 AM
kyulee marked an inline comment as done.

Fix for Outlined Epilog for a Tail-Call

Added a case x16 is live across the epilog helper, which is bailed-out.

@sdesmalen @efriedma @dmgreen Hi Sander, Eli, Dave. Would any of you have some time to help review this Prolog Epilog Size optimization patch? Much appreciated if you do. Me and @kyulee would be available to chat on IRC or discord about it for more info.

Have you looked at the RISCV prologue/epilogue lowering support? How does this compare?


A long list of random conditions like this is asking for trouble: someone is inevitably going to miss some case where it needs to be updated in the future. Is this really the only way?


If we're going to save fp/lr in the caller, can't we just save them to the correct position, instead of copying them to a new location in the outlined function?

Also, in general, pre-increment instructions are more expensive then non-pre-increment instructions. It would be better to rearrange the operations so you can emit exactly one pre-increment instruction, instead of making every instruction pre-increment.

kyulee added a comment.Oct 1 2020, 9:21 AM

@efriedma Unlike other work that uses a helper call to runtime library, this work synthesizes outlined functions with a custom-calling convention. This is handy and flexible to extend this logic for other patterns or purposes like instrumentation, etc inside the helper within the compiler.


This patch is meant to find a regular/homogeneous stack frame as possible that can be easily outlined for the size. In theory, all these random conditions/restrictions can be relaxed with more thorough implementation, but I think this may be good to start with. Let me know how I can improve this.


I agree this is not ideal for the Linux case which has the different (opposite) order of CSR than Darwin/iOS which does not require this pattern (see the above without SAVELR case). The initial version disallowed this while supporting compact unwind case (Darwin) only but from @dmgreen suggestion, I added this Linux support, but tries to keep as the same pattern as possible instead of introducing different shape at the call-site. This is an important aspect when we want to extend this homogeneous function entry for other purpose (like instrumentation, etc) which actually I'm working on in another context.

I understand pre/post-inc/dec uses are more expensive, and this code can be optimized with additional complexity.
The goal is to minimize the code by exposing the homogeneous patterns and outlining them at the cost of performance -- in fact, outlining itself may hurt perf significantly on a CPU bound case although page-fault or working set is improved for the large binary with this aggressive size optimization.
@efriedma Can I follow up the specialized/optimized outlined function after this patch? Or I can work on the improved version in place now.

Should this be controlled by the clang -moutline/-mno-outline?


I'm not sure how hasVarSizedObjects() or needsStackRealignment() are relevant here.

For the other bits, I guess they're directly related to what you're doing. Maybe leave comments in the relevant parts of prologue and epilogue lowering noting that they need to stay in sync with homogeneousPrologEpilog.


If fixing this is going to substantially change the way the implementation looks, I'd rather not review it twice.

kyulee updated this revision to Diff 296231.Oct 5 2020, 10:56 AM

This rewrites the helper functions:

  • One SP Adjustment to reduce micro-opts for the better perf.
  • Align Linux/Darwin implementation in the same way . X29/X30 is stored at the call-site and that stack space is skipped in the prolog helper.
  • Add some comments and expand the enabling check.

I'm not sure how we want to enable this via the clang flag -moutline .
I think the initial approach is a bit conservative so that this change is kicked in only when -moutline AND minsize opts are applied. Let me know about this.

@efriedma Could you have a chance to take a look this again?

kyulee updated this revision to Diff 304361.Tue, Nov 10, 5:32 PM


I would appreciate if anyone can review this.

t.p.northover added inline comments.Wed, Nov 11, 5:46 AM

What does adding two identical mem-operands achieve? I assumed it would be treated identically to just one.


I think it would be clearer if the LRIter became LRIdx, folding the std::distance thing into the same line as the find (or by another means). We don't really use it as an iterator and spreading the calculation out confuses me.


This calculation is quite dense and the dynamic sequence of stores we're aiming for isn't really explicit in any one location (comment at the top of the file?). I had to copy/paste it into a main.cpp and run some test cases before I really got what it was trying to do.

Also, and I think this is part of the same problem, it's not obvious that we even save all registers we need to. The loop looks like it's going to skip the last two elements unless the if is executed, and I had to really think to convince myself that matched the case where the last two are LR/FP and so already saved.

Sitcking to the separate emitStore calls it might be better as

// If the register stored to the lowest address is not LR, we must subtract more from SP here.
if (LRIdx != Size - 2)

though there's an argument for merging them

// If LR is not stored at the lowest address (i.e. the final pair in the list) then
// the caller will only have subtracted part of the callee save area from SP. 
// Our first store must do the rest.
int RemainingSPAdjustment = Size - 2 - LRIdx;

for (int I = Size - 2; I >= 0; I -= 2) {
  if (Regs[I] == AArch64::LR)
  emitStore(..., Regs[I], Regs[I+1], Size - 2 - I - RemainingSPAdjustment, RemainingSPAdjustment != 0);
  RemainingSPAdjustment = 0;

I know we're not hugely concerned about performance (the indexing is minor, for example), but this seems to misalign the BL/RET chain that CPUs often hard-code into their branch predictors. I seem to remember reading that's an absolutely horrible thing to do to a CPU.

Fortunately, the native AArch64::RET instruction does actually take a register operand (it's only difference from BR is precisely that it's a hint to the CPU that you're popping from any microarchitectural return-address stack it may be keeping). So probably best to use that.


Is this really a dynamic property? The frame layout is pretty solidly dictated by the ABI so the offset is known from the list of registers we're saving.

Even if it is dynamic, perhaps it's best to encode into HOM_prolog as an operand rather than re-deriving it by grubbing around the basic block.


Let the compiler convert a divide to a shift, it's not the 80s.

smeenai added inline comments.Wed, Nov 11, 11:49 AM

If you're compiling without assertions, the compiler wouldn't be able to just generate a shift directly (because it has to account for negative numbers), right? You'd need something like a __builtin_assume to get just the shift to be generated. (See