This is an archive of the discontinued LLVM Phabricator instance.

[AArch64] Homogeneous Prolog and Epilog for Size Optimization
ClosedPublic

Authored by kyulee on Mar 22 2020, 11:31 AM.

Details

Summary

Prolog and epilog to handle callee-save registers tend to be irregular with different immediate offsets, which are not often being outlined (by machine outliner) when optimizing for size. From D18619, combining stack operations stretched irregularity further.
This patch tries to emit homogeneous stores and loads with the same offset for prolog and epilog respectively. We have observed that this homogeneous prolog and epilog significantly increased the chance of outlining, resulting in a code size reduction. However, it was still far from the minimum size code due to requiring the special handling of the return register, x30.
This patch custom-outlines prolog and epilog in place:

  • Injects HOM_Prolog and HOM_Epilog psuedo instructions in Prolog and Epilog Injection Pass
  • Lower and optimize them in AArchLowerHomogneousPrologEpilog Pass
  • Outlined helpers are created on demand. Identical helpers are merged by the linker.
  • An opt-in flag is introduced to enable this feature. Another threshold flag is also introduced to control the aggressiveness of outlining for application's need.

This reduced an average of 4% of code size on LLVM-TestSuite/CTMark targeting arm64/-Oz.

Diff Detail

Event Timeline

kyulee created this revision.Mar 22 2020, 11:31 AM

Hello. I like the idea. It's something we thought about internally but no-one has ever worked on enough to see how much of an improvement it gives in general.

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
229

Can you explain what CompactUnwindFrame is, and why it's needed for this to work? I'm not really an expert on Frame Lowering, but is there a way to get this to work without that restriction?

kyulee added inline comments.Mar 30 2020, 10:22 AM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
229

This guarantees a pair register use and the ordering of CSR save locations are fixed, which simplifies the current logic for correctness. An immediate remedy to this is to support the non-pair register case, but I'm not either an expert for all other platforms and calling conventions, and I'm not sure how I test and validate such a relaxation. I think probably it's better leaving this extension to folks who know the details for their platforms.

kyulee updated this revision to Diff 276315.Jul 7 2020, 11:28 PM

Updating for Linux support.

plotfi added inline comments.Jul 8 2020, 7:40 PM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
1634

Is the bit that was removed a non-functional change here? If so, can this be a separate NFC commit?

plotfi added inline comments.Jul 8 2020, 10:18 PM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
241

This is so small that I feel it would be more descriptive at the call site of SavedRegs.set/test to have:

/// true if CSRs should be paired
const bool producePairRegisters = produceCompactUnwindFrame(MF) || homogeneousPrologEpilog(MF);

With some additional comments on the register paring in the context of homogenous-prolog-epilog.

kyulee updated this revision to Diff 276640.Jul 9 2020, 12:18 AM

Refactor getArgumentPopSize() to D83456
Add comment on producePairRegisters()

kyulee updated this revision to Diff 276644.Jul 9 2020, 12:39 AM
kyulee marked 2 inline comments as done.

Fix for merge

kyulee marked an inline comment as done.Jul 9 2020, 12:41 AM
kyulee added inline comments.
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
241

Updated the comment.

1634

It's refactored to D83456.

@kyulee Update please (since the NFC has landed), so that Harbormaster can run again without conflict.

kyulee updated this revision to Diff 276806.Jul 9 2020, 12:32 PM

Rebase after D83456

plotfi added inline comments.Jul 14 2020, 9:05 PM
llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
250

I think this one could potentially also be a separate NFC as well.

@t.p.northover Hi Tim, would you have some bandwidth to take a look at this patch for homogeneous prolog/epilog callee-save registers opti?

kyulee updated this revision to Diff 278392.Jul 16 2020, 1:19 AM
kyulee marked an inline comment as done.

Fix for Outlined Epilog for a Tail-Call

Added a case x16 is live across the epilog helper, which is bailed-out.

@sdesmalen @efriedma @dmgreen Hi Sander, Eli, Dave. Would any of you have some time to help review this Prolog Epilog Size optimization patch? Much appreciated if you do. Me and @kyulee would be available to chat on IRC or discord about it for more info.

Have you looked at the RISCV prologue/epilogue lowering support? How does this compare?

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
244

A long list of random conditions like this is asking for trouble: someone is inevitably going to miss some case where it needs to be updated in the future. Is this really the only way?

llvm/test/CodeGen/AArch64/arm64-homogeneous-prolog-epilog-frame-tail.ll
72

If we're going to save fp/lr in the caller, can't we just save them to the correct position, instead of copying them to a new location in the outlined function?

Also, in general, pre-increment instructions are more expensive then non-pre-increment instructions. It would be better to rearrange the operations so you can emit exactly one pre-increment instruction, instead of making every instruction pre-increment.

kyulee added a comment.Oct 1 2020, 9:21 AM

@efriedma Unlike other work that uses a helper call to runtime library, this work synthesizes outlined functions with a custom-calling convention. This is handy and flexible to extend this logic for other patterns or purposes like instrumentation, etc inside the helper within the compiler.

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
244

This patch is meant to find a regular/homogeneous stack frame as possible that can be easily outlined for the size. In theory, all these random conditions/restrictions can be relaxed with more thorough implementation, but I think this may be good to start with. Let me know how I can improve this.

llvm/test/CodeGen/AArch64/arm64-homogeneous-prolog-epilog-frame-tail.ll
72

I agree this is not ideal for the Linux case which has the different (opposite) order of CSR than Darwin/iOS which does not require this pattern (see the above without SAVELR case). The initial version disallowed this while supporting compact unwind case (Darwin) only but from @dmgreen suggestion, I added this Linux support, but tries to keep as the same pattern as possible instead of introducing different shape at the call-site. This is an important aspect when we want to extend this homogeneous function entry for other purpose (like instrumentation, etc) which actually I'm working on in another context.

I understand pre/post-inc/dec uses are more expensive, and this code can be optimized with additional complexity.
The goal is to minimize the code by exposing the homogeneous patterns and outlining them at the cost of performance -- in fact, outlining itself may hurt perf significantly on a CPU bound case although page-fault or working set is improved for the large binary with this aggressive size optimization.
@efriedma Can I follow up the specialized/optimized outlined function after this patch? Or I can work on the improved version in place now.

Should this be controlled by the clang -moutline/-mno-outline?

llvm/lib/Target/AArch64/AArch64FrameLowering.cpp
244

I'm not sure how hasVarSizedObjects() or needsStackRealignment() are relevant here.

For the other bits, I guess they're directly related to what you're doing. Maybe leave comments in the relevant parts of prologue and epilogue lowering noting that they need to stay in sync with homogeneousPrologEpilog.

llvm/test/CodeGen/AArch64/arm64-homogeneous-prolog-epilog-frame-tail.ll
72

If fixing this is going to substantially change the way the implementation looks, I'd rather not review it twice.

kyulee updated this revision to Diff 296231.Oct 5 2020, 10:56 AM

This rewrites the helper functions:

  • One SP Adjustment to reduce micro-opts for the better perf.
  • Align Linux/Darwin implementation in the same way . X29/X30 is stored at the call-site and that stack space is skipped in the prolog helper.
  • Add some comments and expand the enabling check.

I'm not sure how we want to enable this via the clang flag -moutline .
I think the initial approach is a bit conservative so that this change is kicked in only when -moutline AND minsize opts are applied. Let me know about this.

@efriedma Could you have a chance to take a look this again?

kyulee updated this revision to Diff 304361.Nov 10 2020, 5:32 PM

Rebase.

I would appreciate if anyone can review this.

t.p.northover added inline comments.Nov 11 2020, 5:46 AM
llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp
216–218

What does adding two identical mem-operands achieve? I assumed it would be treated identically to just one.

304

I think it would be clearer if the LRIter became LRIdx, folding the std::distance thing into the same line as the find (or by another means). We don't really use it as an iterator and spreading the calculation out confuses me.

306

This calculation is quite dense and the dynamic sequence of stores we're aiming for isn't really explicit in any one location (comment at the top of the file?). I had to copy/paste it into a main.cpp and run some test cases before I really got what it was trying to do.

Also, and I think this is part of the same problem, it's not obvious that we even save all registers we need to. The loop looks like it's going to skip the last two elements unless the if is executed, and I had to really think to convince myself that matched the case where the last two are LR/FP and so already saved.

Sitcking to the separate emitStore calls it might be better as

// If the register stored to the lowest address is not LR, we must subtract more from SP here.
if (LRIdx != Size - 2)
  emitStore(...);

though there's an argument for merging them

// If LR is not stored at the lowest address (i.e. the final pair in the list) then
// the caller will only have subtracted part of the callee save area from SP. 
// Our first store must do the rest.
int RemainingSPAdjustment = Size - 2 - LRIdx;

for (int I = Size - 2; I >= 0; I -= 2) {
  if (Regs[I] == AArch64::LR)
    continue;
  emitStore(..., Regs[I], Regs[I+1], Size - 2 - I - RemainingSPAdjustment, RemainingSPAdjustment != 0);
  RemainingSPAdjustment = 0;
}
354

I know we're not hugely concerned about performance (the indexing is minor, for example), but this seems to misalign the BL/RET chain that CPUs often hard-code into their branch predictors. I seem to remember reading that's an absolutely horrible thing to do to a CPU.

Fortunately, the native AArch64::RET instruction does actually take a register operand (it's only difference from BR is precisely that it's a hint to the CPU that you're popping from any microarchitectural return-address stack it may be keeping). So probably best to use that.

368

Is this really a dynamic property? The frame layout is pretty solidly dictated by the ABI so the offset is known from the list of registers we're saving.

Even if it is dynamic, perhaps it's best to encode into HOM_prolog as an operand rather than re-deriving it by grubbing around the basic block.

407

Let the compiler convert a divide to a shift, it's not the 80s.

smeenai added inline comments.Nov 11 2020, 11:49 AM
llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp
407

If you're compiling without assertions, the compiler wouldn't be able to just generate a shift directly (because it has to account for negative numbers), right? You'd need something like a __builtin_assume to get just the shift to be generated. (See https://godbolt.org/z/vzx9hx)

Hi, I may have found a problem.

It says in the official document ---- the setter and getter of properties will become direct if objc_direct_members is placed on the @implementation block:

When objc_direct_members is placed on an @implementation block, methods defined in the block are considered to be declared as direct unless they have been previously declared as non-direct in any interface of the class. This includes the implicit method definitions introduced by synthesized properties, including auto-synthesized properties.

But I found that this way didn't make property's setter or getter became direct method.
I wrote 3 demos, the code is as follows:

//normal_excutable
@interface ViewController ()
@property (nonatomic, copy) NSString *name;
@end

@implementation ViewController
@end
//use_objc_direct_members
@interface ViewController ()
@property (nonatomic, copy) NSString *name;
@end

__attribute__((objc_direct_members))
@implementation ViewController
@end
//use_direct_property
@interface ViewController ()
@property (nonatomic, copy, direct) NSString *name;
@end

@implementation ViewController
@end

Compile them separately and compare their Mach-O products:

From the picture:

  • In normal executable, the name and setName: are in the ViewController's method list, there are not direct methods.
  • When placed __attribute__((objc_direct_members)) above @implementation, nothing is happened, the name and setName: are also not direct methods. This does not seem to be consistent with what is described in the documentation.
  • When used direct in property, it's works well, the name and setName: are not in the method list, they become the direct methods.

So, I don't know if I misunderstood the sentence:

"This includes the implicit method definitions introduced by synthesized properties, including auto-synthesized properties."

Or the document writes error ? Or the objc_direct_members has a bug?

I would be very, very grateful if you could answer my questions.

kyulee updated this revision to Diff 314990.Jan 6 2021, 2:29 PM

Update for @t.p.northover feedbacks

  • Pass FPOffset as an argument to expand it directly
  • Use ret instead of br in the epilog return
  • Clean up iterator use for register indexing
  • Use / instead of >> for the readability with unsigned type.
kyulee marked 7 inline comments as done.Jan 6 2021, 2:40 PM
kyulee added inline comments.
llvm/lib/Target/AArch64/AArch64LowerHomogeneousPrologEpilog.cpp
216–218

In fact, these mem-operands are not needed so I just delete them.

306

I took the former approach for the readability even though the latter merges all these code.

354

Nice to know this! I used RET instead of BR.

368

I just passed FPOffset as additional argument to HOM_Prolog. Indeed your suggestion cleans up the code quite a bit. Thanks!

407

Thanks for the point. As shown in https://godbolt.org/z/Wo5r4W, I use a division with unsigned type for readability, but it becomes a shift.

kyulee updated this revision to Diff 315091.Jan 7 2021, 3:59 AM
kyulee marked 5 inline comments as done.
  • Fix clang-tidy/format issues
kyulee updated this revision to Diff 315227.Jan 7 2021, 1:28 PM
  • Fix pre-merge checks by using variable_ops in pseudo.
kyulee updated this revision to Diff 317043.Jan 15 2021, 12:00 PM
  • Rebase it to the latest
  • Expand the epilog corner case where x16 lives across blocks
  • Fix a comment
kyulee updated this revision to Diff 317086.Jan 15 2021, 3:07 PM
  • Fix clang-tidy from pre-merge check

@ t.p.northover Could you take a look again when you're available? Thanks!

t.p.northover accepted this revision.Feb 1 2021, 4:51 AM

Thanks for applying the suggested fixes. I think I'm happy with it now.

This revision is now accepted and ready to land.Feb 1 2021, 4:51 AM