This is an archive of the discontinued LLVM Phabricator instance.

Add a shrink-wrapping pass to improve the placement of prologue and epilogue.
ClosedPublic

Authored by qcolombet on Apr 22 2015, 3:50 PM.

Details

Summary

Hi,

@Jim, putting you as reviewer as you seemed to have touched PEI more than most people :).

This patch introduces a new pass that computes the safe point to insert the prologue and epilogue of the function.
The interest is to first safe points that are cheaper than the entry and exits blocks.

  • Context **

Currently we insert the prologue and epilogue of the method/function in the entry and exits blocks. Although this is correct, we can do a better job when those are not immediately required and insert them at less frequently executed places.
The job of the shrink-wrapping pass is to identify such places.

  • Motivating example **

Let us consider the following function that perform a call only in one branch of a if:
define i32 @f(i32 %a, i32 %b) {

%tmp = alloca i32, align 4
%tmp2 = icmp slt i32 %a, %b
br i1 %tmp2, label %true, label %false

true:

store i32 %a, i32* %tmp, align 4
%tmp4 = call i32 @doSomething(i32 0, i32* %tmp)
br label %false

false:

%tmp.0 = phi i32 [ %tmp4, %true ], [ %a, %0 ]
ret i32 %tmp.0

}

On AArch64 this code generates (removing the cfi directives to ease readabilities):
_f: ; @f
; BB#0:
stp x29, x30, [sp, #-16]!
mov x29, sp
sub sp, sp, #16 ; =16
cmp w0, w1
b.ge LBB0_2
; BB#1: ; %true
stur w0, [x29, #-4]
sub x1, x29, #4 ; =4
mov w0, wzr
bl _doSomething
LBB0_2: ; %false
mov sp, x29
ldp x29, x30, [sp], #16
ret

With shrink-wrapping we could generate:
_f: ; @f
; BB#0:
cmp w0, w1
b.ge LBB0_2
; BB#1: ; %true
stp x29, x30, [sp, #-16]!
mov x29, sp
sub sp, sp, #16 ; =16
stur w0, [x29, #-4]
sub x1, x29, #4 ; =4
mov w0, wzr
bl _doSomething
add sp, x29, #16 ; =16
ldp x29, x30, [sp], #16
LBB0_2: ; %false
ret

Therefore, we would pay the overhead of setting up/destroying the frame only if we actually do the call.

  • Proposed Solution **

This patch introduces a new machine pass that perform the shrink-wrapping analysis (See the comments at the beginning of ShrinkWrap.cpp for more details). It then stores the safe save and restore point into the MachineFrameInfo attached to the MachineFunction.
This information is then used by the PrologEpilogInserter (PEI) to place the related code at the right place. This pass runs right before the PEI.

Unlike the original paper of Chow from PLDI’88, this implementation of shrink-wrapping does not use expensive data-flow analysis and does not need hack to properly avoid frequently executed point. Instead, it relies on dominance and loop properties.

The pass is off by default and each target can opt-in by setting the EnableShrinkWrap boolean to true in their derived class of TargetPassConfig. This setting can also be overwritten on the command line by using -enable-shrink-wrap.

Before you try out the pass for your target, make sure you properly fix your emitProlog/emitEpilog/adjustForXXX method to cope with basic blocks that are not necessarily the entry block.

  • Design Decisions **
  1. ShrinkWrap is its own pass right now. It could frankly be merged into PEI but for debugging and clarity I thought it was best to have its own file.
  2. Right now, we only support one save point and one restore point. At some point we can expand this to several save point and restore point, the impacted component would then be:
  3. The pass itself: New algorithm needed.
  4. MachineFrameInfo: Hold a list or set of Save/Restore point instead of one pointer.
  5. PEI: Should loop over the save point and restore point.

Anyhow, at least for this first iteration, I do not believe this is interesting to support the complex cases. We should revisit that when we motivating examples.

That being said, the target specific code should not change, which is another point for not blocking the optimization on that :).

  • Feedback Needed **

Right now, I haven’t added any new target hook, but I am wondering if some more would make sense.
In particular:

  1. Should we have a target hook to be able to fix something on the entry block if this one was not the save point?
  2. Same question with all the exit blocks?

For #2, for instance, ARM needs to expand the TCRETURN pseudo-instruction, but this can be done in the expand pseudo pass. Therefore, I do not know if this is actually needed.

  • What Is Next? **

I have patches to enable this for AArch64 and ARM and I think I will look into X86 as well. For the record, this implementation of shrink-wrapping applies on about 20% of the function for both O3 and Os with no-regressions and a few improvements.
PGO would certainly helped, but I haven’t tried.

Thanks for your feedbacks,
-Quentin

Diff Detail

Repository
rL LLVM

Event Timeline

qcolombet updated this revision to Diff 24260.Apr 22 2015, 3:50 PM
qcolombet retitled this revision from to Add a shrink-wrapping pass to improve the placement of prologue and epilogue..
qcolombet updated this object.
qcolombet edited the test plan for this revision. (Show Details)
qcolombet added a reviewer: grosbach.
qcolombet set the repository for this revision to rL LLVM.
qcolombet added a subscriber: Unknown Object (MLST).

Out of curiosity there a reason you didn't just do the standard
AVAIL/ANTIN version of this like PEI does?

I was looking at the old version, but ...

https://llvm.org/svn/llvm-project/llvm/tags/RELEASE_26/lib/CodeGen/ShrinkWrapping.cpp

This is essentially the Chow paper, but

  1. The dataflow is not really expensive (It's O(N), the same as your

current algorithm). If it is, you should do it differently ;) It can
be computed really fast. Definitely faster than computing
post-dominators.

  1. The chow paper handles multiple save/restore points :)

If your concern is #1, that's solvable, easily.
If the concern is "avoiding placement into bad blocks", that's also
solvable. Chow does not use the other lazy code motion calculations
for some reason. If it did, it would not place it into loops :)

In fact, it will give you the placement points, and you can choose
which you want, and then it will give you the insertion/deletions to
make that happen.

Hi Quentin,

Thanks for working on this - I think shrink wrapping is a very useful optimization to have.
My comments are mostly nit picks - feel free to ignore them if they don't make sense as a result of me missing something.

Overall, I guess that this needs support in at least one backend, so regression tests can be added before this is in a state ready to commit?

Thanks,

Kristof

lib/CodeGen/MachineFunction.cpp
626–627

I don't understand what "... than do not cross Restore." means here.
Should the comment just be "Starting from MBB, check if there is a patch leading to Save"?

634–637

"Since we do not reach" -> "If we do not reach"?

lib/CodeGen/PrologEpilogInserter.cpp
379–380

"basic block" -> "basic blocks"

389–392

I'm wondering if the logic overall would get simpler if MFI->getSavePoint() would always return a MachineBasicBlock,
i.e. returning Entry if shrink wrapping didn't change the default save point?
That way, the logic in most functions doesn't need to handle the case separately of whether or not the save block
and entry block are the same - generalizing the logic a bit more.
But maybe there's a good reason to do it like this that I haven't noticed?

lib/CodeGen/ShrinkWrap.cpp
186–210

This seems like the most critical function to get right from a correctness point-of-view?
I can't derive from just staring at this code whether this is also going to catch code generated for VLAs,
or inline assembly fragments touching the frame, or registers such as stack pointer, frame pointer, base pointer.

I'm guessing this could be good cases to add as regression tests - i.e. a case where only the presence of a VLA prevents the optimization & a case where only the presence of a piece of inline assembly prevents the optimization?

I have a patch that does this for Hexagon. It's implemented as a part of PEI, specifically in HexagonFrameLowering::insertPrologue.

The problem with multiple prologs is that this can increase code size, plus it's a lot harder to identify the set of registers that need to be saved in each prolog. You don't want to be saving more registers than you need.

One other concern I have is that the way PEI is implemented leaves little room for a target to decide what to do. With the shrink-wrapping in place, we either get none of it or all of it. For most applications on Hexagon, both code size and performance are important. The decision as to whether we want multiple prologs may be the result of an analysis of the entire function. As a matter of fact we may prefer to have a single prolog, but not in the entry block even if the shrink-wrap pass finds multiple locations.

Hi Kristof,

Thanks for your feedbacks.

Overall, I guess that this needs support in at least one backend, so regression tests can be added before this is in a state ready to commit?

Whatever people prefers. I have a patch to add the support for shrink-wrapping for AArch64, I can just merge it with this one with the path disabled by default.

Cheers,
-Quentin

lib/CodeGen/MachineFunction.cpp
626–627

That is a typo indeed, this is “that do not cross Restore” :).
This is the important part in fact. Restore is the kill point of the region that needs CSR to be saved and past the kill point, there is nothing to track.

634–637

That works too!

lib/CodeGen/PrologEpilogInserter.cpp
389–392

The rational was that this is homogenous with the way we handle the Restore point. I.e., current shrink-wrapping just give Restore point, whereas PEI can have multiple of them. So I did not want to unbalanced this.

I guess I can also push the list of exits blocks in MFI if you think that is better.

lib/CodeGen/ShrinkWrap.cpp
186–210

If the callee-saved information is properly set, it should… Not sure how the VLA are handled, definitely a good regression test to add!
For inline assembly, I guess you are right and we could play it safe.

Additionally, we could add a target hook to check whether or not a given instruction should be after the prologue and before the epilogue. At the moment, I had needed this but if it proves to be useful we can definitely do that.

Hi Krzysztof,

qcolombet updated this revision to Diff 24424.Apr 24 2015, 6:38 PM
  • Fix typos.
  • Add the support for AArch64
  • Add regressions tests for AArch64.
  • Improve the debug messages.
  • Fix the handling of empty functions.
qcolombet added inline comments.Apr 24 2015, 6:42 PM
lib/CodeGen/MachineFunction.cpp
626–627

Do I still need to do something about the comment?

lib/CodeGen/Passes.cpp
61

clang-format related. I can back that out.

lib/CodeGen/PrologEpilogInserter.cpp
389–392

What do you think?
Should I:

  1. Keep Restore and Save as they are?
  2. Create an asymmetry between their handling? (Don’t like that.)
  3. Push everything into MachineFrameInfo?
lib/CodeGen/ShrinkWrap.cpp
186–210

The regression tests did not show any problem with the current implementation.

kristof.beyls added inline comments.Apr 27 2015, 5:04 AM
lib/CodeGen/MachineFunction.cpp
626–627

If the comment reads "Starting from MBB, check if there is a path leading to Save that does not cross Restore", that makes sense to me.

lib/CodeGen/PrologEpilogInserter.cpp
389–392

Looking further into the details of how PEI is implemented at the moment, I'm starting to think that the best option probably is:

  • Get rid of the EntryBlock variable in class PEI, and replace it/change its name to "SaveBlock".
  • Get rid of the ReturnBlocks variable in class PEI, and replace it/change its name to "RestoreBlocks".

That way, the names of the variables reflect the intent better, i.e. the block where saves and restores need to be inserted.
This can be done as a separate NFC-patch.
After that, the shrink wrapping pass "just" optimizes the SaveBlock and RestoreBlocks values. With the current implementation, if the shrink wrapping actually does any optimization, RestoreBlocks will be a single-element vector.

Doing it this way, I think PEI code should not have to special case (i.e. have if statements) based on whether or not the shrink wrapping optimization happened or not. I hope.

I guess this corresponds to option 3, "push everything into MachineFrameInfo"?

kristof.beyls added inline comments.Apr 27 2015, 5:18 AM
lib/Target/AArch64/AArch64FrameLowering.cpp
543–544

I guess this code sequence will lead to DL sometimes being uninitialized.
I'm not sure when "MBB.end() != MBBI", but in that situation, can't you get a reasonable debug location from somewhere else?

If the RetOpcode variable is only used to detect tail call returns, maybe it's better to replace it with a bool variable called "isTailCallReturn" or something similar,
and use that later on, instead of explicitly checking what the value is?

qcolombet added inline comments.Apr 27 2015, 9:52 AM
lib/Target/AArch64/AArch64FrameLowering.cpp
543–544

I guess I can use MBB.findDebugLoc() with some sensible iterator.
I will check.

Same for RetOpcode, thanks for the feedback.

qcolombet updated this revision to Diff 24514.Apr 27 2015, 4:52 PM
qcolombet removed rL LLVM as the repository for this revision.
  • Use boolean instead of unsigned to determine whether or not this is a tail call return.
qcolombet added inline comments.Apr 28 2015, 9:58 AM
lib/Target/AArch64/AArch64FrameLowering.cpp
543–544

I haven’t changed the handling of the debug information. It is indeed consistent with what is done AArch64FrameLowering::spillCalleeSavedRegisters and AArch64FrameLowering::restoreCalleeSavedRegisters.
Therefore, I think that if we want to fix that, we should do it consistently at these three places.

What do you think?

Ping?

Thanks,
-Quentin

Hi Quentin,

As far as I can see, the only comment made that hasn't been addressed yet
is about renaming/replacing EntryBlock and ReturnBlocks in class PEI with
SaveBlock/RestoreBlocks. It seems PEI only needs to know what the Save and
Restore blocks are, and doesn't need to know which ones are the Entry or
Return blocks. Therefore, it doesn't seem a good idea to retain the EntryBlock
and ReturnBlocks variables in that class. But maybe you already looked
into that and there's a reason why EntryBlock and ReturnBlocks is needed
after all?

Apart from the above comment, which I think needs to be addressed, the patch
looks good to me.

Thanks,

Kristof

qcolombet updated this revision to Diff 24926.May 4 2015, 4:58 PM

Update the EntryBlock, ReturnBlocks fields in PEI.

Overall, it seems you've made a lot of changes in the Hexagon backend. I guess these are necessary to make the backend structured more like the other backends so that the SaveBlock/RestoreBlock changes in PEI can be done?
If so, are the changes basically refactoring, i.e. moving existing code to be placed behind certain interfaces, or did you need to write a lot of new code in the Hexagon backend that should be reviewed in detail?

lib/CodeGen/PrologEpilogInserter.cpp
145–160

Just a minor comment, feel free to ignore this one:
I feel that it would be a slightly better separation of concerns if MFI->getSavePoint() would always be set, and PEI doesn't need to calculate a SavePoint itself when it isn't set.
For this to happen, the ShrinkWrap pass should probably always be run, and when it's "switched off be default", it would just set MFI->SavePoint and MFI->Restore to
be equal to FN.begin() and to the ReturnBlocks respectively.
I.o.w., The second half of the code above would be executed in the ShrinkWrap pass.
If you'd push ahead and implement it like this, the name of the pass also should be changed to "SaveRestorePointInserter/Calculator" or something similar. It's then a pass
which always calculates the most appropriate Save and Restore blocks and happens to implement a ShrinkWrapping optimization.
As I said, maybe this is pushing for too much separation of concerns - I'm not sure what the negative consequences would be to require the "SaveRestorePointInserterPass" to always be run, if any.

785–786

s/RetBlock/RestoreBlock/?
The basic blocks iterated through are "Restore blocks", not necessarily "Return blocks"?

lib/Target/ARM/ARMFrameLowering.cpp
1691–1693

It's unclear to me why this change is needed?
If only implementing the AArch64-backend-specific functionality, how come you need to make a change in the AArch32-backend?

qcolombet accepted this revision.May 5 2015, 10:43 AM
qcolombet added a reviewer: qcolombet.

Thanks all for the feedbacks.

Committed revision 236507.

Cheers,
-Quentin

This revision is now accepted and ready to land.May 5 2015, 10:43 AM
mcrosier closed this revision.Jun 23 2015, 8:31 AM
mcrosier added a subscriber: mcrosier.

Committed in r236507.