Page MenuHomePhabricator

[PowerPC] Try to move the stack pointer update instruction later in the prologue and earlier in the epilogue (Version 2)

Authored by stefanp on Jan 26 2018, 11:02 AM.



This is the second attempt to move the stdu instruction in the prologue and epilogue.
The first attempt was D41737 but that contained a bug that was exposed by "make test" in libvpx.

In order to fix that issue the transformation is turned off for functions that require frame index scavenging. In order to determine that information the requiresFrameIndexScavenging had to be implemented for PowerPC.

Diff Detail


Event Timeline

stefanp created this revision.Jan 26 2018, 11:02 AM
lei added inline comments.Jan 30 2018, 8:57 AM
316 ↗(On Diff #131626)

maybe an early exit here instead...
if (FrIdx >=0) continue;

syzaara added inline comments.Jan 30 2018, 11:09 AM
842 ↗(On Diff #131626)

form -> from

852 ↗(On Diff #131626)

Can use a range based for loop.

1396 ↗(On Diff #131626)

Can use range based for loop here as well.

stefanp updated this revision to Diff 132261.Jan 31 2018, 1:39 PM

Fixed the issues mentioned in previous comments.

stefanp marked 4 inline comments as done.Jan 31 2018, 1:40 PM
nemanjai added inline comments.Feb 1 2018, 3:46 PM
313 ↗(On Diff #132261)

We use spaces around binary/assignment operators. Maybe just clang-format-diff the patch.

320 ↗(On Diff #132261)

I'm not a fan of this solution. It provides yet another place we check for the register class for a physical register without a clear explanation for *why* we care about the register class.

I'd much prefer a unified solution between StoreRegToStackSlot(), isStoreToStackSlot() and requiresFrameIndexScavenging(). What I'm thinking is something along the lines of:

static const unsigned OpcodesForSpills[] = { PPC::STD, PPC::STW, ... };
PPCInstrInfo::getOpcodeForSpill(unsigned Reg, const TargetRegisterClass *RC = nullptr);

That way we'd have a single definitive list of opcodes that are used for spilling registers and wouldn't have to keep this delicate dance of keeping multiple functions in sync.

  • isStoreToStackSlot() would just check the array to see if its opcode is in there
  • StoreRegToStackSlot() and requiresFrameIndexScavenging() would use getOpcodeForSpill() with the register class or physical register respectively
  • getOpcodeForSpill() would just compute the index into the array based on the register class and target features and return the respective element

Of course, it doesn't have to be done that way, but any solution that would unify this would be good.

nemanjai added inline comments.Feb 2 2018, 5:51 AM
857 ↗(On Diff #132261)

I imagine this is impossible, but it may not be a terrible idea to assert that you haven't somehow iterated past a terminator.

1387 ↗(On Diff #132261)

There are a few places where there are extra spaces in comments. But I'm sure that it'll all get fixed up when you run clang-format-diff.

1398 ↗(On Diff #132261)

This probably applies both here and above, but doesn't the condition FrIdx > 0 actually mean we should not do this at all? I'm not sure if it's possible, but what if you had frame indices that are say { negative, negative, positive, negative } as you iterate? Wouldn't you want to leave the function alone then?

Also, what is it about FrIdx == 0 that prevents you from moving the stack pointer update up past it? For that matter, there should be a comment why we require the condition at all.

Finally, please add a comment as to why it is not necessary to manually update the offsets for the CSR spills and restores - we just move the stack ptr update and something else figures out the offsets.

stefanp updated this revision to Diff 140108.Mar 28 2018, 11:11 AM

After some code cleanup that was required for requiresFrameIndexScavenging here is the new version of this patch.
It should take into consideration all of the reviewer's comments.

Overall I'm very happy with how this patch looks now. The cleanup really allowed this patch to flow much more clearly. However, I think you still haven't addressed the comment

327 ↗(On Diff #140108)


inouehrs added inline comments.Apr 12 2018, 10:17 PM
1552 ↗(On Diff #140108)

If the latency of MTLR is the critical path in the epilogue, can we move this load before stack pointer update (with adjusted offset) to hide the latency further? (But this can be a separate patch.)

stefanp added inline comments.Apr 16 2018, 10:41 AM
1552 ↗(On Diff #140108)

That's a good point.
When I did the original performance tests I had moved the the mtlr past the callee restores and not all the way past the stack pointer update.
I'm going to update this patch without that change and then I'll put in another patch with that change alone.

stefanp updated this revision to Diff 142663.Apr 16 2018, 10:43 AM

I've updated the negative frame indices to use isFixedObjectIndex which is cleaner.

lei added inline comments.Apr 26 2018, 1:32 PM
855 ↗(On Diff #142663)

This if can either be merged into the one below... if (FrIdx <0 && MFI.isFixedObjectIndex ...)
or do an early exit of this loop iteration if (FrIdx >=0) continue;

lei added inline comments.Apr 26 2018, 1:45 PM
1399 ↗(On Diff #142663)

merge this if statement with the nested if

1387 ↗(On Diff #132261)

I don't think you addressed this issue with the extra spaces in comments. From what I can see we don't put an extra space after // when continuing a comment from the previous line.

stefanp updated this revision to Diff 145058.May 3 2018, 11:12 AM

Address comments from previous review.

inouehrs added inline comments.Jun 7 2018, 1:45 AM
1240 ↗(On Diff #145058)

Since we have only one MovingStackUpdateDown, all CSIs are modified back even only a part of them were update above.

nemanjai added inline comments.Aug 14 2018, 10:23 AM
1240 ↗(On Diff #145058)

@stefanp Can you respond to this comment? Is this a problem? Can we get into a situation where we end up with incorrect offsets if not all of the callee-saved spills/restores have been moved/updated?

stefanp updated this revision to Diff 169076.Oct 10 2018, 1:08 PM

I'm sorry it took so long for me to look at this review.

I think you are correct Hiroshi in that what I was doing is not safe. I've added a check where the operation is aborted if not all of the Callee Saved Info is updated at the same time.
Also added a couple of missing lines to getStoreOpcodeForSpill and getLoadOpcodeForSpill that was missed when the Signal Processing Engine work was done. This was exposed by this patch.

nemanjai requested changes to this revision.Dec 29 2018, 3:17 PM

I think this is close to ready but there are a few comments that have to be addressed. Also, @hfinkel and @inouehrs do you see anything that needs to be handled in addition to what I commented on?

866 ↗(On Diff #169076)

No need to make the reader work it out. Add something like this to the comment:

(i.e. a frame size of 32768 bytes satisfies isLargeFrame here,
but not in the epilogue inserter).
867 ↗(On Diff #169076)

Why do we need this? How come we can't just use RegInfo defined above? It should be the same object should it not?

875 ↗(On Diff #169076)

This needs a comment explaining why we need this check and also why failing this check with one of the Callee Saved Register spills doesn't require us to abort the operation the way failing the subsequent check does.

1421 ↗(On Diff #169076)

The condition here looks like a repeat of the one in the prologue inserter. Can we actually extract this into a function that can be queried by both? Then this can be changed to something like:
if (stackUpdateCanBeMoved()) or something equivalently simple.

1426 ↗(On Diff #169076)

Same comment as a similar check in the prologue inserter.

325 ↗(On Diff #169076)

"saved info" doesn't really mean anything in this context. The CalleeSavedInfo type is a structure containing information about Callee Saved Registers.

329 ↗(On Diff #169076)

Same as above.

332 ↗(On Diff #169076)


345 ↗(On Diff #169076)

I am not sure I really follow when we need FI scavenging...

I was under the impression that we might need it if:

  • The frame size is too large so we need to use an X-Form store/load for the spill/restore
  • The alignment of the spill/restore is lower than what is required by the D-Form (4 for DS-Form, 16 for DQ-Form)
  • The only opcode for the spill/restore is an X-Form

This if statement certainly seems to accomplish the last of those but I don't see anything that accomplishes the other two. The first can certainly be an early exit before the loop.

This revision now requires changes to proceed.Dec 29 2018, 3:17 PM
hfinkel added inline comments.Dec 30 2018, 8:58 AM
866 ↗(On Diff #169076)

I think this is restricted to frames where the relevant saves (i.e., whatever is stored before the stack-pointer update is performed) must be less than the red-zone size, which at least under ELF v2 is 288 bytes (see ABI spec: Protected Zone). Maybe this is handled below somehow, but so long as we have a comment here explaining everything, we should explain how this factors in as well.

It is, however, not clear to me where you are checking this below. If you're not, I'm pretty sure that's a significant problem. If we're saving more than just GPRs, we can easily exceed 288 bytes. An explicit test case this would also be a very-good idea.

nemanjai added inline comments.Dec 30 2018, 2:52 PM
866 ↗(On Diff #169076)

We are not actually spilling into the Red Zone here (or at least not if we wouldn't in the original code) but into the current stack frame. This just moves the stack pointer update instruction around. However, the idea is to save into the same stack frame/slot by adjusting the offset in each spill to account for how much the stack pointer will be updated by.
Namely, change something like (the actual values have nothing to do with reality, just an illustration):

stdu 1, -48(1)
std 29, 16(1)
std 30, 8(1)

to something like

std 29, -32(1)
std 30, -40(1)
stdu 1, -48(1)

That way all the stores can be dispatched in parallel rather than the CSR spills for R29 and R30 waiting for the update to R1 to actually complete. But we are still saving everything to the exact same memory location.

But I agree that we should add further testing to clearly illustrate such a change in behaviour.

hfinkel added inline comments.Dec 30 2018, 5:21 PM
866 ↗(On Diff #169076)

We are not actually spilling into the Red Zone here (or at least not if we wouldn't in the original code) but into the current stack frame.

But we are, as you've illustrated with your example...

stdu 1, -48(1)
std 29, 16(1)
std 30, 8(1)

here we adjust the stack pointer first, and all stores are to addresses above the stack pointer.

std 29, -32(1)
std 30, -40(1)
stdu 1, -48(1)

here we adjust it afterward, and so the first two stores are below the current stack pointer. That's into the red zone. It becomes not the red zone only after you adjust the stack pointer. But, if we have:

std 29, -332(1)
std 30, -340(1)
; interrupt handler runs here!
stdu 1, -348(1)

(I made the numbers bigger to illustrate the issue)

then we have a potential problem. The FP save area can be up to 256 bytes, and then comes the GPR area. Then the vector area (which can be up to 512 bytes). Thus, we can certainly go beyond 288 bytes.

stefanp updated this revision to Diff 183100.Jan 23 2019, 7:59 AM

Addressed reviewer comments.
Limited optimization to cases where the fixed called saved regs fit in the red zone of 288 bytes.

jsji added a comment.Jan 23 2019, 8:42 AM

Some quick notes.

786 ↗(On Diff #183100)

Can we use getRedZoneSize instead of hardcoded 288 here?

790 ↗(On Diff #183100)

We should check ABI before checking framesize? As the red zone size is different for different ABIs.
eg: DarwinABI has a 224-byte red zone. PPC32 SVR4ABI(Non-DarwinABI) has no red zone and PPC64 SVR4ABI has a 288-byte red zone.

stefanp updated this revision to Diff 183590.Jan 25 2019, 12:01 PM

Addressed comments from Jinsong.

I've removed the magic number and I'm using getRedZoneSize() now.
I've also moved the isPPC64() check up as well as adding an isELFv2ABI() check.

Please add the following test cases:

  1. A stack frame that is just a bit larger than required
  2. A stack frame with a CSR save that is not a fixed object
  3. A stack frame where we have to use scavenging
  4. A test case where everything fits and it saves multiple register types (GPR, FPR, VSR)

I would recommend that you produce the CHECK directives using the script in utils/ and commit it first as an NFC patch so that this patch clearly shows the differences in the code emitted.

515 ↗(On Diff #183590)

This seems to duplicate code. Can we not implement the update version in terms of this function? Seems that the only thing that slightly complicates things is the computation of maxCallFrameSize so the update form should probably have that as a pointer output parameter with a nullptr default.

777 ↗(On Diff #183590)

Combine the subtarget feature check early exits.

801 ↗(On Diff #183590)


809 ↗(On Diff #183590)

It is not clear why? I presume because the scavenger can add further spills? In any case, the comment should mention why.
return !RegInfo->requiresFrameIndexScavenging(MF);

973 ↗(On Diff #183590)

Won't we be broken if we do this to some of the CSR's and then we bail because one of them doesn't satisfy this condition below? It seems that we need to either defer any offset updates until we are sure we're moving the stack ptr update or keep track of any we updated so we can backtrack if we need to.

78 ↗(On Diff #183590)

callee saved instructions sounds odd. Perhaps CSR save/restore code?

335 ↗(On Diff #183590)

I think a constant such as 0x7FFF communicates intent more clearly. Furthermore, a condition such as this may clearly illustrate that we can't have any higher bits set:
if (FrameSize & ~0x7FFF)

Committed revision 353994 with the test cases as NFC.
Will update the patch for the remaining comments soon...

stefanp updated this revision to Diff 186779.Feb 13 2019, 5:16 PM

Fixed according to review comments.

nemanjai accepted this revision.Feb 27 2019, 5:01 AM

LGTM with the minor comments addressed on the commit.

Looking at the test case has actually made me think of something else that might be worthwhile following up on...
Namely, I think there would be general benefit of breaking the dependence of the link register restore from the stack pointer update when we are not moving the stack pointer update.

For example, the following change might be a significant improvement in the epilogue:

CSR restores
addi 1, 1, <Value>
ld 0, 16(1)
mtlr 0


CSR restores
ld 0, <Value> + 16(1)
mtlr 0
addi 1, 1, <Value>

This would allow the scheduler to schedule the ld->mtlr sequence for restoring the link register in a way that the CSR restores hide the latency of the sequence. Furthermore, that should be a safe transformation regardless of the size of the stack frame since any interrupt handlers that run during the epilogue of a function must restore the LR when resuming control.

738 ↗(On Diff #186779)

s/until the stack pointer is moved/until the stack pointer is updated

751 ↗(On Diff #186779)


Also, a comment that quite literally just says in words what the code clearly says isn't useful. It is more useful to say why we need the check. Perhaps something like:

// Calls to fast_cc functions use different rules for passing parameters on
// the stack from the ABI and using PIC base in the function imposes
// similar restrictions to using the base pointer. It is not generally safe
// to move the stack pointer updatein these situations.

It might be reasonable to also combine these into a single condition, but this is completely a matter of personal preference.

1485 ↗(On Diff #186779)

s/stack update pointer/update of the stack pointer

1492 ↗(On Diff #186779)

// Abort the operation as we can't update all CSR restores.

This revision is now accepted and ready to land.Feb 27 2019, 5:01 AM
This revision was automatically updated to reflect the committed changes.
Herald added a project: Restricted Project. · View Herald TranscriptFeb 28 2019, 4:22 AM