This is an archive of the discontinued LLVM Phabricator instance.

[SystemZ] Implement -fstack-clash-protection
ClosedPublic

Authored by jonpa on Apr 23 2020, 8:04 AM.

Details

Summary

So far, a first attempt at implementing this by looking at gcc and also the X86 llvm backend. Prologue and dynamic allocas handled - some questions remain:

  • Prologue:
    • GCC emits an 'lgr %r15,%r1' after the loop, which seems redundant, since it is known that %r15 has the value of %r1 already. Is this required to exist for some reason (omitted by patch for now)?
    • gcc seems not to be probing the residual allocation after the loop. However if only two (unrolled) allocations were made, the residual is also probed.
    • I am not aware of any real reason to not simply do the probing directly in emitPrologue(), but it seems wisest to do like X86 since inlineStackProbe() is called from common-code. Perhaps this relates to implementing shrink-wrapping or other things?
    • emitBlockAfter(), splitBlockBefore() copied from SystemZISelLowering. Make into SysteZInstrInfo members instead?
    • A little unsure about the use of unsigned vs uint64_t...
  • Dynamic allocas:
    • I took the X86 tests and copied them over as SystemZ tests and noticed that SystemZ gets these test cases built by SelectionDAGBuilder with dynamic_stackalloc nodes, while X86 seem to get these (constant) allocas merged into the stack frame. This is true also without this patch, but I am not sure why. In this case it seems even more preferred to avoid the dynamic_stackalloca nodes whenever possible...
    • With dynamic allocas, it seems wise to always probe no matter what the size, but the "tail" in emitProbedAlloca() is not probed. This seems flawed to me:

First of all, there could be multiple dynamic allocas in a function and if they all are less than the ProbeSize a huge span could be built up without any probing:

Tail1 Tail2 Tail3 Tail4   -->
|     |     |     |    |
          GGGGGGGGG

Then I am also worried about exiting the loop and allocating the remainder since only the topmost word in each allocated block is probed. If the guard page lies very close to that, and the remainder is relatively big, the bottom of the stack could end up way past the guard page:

Block1  Block2  Tail       -->
|P      |P      |      |
          GGGGGGGGG

P = Probe, G = Guard page

This looks bad to me, but I really don't know - is this perhaps considered harmless for some reason?

Diff Detail

Event Timeline

jonpa created this revision.Apr 23 2020, 8:04 AM

Not looking at the code so far, but just answering your questions:

GCC emits an 'lgr %r15,%r1' after the loop, which seems redundant, since it is known that %r15 has the value of %r1 already. Is this required to exist for some reason (omitted by patch for now)?

I agree that this looks redundant.

gcc seems not to be probing the residual allocation after the loop. However if only two (unrolled) allocations were made, the residual is also probed.

I'm not seeing this, do you have an example?

I am not aware of any real reason to not simply do the probing directly in emitPrologue(), but it seems wisest to do like X86 since inlineStackProbe() is called from common-code. Perhaps this relates to implementing shrink-wrapping or other things?

The comment you're adding says:

// stack probing may involve looping, and control flow generations is
// disallowed at this point. Rely to later processing through
// `inlineStackProbe`.

That would seem to be the reason why it cannot be done directly inline, right?

emitBlockAfter(), splitBlockBefore() copied from SystemZISelLowering. Make into SysteZInstrInfo members instead?

Makes sense to me.

A little unsure about the use of unsigned vs uint64_t...

Huh. I'd probably just use uint64_t everywhere.

I took the X86 tests and copied them over as SystemZ tests and noticed that SystemZ gets these test cases built by SelectionDAGBuilder with dynamic_stackalloc nodes, while X86 seem to get these (constant) allocas merged into the stack frame. This is true also without this patch, but I am not sure why. In this case it seems even more preferred to avoid the dynamic_stackalloca nodes whenever possible...

That's strange. Not sure why we should be different from X86 here. Can you investigate?

With dynamic allocas, it seems wise to always probe no matter what the size, but the "tail" in emitProbedAlloca() is not probed. This seems flawed to me:

I agree, we need to always probe for dynamic allocas as far as I can see. GCC does this as well.

As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

Herald added a project: Restricted Project. · View Herald TranscriptApr 28 2020, 8:48 AM
jonpa added a comment.May 1 2020, 12:47 AM

gcc seems not to be probing the residual allocation after the loop. However if only two (unrolled) allocations were made, the residual is also probed.

I'm not seeing this, do you have an example?

void large_stack() {
  volatile int stack[2000], i;
  for (i = 0; i < sizeof(stack) / sizeof(int); ++i)
    stack[i] = i;
}

With stack[2000], I see

aghi    %r15,-4096
cg      %r0,4088(%r15)
aghi    %r15,-4072
cg      %r0,4064(%r15)

With stack[8000], I don't see a probe after the loop...

I am not aware of any real reason to not simply do the probing directly in emitPrologue(), but it seems wisest to do like X86 since inlineStackProbe() is called from common-code. Perhaps this relates to implementing shrink-wrapping or other things?

The comment you're adding says:

// stack probing may involve looping, and control flow generations is
// disallowed at this point. Rely to later processing through
// `inlineStackProbe`.

That would seem to be the reason why it cannot be done directly inline, right?

Yes, but I can't see anything other than that comment that (at least currently) demands this - the only use for both emitPrologue() and inlineStackProbe() are right after each other in PEI::insertPrologEpilogCode().

I took the X86 tests and copied them over as SystemZ tests and noticed that SystemZ gets these test cases built by SelectionDAGBuilder with dynamic_stackalloc nodes, while X86 seem to get these (constant) allocas merged into the stack frame. This is true also without this patch, but I am not sure why. In this case it seems even more preferred to avoid the dynamic_stackalloca nodes whenever possible...

That's strange. Not sure why we should be different from X86 here. Can you investigate?

This was due to SystemZ having StackRealignable set to false, and the allocas of those test cases were aligned to 16. Removing the alignments in the tests fixed it (see FunctionLoweringInfo.cpp:143).

As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

I am a little unsure of how performance critical this is... It seems that the handling of inlineStackProbe with a compile-time constant stack size is an optimization of the general case of an unknown size. It seems to be rather important, since the loop is unrolled if only a few iterations are required.

I think we could emit a pseudo from emitProbedAlloca() and handle that in SystemZFrameLowering instead in the same place, if we would allow the case of a constant size be loaded into a register before the loop. The loop itself might also have one extra instruction or so, but the common case would be to unroll that loop...

Can R0D and R1D be assumed to be available in the prologue at this point? (how about anyreg-callconv?)

tstellar added a subscriber: tstellar.

gcc seems not to be probing the residual allocation after the loop. However if only two (unrolled) allocations were made, the residual is also probed.

I'm not seeing this, do you have an example?

void large_stack() {
  volatile int stack[2000], i;
  for (i = 0; i < sizeof(stack) / sizeof(int); ++i)
    stack[i] = i;
}

With stack[2000], I see

aghi    %r15,-4096
cg      %r0,4088(%r15)
aghi    %r15,-4072
cg      %r0,4064(%r15)

With stack[8000], I don't see a probe after the loop...

Well, I see it:

        lay     %r1,-28672(%r15)
.L5:
        lay     %r15,-4096(%r15)
        cg      %r0,4088(%r15)
        clgr    %r15,%r1
        jh      .L5
        lgr     %r15,%r1
        lay     %r15,-3496(%r15)
        cg      %r0,3488(%r15)

There's one cg in the loop, and this last cg for the tail. That's with GCC 9, not sure if that changed recently.

I am not aware of any real reason to not simply do the probing directly in emitPrologue(), but it seems wisest to do like X86 since inlineStackProbe() is called from common-code. Perhaps this relates to implementing shrink-wrapping or other things?

The comment you're adding says:

// stack probing may involve looping, and control flow generations is
// disallowed at this point. Rely to later processing through
// `inlineStackProbe`.

That would seem to be the reason why it cannot be done directly inline, right?

Yes, but I can't see anything other than that comment that (at least currently) demands this - the only use for both emitPrologue() and inlineStackProbe() are right after each other in PEI::insertPrologEpilogCode().

Hmm, not sure either. But either the comment is correct or not; if it is, we need this split, if it isn't, the comment should be removed :-) Can you ask who added that callback for Intel why they thought this was necessary?

As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

I am a little unsure of how performance critical this is... It seems that the handling of inlineStackProbe with a compile-time constant stack size is an optimization of the general case of an unknown size. It seems to be rather important, since the loop is unrolled if only a few iterations are required.

I think we could emit a pseudo from emitProbedAlloca() and handle that in SystemZFrameLowering instead in the same place, if we would allow the case of a constant size be loaded into a register before the loop. The loop itself might also have one extra instruction or so, but the common case would be to unroll that loop...

Hmm, there's probably enough differences that may make it not worth spending effort here, given that you've already implemented it now.

Can R0D and R1D be assumed to be available in the prologue at this point? (how about anyreg-callconv?)

Yes. See SystemZCallingConv.td:

// "All registers" as used by the AnyReg calling convention.
// Note that registers 0 and 1 are still defined as intra-call scratch
// registers that may be clobbered e.g. by PLT stubs.

With dynamic allocas, it seems wise to always probe no matter what the size, but the "tail" in emitProbedAlloca() is not probed. This seems flawed to me:

I just had a look to the X86 and indeed, it's a falw, I'll propose a patch.

As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

I'll explore that aspect too, thanks for the code review :-)

jonpa added a comment.May 5 2020, 4:24 AM

With dynamic allocas, it seems wise to always probe no matter what the size, but the "tail" in emitProbedAlloca() is not probed. This seems flawed to me:

I just had a look to the X86 and indeed, it's a falw, I'll propose a patch.

As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

I'll explore that aspect too, thanks for the code review :-)

Ah, thanks for clarifying this :-)

It would also be useful if you could explain the comment you added "stack probing may involve looping, and control flow generations is disallowed at this point. Rely to later processing through inlineStackProbe". Is this still true? It probably is, I just can't see why this is needed since emitPrologue() and inlineStackProbe() are run nearly right after eachother in PEI::insertPrologEpilogCode().

jonpa added a comment.May 5 2020, 7:18 AM

@serge-sans-paille: Last question was for you, which I forgot to write...

@serge-sans-paille: Last question was for you, which I forgot to write...

Yeah, I was busy with another bug today, I'll probably take the time to review that tonight / tomorrow. Thanks for the heads up!

As a general note, the code duplication between inlineStackProbe and emitProbedAlloca is a bit unfortunate, not sure if there's a better way here.

I'll explore that aspect too, thanks for the code review :-)

At first glance, the code is indeed very similar, to the exception of the handling of the tail (no probing needed for the tail in inlineStackProbe, but one is required for emitProbedAlloca) and the fact that the Size of the alloca is a constant in one case, and not in the other case. But it looks like some extra argument could do the trick.

It would also be useful if you could explain the comment you added "stack probing may involve looping, and control flow generations is disallowed at this point. Rely to later processing through inlineStackProbe". Is this still true? It probably is, I just can't see why this is needed since emitPrologue() and inlineStackProbe() are run nearly right after eachother in PEI::insertPrologEpilogCode().

It's clearer if I just state that we're reusing the existing probe infrastructure. I'm not a MachineInstruction expert and don't recall why I wrote this :-/

It would also be useful if you could explain the comment you added "stack probing may involve looping, and control flow generations is disallowed at this point. Rely to later processing through inlineStackProbe". Is this still true? It probably is, I just can't see why this is needed since emitPrologue() and inlineStackProbe() are run nearly right after eachother in PEI::insertPrologEpilogCode().

It's clearer if I just state that we're reusing the existing probe infrastructure. I'm not a MachineInstruction expert and don't recall why I wrote this :-/

Ah, I see. I suppose then that it is sort of "voluntary" for a target to take the trouble of using the stub with metadata, depending on if it would simplify things? The original comment by Andy Ayers in his commit message from 2015 ( 809cbe9) formulates this in a weaker way: "to avoid complications". I am not sure exactly what complications that might arise (in the case where an MBB is both a SaveBlock and RestoreBlock the particular way of splitting may get more complicated, perhaps?), but perhaps it would avoid confusion if you changed your comment to use Andys wording instead...

@AndyAyers : Do you recall your reasons from back then?

jonpa added a comment.May 6 2020, 1:30 AM

There's one cg in the loop, and this last cg for the tail. That's with GCC 9, not sure if that changed recently.

huh - I also see it with gcc 9.2.1, but not with a more recent version (20200425)...

jonpa updated this revision to Diff 262670.May 7 2020, 9:05 AM

Backend part updated.

  • Added probing of the tail of dyn-alloca, with an extra check for zero tail, roughly like GCC. It seems it can be assumed that the alloca size will always be a multiple of 8 bytes, or? I think that is necessary, or there might be final probe partially below SP (if tail is e.g. 4 bytes). Also, if the stack-probe size is 0, an infinite loop would result (assert?), but I suppose that would always be noticeable.
  • (re)compute live-ins for the new blocks created in inlineStackProbe() (it seems this is not needed for the probed alloca case). The generated code from inlineStackProbe() does roughly the same as GCC. We could use a brct for the loop instead, or we could try to keep it at 12 or more instructions using multiple exits (like "forced" unrolling), but I suppose maybe the unrolling is the common case, so the loop isn't that important?
  • Tests: Not sure if we need all the X86 test cases - for instance stack-clash-medium-natural-probes.ll seems to generate nearly the same code as stack-clash-medium.ll. stack-clash-unknown-call.ll: Changed the called function from memset to something not known, since the point of the test case seemed to be to have a call in the function, which however did not work at first since an XC loop resulted.
  • Moved the utility functions to SystemZInstrInfo.cpp as agreed before.

Huh. I'd probably just use uint64_t everywhere.

I think what confused me a bit was that MCCFIInstruction::createDefCfaOffset() takes an 'int' as argument, but I suppose it's not the end of the world if the CFA offset is possibly broken if the stack frame size ever became greater than -INT32_MIN...?

It would also be useful if you could explain the comment you added "stack probing may involve looping, and control flow generations is disallowed at this point. Rely to later processing through inlineStackProbe". Is this still true? It probably is, I just can't see why this is needed since emitPrologue() and inlineStackProbe() are run nearly right after eachother in PEI::insertPrologEpilogCode().

It's clearer if I just state that we're reusing the existing probe infrastructure. I'm not a MachineInstruction expert and don't recall why I wrote this :-/

Ah, I see. I suppose then that it is sort of "voluntary" for a target to take the trouble of using the stub with metadata, depending on if it would simplify things? The original comment by Andy Ayers in his commit message from 2015 ( 809cbe9) formulates this in a weaker way: "to avoid complications". I am not sure exactly what complications that might arise (in the case where an MBB is both a SaveBlock and RestoreBlock the particular way of splitting may get more complicated, perhaps?), but perhaps it would avoid confusion if you changed your comment to use Andys wording instead...

@AndyAyers : Do you recall your reasons from back then?

Unfortunately, no. I did a quick look for notes I might have made at the time and didn't find any.

Added probing of the tail of dyn-alloca, with an extra check for zero tail, roughly like GCC. It seems it can be assumed that the alloca size will always be a multiple of 8 bytes, or? I think that is necessary, or there might be final probe partially below SP (if tail is e.g. 4 bytes). Also, if the stack-probe size is 0, an infinite loop would result (assert?), but I suppose that would always be noticeable.

The size argument to DYNAMIC_STACKALLOC is indeed always a multiple of 8 (as documented in ISDOpcodes.h).

As to the stack probe size, I think must indeed ensure that this size is also a multiple of 8 bytes (the stack alignment requirement) and is nonzero. If the given value doesn't fulfil those requirements, I guess we should round it down to the stack alignment requirement, and if the result is zero, use the stack alginment requirement instead.

(re)compute live-ins for the new blocks created in inlineStackProbe() (it seems this is not needed for the probed alloca case). The generated code from inlineStackProbe() does roughly the same as GCC. We could use a brct for the loop instead, or we could try to keep it at 12 or more instructions using multiple exits (like "forced" unrolling), but I suppose maybe the unrolling is the common case, so the loop isn't that important?

I think just having a plain loop is OK for now. For any future optimizations, we'd have to do performance evaluation first.

Tests: Not sure if we need all the X86 test cases - for instance stack-clash-medium-natural-probes.ll seems to generate nearly the same code as stack-clash-medium.ll.

It seems this was added in preparation for exploiting "natural probes" -- if the code can be proven to already touch all (or at least some) stack pages at least once (in this test case via the stores), then we might optimize out some of the extra probes. But apparently this wasn't implemented in the X86 back-end in the end, and neither is it in your patch, so it doesn't make sense to have a test for it.

stack-clash-unknown-call.ll: Changed the called function from memset to something not known, since the point of the test case seemed to be to have a call in the function, which however did not work at first since an XC loop resulted.

I believe this is also related to natural probes: on *Intel*, every call instruction would be a natural probe as it pushes the return address to the stack. But on Z, the call instruction doesn't touch the stack (the called function might, but it also might not, so we cannot really rely on it).

Moved the utility functions to SystemZInstrInfo.cpp as agreed before.

OK.

Huh. I'd probably just use uint64_t everywhere.

I think what confused me a bit was that MCCFIInstruction::createDefCfaOffset() takes an 'int' as argument, but I suppose it's not the end of the world if the CFA offset is possibly broken if the stack frame size ever became greater than -INT32_MIN...?

Hmm. In theory, those DWARF offset values are encoded as ULEB128, so there's no reason to constrain them to an "int". If this ever becomes an issue, we should update MCCFIInstruction to use int64_t instead. For now, that's probably not a real concern; I'm sure there's other code that doesn't handle stack sizes > 2GB correctly. (E.g. do we even update the stack pointer correctly in this case?)

jonpa updated this revision to Diff 263129.May 11 2020, 2:53 AM

As to the stack probe size, I think must indeed ensure that this size is also a multiple of 8 bytes (the stack alignment requirement) and is nonzero. If the given value doesn't fulfil those requirements, I guess we should round it down to the stack alignment requirement, and if the result is zero, use the stack alginment requirement instead.

I implemented this in getStackProbeSize() with a few new tests.

I think just having a plain loop is OK for now. For any future optimizations, we'd have to do performance evaluation first.

I built and ran SPEC'17 once and it looked like there were no sign of regressions, at most a single percent...

It seems this was added in preparation for exploiting "natural probes" -- if the code can be proven to already touch all (or at least some) stack pages at least once (in this test case via the stores), then we might optimize out some of the extra probes. But apparently this wasn't implemented in the X86 back-end in the end, and neither is it in your patch, so it doesn't make sense to have a test for it.

Ah, now I get it. Removed those tests.

I believe this is also related to natural probes: on *Intel*, every call instruction would be a natural probe as it pushes the return address to the stack. But on Z, the call instruction doesn't touch the stack (the called function might, but it also might not, so we cannot really rely on it).

removed

Merged remaining X86 "small/medium/large" into stack-clash-protection.ll.

It seems this was added in preparation for exploiting "natural probes" -- if the code can be proven to already touch all (or at least some) stack pages at least once (in this test case via the stores), then we might optimize out some of the extra probes. But apparently this wasn't implemented in the X86 back-end in the end, and neither is it in your patch, so it doesn't make sense to have a test for it.

Correct!

I believe this is also related to natural probes: on *Intel*, every call instruction would be a natural probe as it pushes the return address to the stack. But on Z, the call instruction doesn't touch the stack (the called function might, but it also might not, so we cannot really rely on it).

Confirmed!

jonpa added a comment.May 14 2020, 3:45 AM

I experimented with inserting the probing directly in emitPrologue() instead of building the call with metadata, but then came back to the case with a single block function. Looking closer at this, it seems that this could be the motivation behind using the call/metadata stub. Splitting the Prologue/Epilogue block to insert the loop would cause the RestoreBlocks set of PEI to be incorrect, since the epilogue block has changed from MBB to DoneMBB (block after the loop / second half of original MBB).

This could theoretically be fixed by rearranging things so that MBB ended up to have the same instructions and the same place as DoneMBB now gets, but that seems impractical and besides that would break the SaveBlocks set of PEI... (I think perhaps a comment explaining this would be nice.)

There is an assert for the metadata attached to the call instruction, which kind of guards against users defining a function with the same name. Normally, I would have expected a pseudo MachineInstruction to be built here with an immediate operand for the size requirement...

Normally, I would have expected a pseudo MachineInstruction to be built here with an immediate operand for the size requirement...

That sounds very nice. If you go that way I'll happily update the X86 code accordingly. Or If you point me to some example / doc / code on pseudo MachineInstruction, I can implement that too.

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
671

Maybe we should syndicate that unroll parameter somewhere across architectures?

709

On X86, we don't need to probe that final allocation because the two ways the stack could grow after the final alloca are

  1. A function call, and in that case we get a free probe when we make a function call
  2. A PROBED_ALLOCA, and in that case we get a probe at Residual + PAGE_SIZE, which is right into the Page Guard.

I assume that's different for SystemZ?

jonpa marked 4 inline comments as done.May 18 2020, 2:31 AM
jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
671

I am not sure I see the benefit of that since this entire method is already defined by the target.

709

There's not always a free probe with a function call on SystemZ (only if stack space is allocated by the called function), but regardless of that I think the residual is needed. Since the probe is done on the topmost byte of each allocated block on SystemZ, the guard page could fit within the last allocated full block and a residual of just 8 or more bytes (probing the high byte is what GCC is doing on SystemZ).

I still wonder if that would not be necessary also on X86 in case "2": Given that you on X86 probe the lower byte of each full block, you could get a residual into the guard page, and then if the dynamic alloca started with a full block, the next probe would not be into the guard page, but past it:

|     |     |GGGGG|     |
     |    P|   R|    P|

Maybe I am missing something?

jonpa marked 2 inline comments as done.May 18 2020, 2:37 AM

Normally, I would have expected a pseudo MachineInstruction to be built here with an immediate operand for the size requirement...

That sounds very nice. If you go that way I'll happily update the X86 code accordingly. Or If you point me to some example / doc / code on pseudo MachineInstruction, I can implement that too.

@uweigand : Would you rather use a target pseudo instruction for this rather than using a call with metadata?

Normally, I would have expected a pseudo MachineInstruction to be built here with an immediate operand for the size requirement...

That sounds very nice. If you go that way I'll happily update the X86 code accordingly. Or If you point me to some example / doc / code on pseudo MachineInstruction, I can implement that too.

@uweigand : Would you rather use a target pseudo instruction for this rather than using a call with metadata?

Using a pseudo is probably a cleaner solution, yes.

jonpa updated this revision to Diff 264643.May 18 2020, 8:49 AM

Use a pseudo instead of call with metadata.

jonpa added a comment.May 22 2020, 1:42 AM

@serge-sans-paille:

That sounds very nice. If you go that way I'll happily update the X86 code accordingly. Or If you point me to some example / doc / code on pseudo MachineInstruction, I can implement that too.

This patch has now been changed to use a pseudo instead, so you can see how that works here...

Also, I wonder what you think about my previous question to you about the probing on X86 ..?

lkail added a subscriber: lkail.May 25 2020, 5:47 AM
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
709

I took my pen and paper, and you're definitively right. Fortunately, there's the possibility of always probing the upper byte when doing a dynamic alloca, so that we always avoid this extra probe. That way the common case (alloc_size < PAGE_SIZE) remains costless. Correct?

jonpa marked an inline comment as done.May 27 2020, 12:57 AM
jonpa added inline comments.
llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
709

IIUC, the topmost byte of each allocated stack range is always naturally probed before the entry of the prologue by the push of the return address by the call. The lowermost byte is then probed in each full block, but the residual needs no probing.

In order get a matching behavior with dynamic allocas, the topmost byte should always be probed, as well as the lowermost byte of each full block.

So yes, I think that would be correct as long as there are no other allocations of stack space anywhere...

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
625

could be:

auto Where = llvm::find_if(PrologMBB, [](MachineInstr& MI) { return MI.getOpcode() == SystemZ::STACKALLOC_W_PROBING;});
if(Where = PrologMBB.end()) return;
MachineInstr &StackAllocMI = *Where;
jonpa updated this revision to Diff 268419.Jun 4 2020, 4:31 AM
  • Patch rebased (using cfiDefCfaOffset() instead of createDefCfaOffset()).
  • Check for a free probe (STMG) in prologue in which case probing is not done when the space between that STMG and the new stack pointer is less than the probe size. This saves the vast majority of the probing (removes 95% of the CG instructions and 90% of the number of files changed).
uweigand accepted this revision.Jun 5 2020, 8:26 AM

A few cosmetic issues mentioned inline, otherwise this now looks good to me. Thanks!

llvm/lib/Target/SystemZ/SystemZFrameLowering.cpp
409

I believe this will now fit onto a single line.

llvm/lib/Target/SystemZ/SystemZISelLowering.cpp
6860

The RHS can be simplified to ~(StackAlign - 1)

llvm/lib/Target/SystemZ/SystemZInstrInfo.td
38

Doesn't this also need Defs and Uses to be fully correct?

39

For consistency with the PROBED_ALLOCA name, maybe this should be called PROBED_STACKALLOC?

This revision is now accepted and ready to land.Jun 5 2020, 8:26 AM
jonpa marked 5 inline comments as done.Jun 6 2020, 9:42 AM

Thanks for review - committed as 515bfc6 after last updates.

llvm/lib/Target/SystemZ/SystemZInstrInfo.td
38

Ah, yes I suppose that might as well be there... I added the defs and uses and also the side-effects flag since it may expand into a loop.

This revision was automatically updated to reflect the committed changes.
jonpa marked an inline comment as done.
Herald added a project: Restricted Project. · View Herald TranscriptJun 6 2020, 10:08 AM
Herald added a subscriber: cfe-commits. · View Herald Transcript