Page MenuHomePhabricator

AArch64: Fix emergency spillslot being out of reach for large callframes
ClosedPublic

Authored by MatzeB on Dec 5 2017, 6:11 PM.

Details

Summary

Large callframes (calls with several hundreds or thousands or
parameters) could lead to situations in which the emergency spillslot is
out of range to be addressed relative to the stack pointer.
This commit forces the use of a frame pointer in the presence of large
callframes.

This commit does several things:

  • Compute max callframe size at the end of instruction selection.
  • Add mirFileLoaded target callback. Use it to compute the max callframe size after loading a .mir file when the size wasn't specified in the file.
  • Let TargetFrameLowering::hasFP() return true if there exists a callframe > 255 bytes.
  • Always place the emergency spillslot close to FP if we have a frame pointer.
  • Note that useFPForScavengingIndex() would previously return false when a base pointer was available leading to the emergency spillslot getting allocated late (that's the whole effect of this callback). Which made no sense to me so I took this case out: Even though the emergency spillslot is technically not referenced by FP in this case we still want it allocated early.

I'm still in the process of coming up with a reliable and good testcase. As testcases with hundred of arguments are unwieldy I will probably end up placing a correctness test into the llvm test-suite for this.

Diff Detail

Repository
rL LLVM

Event Timeline

MatzeB created this revision.Dec 5 2017, 6:11 PM
MatzeB added inline comments.Dec 5 2017, 6:13 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
1269 ↗(On Diff #125663)

Please ignore this line: I accidentally didn't take out this debug helper.

As testcases with hundred of arguments are unwieldy

You might be able to get a similar effect with "define void @foo([1000 x i32] %x) {" or something like that. (I'm not sure when exactly we need the emergency spill slot on aarch64, so I'm not sure that helps.)

I'm still in the process of coming up with a reliable and good testcase. As testcases with hundred of arguments are unwieldy I will probably end up placing a correctness test into the llvm test-suite for this.

FWIW, a while ago, I added the following test to the test-suite to test a whole range of different frame layout aspects. Maybe that test could be extended for the case of a large frame somehow?
Probably, the place to look into extending this test would be to add a new template parameter to function check_frame_variant.
http://llvm.org/viewvc/llvm-project/test-suite/trunk/MultiSource/UnitTests/C%2B%2B11/frame_layout/frame_layout.cpp?view=markup

MatzeB added a comment.Dec 6 2017, 5:06 PM

So this works as a testcase, however I wouldn't want to check this in either; as no matter what I do, for huge callframes we always produce a huge number of stores in selectiondag (or fastisel) that take a long time to create and optimize away even when the argument is simply undef...

target triple = "arm64--"

declare void @extfunc([4096 x i64] %p)

define void @func() {
  %lvar = alloca i8

  %v00 = load volatile i8, i8* %lvar
  %v01 = load volatile i8, i8* %lvar
  %v02 = load volatile i8, i8* %lvar
  %v03 = load volatile i8, i8* %lvar
  %v04 = load volatile i8, i8* %lvar
  %v05 = load volatile i8, i8* %lvar
  %v06 = load volatile i8, i8* %lvar
  %v07 = load volatile i8, i8* %lvar
  %v08 = load volatile i8, i8* %lvar
  %v09 = load volatile i8, i8* %lvar
  %v10 = load volatile i8, i8* %lvar
  %v11 = load volatile i8, i8* %lvar
  %v12 = load volatile i8, i8* %lvar
  %v13 = load volatile i8, i8* %lvar
  %v14 = load volatile i8, i8* %lvar
  %v15 = load volatile i8, i8* %lvar
  %v16 = load volatile i8, i8* %lvar
  %v17 = load volatile i8, i8* %lvar
  %v18 = load volatile i8, i8* %lvar
  %v19 = load volatile i8, i8* %lvar
  %v20 = load volatile i8, i8* %lvar
  %v21 = load volatile i8, i8* %lvar
  %v22 = load volatile i8, i8* %lvar
  %v23 = load volatile i8, i8* %lvar
  %v24 = load volatile i8, i8* %lvar
  %v25 = load volatile i8, i8* %lvar
  %v26 = load volatile i8, i8* %lvar
  %v27 = load volatile i8, i8* %lvar
  %v28 = load volatile i8, i8* %lvar
  %v29 = load volatile i8, i8* %lvar
  %v30 = load volatile i8, i8* %lvar
  %v31 = load volatile i8, i8* %lvar

  store volatile i8 %v00, i8* %lvar
  store volatile i8 %v01, i8* %lvar
  store volatile i8 %v02, i8* %lvar
  store volatile i8 %v03, i8* %lvar
  store volatile i8 %v04, i8* %lvar
  store volatile i8 %v05, i8* %lvar
  store volatile i8 %v06, i8* %lvar
  store volatile i8 %v07, i8* %lvar
  store volatile i8 %v08, i8* %lvar
  store volatile i8 %v09, i8* %lvar
  store volatile i8 %v10, i8* %lvar
  store volatile i8 %v11, i8* %lvar
  store volatile i8 %v12, i8* %lvar
  store volatile i8 %v13, i8* %lvar
  store volatile i8 %v14, i8* %lvar
  store volatile i8 %v15, i8* %lvar
  store volatile i8 %v16, i8* %lvar
  store volatile i8 %v17, i8* %lvar
  store volatile i8 %v18, i8* %lvar
  store volatile i8 %v19, i8* %lvar
  store volatile i8 %v20, i8* %lvar
  store volatile i8 %v21, i8* %lvar
  store volatile i8 %v22, i8* %lvar
  store volatile i8 %v23, i8* %lvar
  store volatile i8 %v24, i8* %lvar
  store volatile i8 %v25, i8* %lvar
  store volatile i8 %v26, i8* %lvar
  store volatile i8 %v27, i8* %lvar
  store volatile i8 %v28, i8* %lvar
  store volatile i8 %v29, i8* %lvar
  store volatile i8 %v30, i8* %lvar
  store volatile i8 %v31, i8* %lvar

  call void @extfunc([4096 x i64] undef)
  ret void
}

(this takes an embarrassing 50 seconds to run on my machine)

I can reproduce the crash with the following:

target triple = "arm64--"
declare void @extfunc([4096 x i64]* byval %p)
define void @func([4096 x i64]* %z) {
  %lvar = alloca [31 x i8]
  %v00 = load volatile [31 x i8], [31 x i8]* %lvar
  store volatile [31 x i8] %v00, [31 x i8]* %lvar
  call void @extfunc([4096 x i64]* byval %z)
  ret void
}
MatzeB added a comment.Dec 6 2017, 5:25 PM

I can reproduce the crash with the following:

Oh indeed the byval variant seems to compile fast, thanks!

MatzeB updated this revision to Diff 125865.Dec 6 2017, 5:52 PM

Added testcase

One thing I noticed when looking at this: it seems fragile to me to have MachineFrameInfo::computeMaxCallFrameSize() and PEI::calculateFrameInfo() doing essentially the same calculation with duplicated code.
Would it make sense to add an assert to PEI::calculateFrameInfo that checks that the previously calculated MaxCallFrameSize isn't smaller than the one calculated later?

lib/Target/AArch64/AArch64FrameLowering.cpp
207 ↗(On Diff #125865)

Could you get rid of the magic 255 constant here?

One thing I noticed when looking at this: it seems fragile to me to have MachineFrameInfo::computeMaxCallFrameSize() and PEI::calculateFrameInfo() doing essentially the same calculation with duplicated code.
Would it make sense to add an assert to PEI::calculateFrameInfo that checks that the previously calculated MaxCallFrameSize isn't smaller than the one calculated later?

After thinking about this some more, adding this assert doesn't make much sense, since e.g. spilling will increase the MaxCallFrameSize. Speaking of which, don't we still have a problem if e.g. spilling puts us over the large frame size threshold?

One thing I noticed when looking at this: it seems fragile to me to have MachineFrameInfo::computeMaxCallFrameSize() and PEI::calculateFrameInfo() doing essentially the same calculation with duplicated code.
Would it make sense to add an assert to PEI::calculateFrameInfo that checks that the previously calculated MaxCallFrameSize isn't smaller than the one calculated later?

Indeed, that's why I added this to calculateCallFrameInfo() a while back:

assert(!MFI.isMaxCallFrameSizeComputed() ||
       (MFI.getMaxCallFrameSize() == MaxCallFrameSize &&
        MFI.adjustsStack() == AdjustsStack));

One thing I noticed when looking at this: it seems fragile to me to have MachineFrameInfo::computeMaxCallFrameSize() and PEI::calculateFrameInfo() doing essentially the same calculation with duplicated code.
Would it make sense to add an assert to PEI::calculateFrameInfo that checks that the previously calculated MaxCallFrameSize isn't smaller than the one calculated later?

After thinking about this some more, adding this assert doesn't make much sense, since e.g. spilling will increase the MaxCallFrameSize. Speaking of which, don't we still have a problem if e.g. spilling puts us over the large frame size threshold?

Spilling is no problem as the emergency spillslot is placed last in the stackframe (when doing SP relative access). The problem with callframe is just that they have to come after the stackframe. With this change the problematice cases are switched to use a frame pointer which means we will move the emergency spillslot to the beginning of the stackframe where it will always be in reach for the FP.

MatzeB updated this revision to Diff 127417.Dec 18 2017, 3:12 PM
  • Factor out magic number into a global static constant.
MatzeB updated this revision to Diff 127419.Dec 18 2017, 3:15 PM

And add a comment of why the DefaultSafeSPDisplacement is good enough in this case and we don't care about vector loads/stores supporting no offset at all.

gberry added a comment.Jan 3 2018, 8:53 AM

I think what is bothering me about this change is that the return value of hasFP() now seems more dynamic. Did you consider a potentially simpler fix of just creating the spill slot as a fixed stack object with a hard-coded offset that would guarantee it is directly addressable from the SP/FP?

MatzeB added a comment.Jan 3 2018, 8:56 AM

I think what is bothering me about this change is that the return value of hasFP() now seems more dynamic. Did you consider a potentially simpler fix of just creating the spill slot as a fixed stack object with a hard-coded offset that would guarantee it is directly addressable from the SP/FP?

I do create a spill slot that is easily accessible from FP, that's the whole point of this patch. Unfortunately I have to force FP usage in these cases, hence the hasFP changes.

I don't see any way to this SP relative, since at the point of the call SP has to point to the callframe, no way around that. And the callframe is so large in this situation that we cannot reach the spillslot before it by a simple immediate.

Okay, I get it now, I had the wrong case in mind. I was thinking that you could put the scavenge slot close to [sp], but that doesn't work since you have a large outgoing stack parameter area.
This change looks good to me, but you might want to get a second opinion since my thinking on this hasn't been that clear.

Potential alternative approach: could we allocate the emergency spill slot in the red zone?

lib/Target/AArch64/AArch64FrameLowering.cpp
213 ↗(On Diff #127419)

isMaxCallFrameSizeComputed() is false until after register allocation, right? That means register allocation will never allocate a value into the frame pointer register. This needs better comments, and I'm not sure we want to unconditionally reserve the frame pointer register just to fix an obscure bug with very large call frames.

gberry added a comment.Jan 3 2018, 1:42 PM

I'm not sure using the redzone is safe on all targets.

lib/Target/AArch64/AArch64FrameLowering.cpp
213 ↗(On Diff #127419)

Part of this change is to call computeMaxCallFrameSize() from finalizeLowering() (called at the end of ISel), so I don't think this is a problem?

efriedma added inline comments.Jan 3 2018, 1:50 PM
lib/Target/AArch64/AArch64FrameLowering.cpp
213 ↗(On Diff #127419)

Oh, sorry, missed that part. That's much less scary, then. The only thing I can think of it might impact is inline asm. Do we have any testcases with inline asm using the frame pointer as an operand or result?

MatzeB added inline comments.Jan 8 2018, 9:49 AM
lib/Target/AArch64/AArch64FrameLowering.cpp
213 ↗(On Diff #127419)

I also added an early computeMaxCallFrameSize() to the ARM target a while ago and so far it is working fine.

As for inline assembly:

  • Note that this only changes behavior in code with very unusual calls having a huge number of parameters.
  • I would consider omitting the frame pointer an optimization that you cannot/should not rely on as there are various factors outside the authors control that could influence whether the compiler is able to perform it.
  • There is no inline asm constraint letter defined for FP as far as I can see (so you only could rely on it indirectly by having so many register arguments to inline assembly that you need to use the FP).

There is no inline asm constraint letter defined for FP as far as I can see (so you only could rely on it indirectly by having so many register arguments to inline assembly that you need to use the FP).

You can constrain an asm operand to a specific register using a local register variable. Granted, the behavior of that is kind of confusing even without this patch, so maybe it doesn't matter.

efriedma accepted this revision.Jan 8 2018, 4:03 PM

LGTM

Inline asm explicitly specifying fp probably works as well as it ever did (which is not very well, given you can very easily generate bad assembly), and I can't see any other issues, especially if ARM is already doing something similar with computeMaxCallFrameSize.

This revision is now accepted and ready to land.Jan 8 2018, 4:03 PM
This revision was automatically updated to reflect the committed changes.