This is an archive of the discontinued LLVM Phabricator instance.

Support for emitting inline stack probes
ClosedPublic

Authored by AndyAyers on Oct 1 2015, 12:49 PM.

Details

Summary

For CoreCLR on Windows, stack probes must be emitted as inline sequences that probe successive stack pages between the current stack limit and the desired new stack pointer location. This implements support for the inline expansion on x64.

For in-body alloca probes, expansion is done during instruction lowering. For prolog probes, a stub call is initially emitted during prolog creation, and expanded after epilog generation, to avoid complications that arise when introducing new machine basic blocks during prolog and epilog creation.

Added a new test case, modified an existing one to exclude non-x64 coreclr (for now).

Diff Detail

Repository
rL LLVM

Event Timeline

AndyAyers updated this revision to Diff 36282.Oct 1 2015, 12:49 PM
AndyAyers retitled this revision from to Support for emitting inline stack probes.
AndyAyers updated this object.
AndyAyers added reviewers: rnk, majnemer.
AndyAyers set the repository for this revision to rL LLVM.
AndyAyers added a subscriber: llvm-commits.
AndyAyers updated this revision to Diff 36640.Oct 6 2015, 11:25 AM

Same changes as before, but updated to have more context around the diffs.

rnk edited edge metadata.Oct 6 2015, 11:44 AM

I think it'd be cleaner to avoid changing the target independent TargetFrameLowering interface, and instead keep it as an X86-internal pseudo instruction that we lower in X86ExpandPseudo.cpp. Then PEI doesn't have to worry about this MBB expansion issue. What do you think?

I think the WIN_ALLOCA pseudo instruction can probably just take on that new role, and we could rename it to something more accurate like CHKSTK.

asl added a subscriber: asl.Oct 6 2015, 12:09 PM
In D13361#260924, @rnk wrote:

I think it'd be cleaner to avoid changing the target independent TargetFrameLowering interface, and instead keep it as an X86-internal pseudo instruction that we lower in X86ExpandPseudo.cpp. Then PEI doesn't have to worry about this MBB expansion issue. What do you think?

+1

Using a pseduo for the prolog case sounds like a good suggestion.

I'd like to preserve the ability to generate the in-body expansions early, before register allocation, and share the expansion code between the two cases like is done now. \\

If I go this route, where should the expansion code live?

rnk added a comment.Oct 6 2015, 1:16 PM

Using a pseduo for the prolog case sounds like a good suggestion.

I'd like to preserve the ability to generate the in-body expansions early, before register allocation, and share the expansion code between the two cases like is done now. \\

If I go this route, where should the expansion code live?

I guess keeping it in X86FrameLowering would be fine. Seems better than dumping it into the 27K line X86ISelLowering. =/

lib/Target/X86/X86FrameLowering.cpp
553–556

Are we sure we can have pushes without CFI to describe them in the prologue on Win64? If a user or profiler stops after this PUSH, they won't be able to unwind the stack. You could get away with homing RDX / RCX into the shadow space and then reloading afterwards.

Our existing use of push %rax in the prologue code is bad, but it shouldn't normally happen on Win64, since the standard calling convention doesn't pass data in via %rax.

Let me rework this a bit and get back to you.

lib/Target/X86/X86FrameLowering.cpp
553–556

Good catch -- the pushes/pops were just intended to be bring-up scaffolding.

Pseudo expansion isn't set up to handle introduction of control flow. I can generalize the iterator patterns to handle this, if you still think this is the way to go.

The inline prolog expansion isn't that different from the existing segmented stack or HiPE prolog munging, so I wonder if it makes more sense to keep them together like I had it, even if it means one more virtual in TargetFrameLowering.h. We'll eventually implement this for other targets as we add support for them.

AndyAyers updated this revision to Diff 39141.Nov 4 2015, 8:31 AM
AndyAyers edited edge metadata.

Save/restore RCX/RDX from the caller-supplied parameter save area.

Looked into expanding this later but the introduction of control flow by this expansion would require bigger changes there. Thoughs...?

rnk added inline comments.Nov 4 2015, 9:49 AM
lib/Target/X86/X86FrameLowering.cpp
458

This can be simplified to:

for (MachineInstr &MI : PrologueMBB) {
  ...

You can recover iterators from MachineInstrs with the getIterator() method.

460

I don't think you need this loop, all the x86 CALL instructions appear to only take one operand. This should be fine, assuming it passes tests:

if (MI.isCall() && MI.getOperand(0).isSymbol() &&
   ChkStubStubSymbol == MI.getOperand(0).getSymbolName()) {
  ChkStkStub = &MI;
  break;
}
472

I think you want eraseFromParent() here to avoid a leak.

503

Is this check correct for a downwards growing stack? LimitReg here is StackLimit in NT_TIB, which is the minimum allowed value for this thread's SP, right? I think we want to enter the loop if FinalReg >= LimitReg.

What are we supposed to do anyway if we attempt to probe beyond StackLimit? I would expect normal code that doesn't probe will crash when accessing beyond StackLimit, so why can't we skip this check and crash somewhere in the loop?

520

Duncan has been working to remove the implicit iterator conversions around iplist elements, so the new idiom for this is MMBIter = MBB->getIterator().

568

There's a utility called addRegOffset() for forming basereg+displacement memory operands that don't use a segment register, index register, or scale. In this case it would probably look like:

addRegOffset(BuildMI(&MBB, DL, TII.get(X86::MOV64mr)),
             X86::RSP, false, RCXShadowSlot)
  .addReg(X86::RCX);

You can also use this idiom below to shorten the other memory accesses and LEAs. If you think it makes things less clear, then I don't feel too strongly about using it everywhere.

636–637

MSVC CRT and mingw CRT both use TEST for this purpose. Is there a reason why the CLR wants to do a write here instead of a read?

685–692

These middle loops can both be range-based.

test/CodeGen/X86/win_coreclr_chkstk.ll
7

The rest of these tests are looking for the presence or absence of the inlined stack probe. We should have one or two tests that pattern match the whole loop. It'll be brittle to future changes to the probe, but we should only do it once or twice.

103

It's good to bracket negative checks with something that will match closely to them, so that they don't match something from previous code, something like:

; WIN_X64: test_probe_size:
; WIN_X64-NOT: mov %gs:16, %rcx
; WIN_X64: retq

Thanks, I'll do some cleanup and post a revision.

Answered (hopefully) a few of your other questions inline.

lib/Target/X86/X86FrameLowering.cpp
503

The StackLimit field of the TIB has a confusing name. It points at the lowest touched page of the stack -- not the lower limit on how far the stack can grow before the OS raises stack overflow.

The "true" lower stack limit is found in the DeallocationStack field in the TEB.

The idea here is to only touch pages that the OS hasn't seen this thread use before, rather than just touch all the pages in the range between the current SP and the desired new SP. So we only need to probe if the new SP lies beyond (less than, since the stack is growing down) StackLimit.

636–637

In the Windows and VC CRT sources for x64, we always seem to do a write. Probably just paranoia to make sure the OS flips the page protection to R/W instead of just R?

AndyAyers updated this revision to Diff 39707.Nov 9 2015, 8:57 AM

Updated per review feedback.

Range-based for, no implicit iterators, addRegOffset, more stringent checking in test case

rnk accepted this revision.Nov 9 2015, 2:50 PM
rnk edited edge metadata.

Nice, looks good.

This revision is now accepted and ready to land.Nov 9 2015, 2:50 PM
JosephTremoulet edited edge metadata.

Committed in rL252578