This is an archive of the discontinued LLVM Phabricator instance.

[X86] Improve lowering of the unrolled inline-asm probing
Needs ReviewPublic

Authored by nagisa on Mar 18 2021, 4:55 PM.

Details

Summary

This new implementation emits instructions such as these:

movb $0, -4096(%rsp)

which is both faster and smaller than pairs of

sub $4096, %rsp
movq $0, (%rsp)

This implementation also trivially preserves the preciseness of the
uwtables during the preamble by not modifying the stack pointer in the
first place.

Testing the generated code for stacks of 0x4000 bytes (4 probes) llvm-mca reports
(over 100 iterations):

test casemcpucyclesIPCRThroughput
oldznver26031.662.5
newznver22042.941.5
oldskylake6031.664.0
newskylake4031.494.0
oldbdver16031.666.0
newbdver14031.494.0
oldhaswell6031.664.0
newhaswell4031.494.0

So overall in terms of throughput its either the same or
an improvement.

Depends on D98909

Diff Detail

Event Timeline

nagisa created this revision.Mar 18 2021, 4:55 PM
nagisa updated this revision to Diff 331724.Mar 18 2021, 5:03 PM

update tests to not scrub away stack pointers

nagisa updated this revision to Diff 331731.Mar 18 2021, 5:27 PM

rebase on top of tests in D98909

nagisa retitled this revision from [x86] Improve lowering of unrolled inline-asm probing to [X86] Improve lowering of unrolled inline-asm probing.Mar 18 2021, 5:27 PM
nagisa edited the summary of this revision. (Show Details)
nagisa updated this revision to Diff 331741.Mar 18 2021, 6:34 PM

restore the code that handled overaligned allocs

nagisa published this revision for review.Mar 18 2021, 6:37 PM
nagisa added a subscriber: YangKeao.
Herald added a project: Restricted Project. · View Herald TranscriptMar 18 2021, 6:37 PM
nagisa retitled this revision from [X86] Improve lowering of unrolled inline-asm probing to [X86] Improve lowering of the unrolled inline-asm probing.Mar 18 2021, 6:37 PM
nagisa edited the summary of this revision. (Show Details)Mar 21 2021, 7:16 AM

Thanks for proposing this optimization. There's a reason (which is debatable) why we explicitly sub before moving :

Consider the following:

0000000000400400 <main>:
  400400:       c7 84 24 fc bf ff ff    movl   $0x1,-0x4004(%rsp)
  400407:       01 00 00 00 
  40040b:       c3                      retq

the mov accesses « unallocated » stack, which may be spotted as illegal by tools verifying memory accesses. When running valgrind on the above, I indeed get

==1335611== Invalid write of size 4
==1335611==    at 0x400400: main (in /tmp/a.out)
==1335611==  Address 0x1ffeffb534 is not stack'd, malloc'd or (recently) free'd
nagisa added a comment.EditedMar 22 2021, 6:45 AM

Ah, I see!

Well, this isn't strictly just an optimization. This sequence came to me as I was thinking how to make CFI metadata correct with least amount of work on my end. It seems that in presence of tools such as valgrind we don't really have any other choice than to carefully emit a -cfi_def_cfa_offset for each of the subs. Which is fine I guess, but leaves me not super happy about the code we emit.

Besides the accurate uwtables, there's another conflicting use-case here – signal handlers. If we sub and then probe, we may receive a signal (and allocate a stack slot for it) in a potentially invalid stack area. Whereas if we probe and then sub, any signal handlers that occur during probing would still execute on what is significantly more likely to be a valid stack.

So, I guess, my question here is this: Is the codegen responsible for generating code that's palatable to analysis tools, or are the analysis tools responsible for comprehending the code that they inspect?

Be careful about referencing stack locations below the red-zone.

The x86-64 ABI mandates a 128 bytes red zone. Your new instruction would effectively write a zero at a location below the red-zone.
That region of memory is not reserved, and should be considered volatile. For example: signals and interrupt handlers are allowed to modify it.

That's the reason why, for leaf functions, compilers always emit a SUB of RSP in the case of too big negative offsets (w.r.t. RSP).

Be careful about referencing stack locations below the red-zone.

The x86-64 ABI mandates a 128 bytes red zone. Your new instruction would effectively write a zero at a location below the red-zone.
That region of memory is not reserved, and should be considered volatile. For example: signals and interrupt handlers are allowed to modify it.

Isn't the volatility unimportant for the purposes of stack probing? The write exists only to poke the page (lightly) to ensure there's no page that is not rw, probing does not particularly care if the data at the relevant addresses are overwritten, or even that the byte is written to the address in the first place (as long as it triggers page permission checks). The only concern could be that some signal handler or an interrupt put some data there, but when these are running the probing itself is suspended, isn't it?

Be careful about referencing stack locations below the red-zone.

The x86-64 ABI mandates a 128 bytes red zone. Your new instruction would effectively write a zero at a location below the red-zone.
That region of memory is not reserved, and should be considered volatile. For example: signals and interrupt handlers are allowed to modify it.

Isn't the volatility unimportant for the purposes of stack probing? The write exists only to poke the page (lightly) to ensure there's no page that is not rw, probing does not particularly care if the data at the relevant addresses are overwritten, or even that the byte is written to the address in the first place (as long as it triggers page permission checks). The only concern could be that some signal handler or an interrupt put some data there, but when these are running the probing itself is suspended, isn't it?

I agree. If the goal is just doing stack probing then it should be fine.

Mine was more of a generic "be careful about the red-zone". Just to make sure that it was considered in this design. I didn't know about your particular use case scenario though.

serge-sans-paille added a comment.EditedMar 23 2021, 6:19 AM

Besides the accurate uwtables, there's another conflicting use-case here – signal handlers. If we sub and then probe, we may receive a signal (and allocate a stack slot for it) in a potentially invalid stack area. Whereas if we probe and then sub, any signal handlers that occur during probing would still execute on what is significantly more likely to be a valid stack.

If my understanding is correct, (and in the case where an alternate stack is not used) the signal handler begins with a push rbp so it should touch the stack and triggers the PAGE_GUARD... mmmh unless we're right at the end of the page guard after the sub, so maybe we should subq $4088 ?