This is an archive of the discontinued LLVM Phabricator instance.

[PEI] add dwarf information for stack probe
AbandonedPublic

Authored by YangKeao on Mar 17 2021, 9:14 AM.

Details

Summary

While probing stack, the stack register is moved without
dwarf information, which could cause panic if unwind the
backtrace at that point. This commit add dwarf information
for these operation, and use r11 (instead of rsp) to iterate
over pages to probe.

Diff Detail

Event Timeline

YangKeao created this revision.Mar 17 2021, 9:14 AM
YangKeao requested review of this revision.Mar 17 2021, 9:14 AM
Herald added a project: Restricted Project. · View Herald TranscriptMar 17 2021, 9:14 AM

There has been a bug report for this on bugzilla. A more "downstream" context for this feature is discussed in rust#83139.

nagisa added a subscriber: nagisa.
nagisa added inline comments.
llvm/test/CodeGen/X86/stack-clash-unknown-call.ll
18

The def_cfa_offset is superfluous here now. The last cfi directive should be either the adjust or def.

Generally, unlike the formats used on Windows etc., DWARF unwind isn't accurate in the prologue. I mean, you could add separate unwind info for each relevant instruction, but that would be a lot of data, and we currently don't have any option to do that.

Given that, I'm not sure what you're trying to accomplish here. I can't see how adding more .cfi_adjust_cfa_offset directives does anything useful.

Err, sorry, please pretend I didn't write that. I have no idea what I just wrote.

To clarify, I've done some more reading now, and figured out where I went wrong. For a long time, LLVM did not emit accurate unwind info to describe the prologue/epilogue (and still doesn't on some targets), so I was under the impression it wasn't possible. Clearly, it is, and it's implemented on x86.


The change to use r11 isn't implemented correctly: we can't adjust the stack pointer until *after* we've probed the relevant pages. It'll appear to work, but it won't actually provide complete protection if a signal handler triggers at the wrong time.

YangKeao added a comment.EditedMar 17 2021, 10:22 PM

To clarify, I've done some more reading now, and figured out where I went wrong. For a long time, LLVM did not emit accurate unwind info to describe the prologue/epilogue (and still doesn't on some targets), so I was under the impression it wasn't possible. Clearly, it is, and it's implemented on x86.


The change to use r11 isn't implemented correctly: we can't adjust the stack pointer until *after* we've probed the relevant pages. It'll appear to work, but it won't actually provide complete protection if a signal handler triggers at the wrong time.

Make sense. I tried to use r11 + offset to represent CFA temporarily. However, r11d cannot be used as a dwarf register on x86_32. Can I use another register (like di) here?

YangKeao updated this revision to Diff 331477.Mar 18 2021, 12:40 AM
  • Use rdi to represent the stack bound and CFA
  • Remove extra tailing offset adjust
YangKeao marked an inline comment as done.Mar 18 2021, 12:44 AM
YangKeao added inline comments.
llvm/test/CodeGen/X86/stack-clash-unknown-call.ll
18

The def_cfa_offset is superfluous here now. The last cfi directive should be either the adjust or def.

18

Fixed by removing the last adjust

YangKeao updated this revision to Diff 331503.Mar 18 2021, 3:21 AM

reformat the code

Make sense. I tried to use r11 + offset to represent CFA temporarily. However, r11d cannot be used as a dwarf register on x86_32.

What registers can be used? I did a quick search and couldn't find anything.

Can I use another register (like di) here?

Any register that isn't callee-preserved could be used (or any register in general if its spilled first), assuming it isn't used for something else already. I did a quick search in an effort to figure out which registers are callee-saved on x86, but couldn't find anything definitive :(

This makes me wonder, though: why not leave selection of the register to use here to regalloc?

llvm/test/CodeGen/X86/stack-clash-large.ll
22

I… think this wants to be a def_cfa_offset? def_cfa_register does not reset the offset so its not at all obvious what this is offsetting from.

Alternatively there's a form that combines both setting the new register and the offset into a single directive:

.cfi_def_cfa %rdi, 69632
YangKeao added a comment.EditedMar 18 2021, 8:45 AM

What registers can be used? I did a quick search and couldn't find anything.

I found the available dwarf registers under the x86 register table in LLVM (X86RegisterInfo.td).

Any register that isn't callee-preserved could be used (or any register in general if its spilled first), assuming it isn't used for something else already. I did a quick search in an effort to figure out which registers are callee-saved on x86, but couldn't find anything definitive :(

It seems that different platform will have different set of callee-preserved register. I also found them in the list in LLVM (which is refered in the function getCalleeSavedRegs of llvm/lib/Target/X86/X86RegisterInfo.cpp). Sadly, it seems like RDI is more likely to be a callee-saved register.

Only considering the SYSV and Windows, all callee-saved registers are:

X86::RBX, X86::R12, X86::R13, X86::R14, X86::R15, X86::RBP
X86::RBX, X86::RBP, X86::RDI, X86::RSI, X86::R12, X86::R13, X86::R14, X86::R15, X86::XMM6, X86::XMM7, X86::XMM8, X86::XMM9, X86::XMM10, X86::XMM11, X86::XMM12, X86::XMM13, X86::XMM14, X86::XMM15

And all available dwarf registers are:

32bit:
EAX, EDX, ECX, EBX, ESI, EDI, EBP, ESP, EIP
64bit, X86-64:
RAX, RDX, RCX, RBX, RSI, RDI, RBP, RSP, R8-R15, RIP

For 64bit, RAX, RCX, RDX, R8-R11 could be a good choice, and for 32 bit, it could only be choosen from EAX, ECX, EDX

This makes me wonder, though: why not leave selection of the register to use here to regalloc?

Sounds like a good idea. But I'm not familiar with LLVM codes. Is there any example on how to use the regalloc?

(I don't know whether it will make this patch too complicate. Allocating stack probe register with regalloc and fixing the dwarf sound like two things and I'd like to do it in a separated patch.)

llvm/test/CodeGen/X86/stack-clash-large.ll
22

The problem is that I don't know the accurate offset. When the callee-saved registers are pushed to the stack, there will be an offset before probing the stack (if I understand the prolog part correctly, please tell me if I'm wrong).

YangKeao updated this revision to Diff 331583.Mar 18 2021, 9:08 AM

Use RDX/EDX instead of RDI/EDI, as RDI/EDI is callee saved on Win64 calling convension

Any register that isn't callee-preserved could be used (or any register in general if its spilled first), assuming it isn't used for something else already. I did a quick search in an effort to figure out which registers are callee-saved on x86, but couldn't find anything definitive :(

It seems that different platform will have different set of callee-preserved register. I also found them in the list in LLVM (which is refered in the function getCalleeSavedRegs of llvm/lib/Target/X86/X86RegisterInfo.cpp). Sadly, it seems like RDI is more likely to be a callee-saved register.

Only considering the SYSV and Windows, all callee-saved registers are:

X86::RBX, X86::R12, X86::R13, X86::R14, X86::R15, X86::RBP
X86::RBX, X86::RBP, X86::RDI, X86::RSI, X86::R12, X86::R13, X86::R14, X86::R15, X86::XMM6, X86::XMM7, X86::XMM8, X86::XMM9, X86::XMM10, X86::XMM11, X86::XMM12, X86::XMM13, X86::XMM14, X86::XMM15

And all available dwarf registers are:

32bit:
EAX, EDX, ECX, EBX, ESI, EDI, EBP, ESP, EIP
64bit, X86-64:
RAX, RDX, RCX, RBX, RSI, RDI, RBP, RSP, R8-R15, RIP

For 64bit, RAX, RCX, RDX, R8-R11 could be a good choice, and for 32 bit, it could only be choosen from EAX, ECX, EDX

A care must be taken to not overwrite the arguments as well. For instance on SysV x86_64 ABI rdi, rsi, rdx, rcx, r8, r9 are used to pass in integer arguments. For functions with a small number of arguments one of these could be reused, but if a function happens to use all of them, unconditional use of rdx would clobber the argument.

Sounds like a good idea. But I'm not familiar with LLVM codes. Is there any example on how to use the regalloc?

(I don't know whether it will make this patch too complicate. Allocating stack probe register with regalloc and fixing the dwarf sound like two things and I'd like to do it in a separated patch.)

You would probably have to split the patches up into distinct parts, yes (first one adjusting the backend to allocate virtual registers and the next being this one).

I'll get back to you with advice on how to specify a virtual register a little bit later –

nagisa added a comment.EditedMar 18 2021, 10:01 AM

See createVirtualRegister.

YangKeao updated this revision to Diff 331618.Mar 18 2021, 10:38 AM

Use RAX/EAX as the iterate register, as RDX/EDX is used as arguments under systemv

YangKeao added a comment.EditedMar 18 2021, 10:39 AM

A care must be taken to not overwrite the arguments as well. For instance on SysV x86_64 ABI rdi, rsi, rdx, rcx, r8, r9 are used to pass in integer arguments. For functions with a small number of arguments one of these could be reused, but if a function happens to use all of them, unconditional use of rdx would clobber the argument.

Nice catch! After eliminating RCX, RDX from the selections, it seems that the only possible "always correct" choice is RAX (being used as the return value doesn't bother).

See createVirtualRegister.

Thanks. I have seen some usage of this function. It seems that there isn't any suitable RegClass for this situation ("dwarf suitable registers class", is there an equivalent one?). Should I create one by myself?

I investigated this a little bit, and it seems like using createVirtualRegister won't work here, after all. It seems that this code does in fact run after regalloc happens. I'm out of ideas, but I also am not that familiar with this area to give any useful advice or guidance here.

Normally, I'd expect some register is naturally free in the prologue, but you could get into weird situations. On 32-bit specifically, consider compiling with -mregparm=3; I think there are no registers which are unconditionally safe in that case. One possibility is to always use EAX, and just save/restore it if necessary. See isEAXAlive in X86FrameLowering::emitPrologue.

Alternatively, you could ensure that some callee-save GPR is spilled, and explicitly use that register. This is taking advantage of the fact this is part of the prologue: there can't be any other uses of callee-save registers at that point. (In theory, it might be possibly for an exotic calling convention to have no callee-save registers, but I don't think there are any in practice.)

Outside the prologue, the allocation should be represented by some instruction; that instruction should clobber some register, and regalloc will ensure that register is free.

efriedma added inline comments.Mar 18 2021, 9:24 PM
llvm/test/CodeGen/X86/stack-clash-large.ll
38

BTW, this is completely broken; r11d doesn't exist on 32-bit x86.

YangKeao added inline comments.Mar 18 2021, 10:39 PM
llvm/test/CodeGen/X86/stack-clash-large.ll
38

Wow, surprising discovery. (I think) a "bad register name" should be given when compiling this codes?

Is there any pass which will omit this problem? Run clang -fstack-clash-protection -m32 -fomit-frame-pointer -S will generate codes containing r11d, which is bad. However, clang -fstack-clash-protection -m32 -fomit-frame-pointer -c and disassemble the output, the register used here will become ebx.

Normally, I'd expect some register is naturally free in the prologue, but you could get into weird situations. On 32-bit specifically, consider compiling with -mregparm=3; I think there are no registers which are unconditionally safe in that case. One possibility is to always use EAX, and just save/restore it if necessary. See isEAXAlive in X86FrameLowering::emitPrologue.

Alternatively, you could ensure that some callee-save GPR is spilled, and explicitly use that register. This is taking advantage of the fact this is part of the prologue: there can't be any other uses of callee-save registers at that point. (In theory, it might be possibly for an exotic calling convention to have no callee-save registers, but I don't think there are any in practice.)

Both solutions seems to be either complicate or with extra cost. And surprisingly found that the original implementation of stack probe is wrong on 32bit. Given that, I preferred to only provide DWARF information in 64bit situation (and left a comment), so that r11 can be used and solve this problem easily.

YangKeao updated this revision to Diff 331774.EditedMar 18 2021, 11:59 PM

Remove dwarf information for 32bit and use R11 as the iterate bound / dwarf register

YangKeao updated this revision to Diff 331775.Mar 19 2021, 12:13 AM

left comments about 32bit

btw I prototyped a D98906: [X86] Improve lowering of the unrolled inline-asm probing yesterday as an alternative approach towards improving the unrolled case.

btw I prototyped a D98906: [X86] Improve lowering of the unrolled inline-asm probing yesterday as an alternative approach towards improving the unrolled case.

Great! It seems better. I will rebase on it.

Looks like my alternative has caveats of its own. Can you please split your change into two parts? One that affects only the unrolled case (which I believe should be good to land) and the other which affects the loop case (potentially subject to further discussion).

@YangKeao Will you be pursuing this further? Should I take over this for you?

@YangKeao Will you be pursuing this further? Should I take over this for you?

Oops. Sorry, I missed the former comment "Looks like my alternative has caveats of its own.". I will split this patch into two parts right now. Thanks.

YangKeao abandoned this revision.Apr 15 2021, 11:02 AM

Stack probing with loop should be discussed further in D99585.