This is an archive of the discontinued LLVM Phabricator instance.

[X86] Implement -fzero-call-used-regs option
ClosedPublic

Authored by void on Sep 30 2021, 11:40 AM.

Details

Summary

The "-fzero-call-used-regs" option tells the compiler to zero out
certain registers before the function returns. It's also available as a
function attribute: zero_call_used_regs.

The two upper categories are:

  • "used": Zero out used registers.
  • "all": Zero out all registers, whether used or not.

The individual options are:

  • "skip": Don't zero out any registers. This is the default.
  • "used": Zero out all used registers.
  • "used-arg": Zero out used registers that are used for arguments.
  • "used-gpr": Zero out used registers that are GPRs.
  • "used-gpr-arg": Zero out used GPRs that are used as arguments.
  • "all": Zero out all registers.
  • "all-arg": Zero out all registers used for arguments.
  • "all-gpr": Zero out all GPRs.
  • "all-gpr-arg": Zero out all GPRs used for arguments.

This is used to help mitigate Return-Oriented Programming exploits.

Diff Detail

Event Timeline

void created this revision.Sep 30 2021, 11:40 AM
void requested review of this revision.Sep 30 2021, 11:40 AM
Herald added projects: Restricted Project, Restricted Project. · View Herald TranscriptSep 30 2021, 11:40 AM
void added a comment.Sep 30 2021, 11:41 AM

The way I select the registers to zero out seems rather fragile. Is there a nicer way to do this?

clang/include/clang/Basic/AttrDocs.td
6277

s/funcditon/function/

:set spell in vim.

6285

Might be worth a note that skip is helpful for disabling previously occurring instances of the command line flag. Sometimes toolchains have wrapper scripts that force on flags, or are built with the implicit default changed, so being able to explicitly disable a feature can be helpful.

clang/lib/CodeGen/CGCall.cpp
2207

The method name has some duplicate information. Perhaps KindToStr would be more concise?

clang/lib/Sema/SemaDeclAttr.cpp
7856

capital K Kind?

clang/test/CodeGen/zero-call-used-regs.c
220

I don't think any of the expansion sites use a variable number of parameters, or more than one?

clang/test/Sema/zero_call_used_regs.c
7

How about a test for:

void failure __attribute__((zero_call_used_reg("used"),zero_call_used_reg("used-gpr")));
nickdesaulniers requested changes to this revision.Sep 30 2021, 12:55 PM

We'll probably need to investigate code gen a little.

A mainline linux kernel defconfig built with CONFIG_ZERO_CALL_USED_REGS=y enabled doesn't boot, for example. I consider that a blocker before landing this (much-appreciated) feature; marking it as such. (Though it's possible that there are TUs in the kernel that may need to be built with -fzero-call-used-regs=skip that aren't (yet) failing with GCC). I don't observe the kernel getting to start_kernel, which is the arch-agnostic entry point of the boot; so there's likely x86 specific C code invoked before the generic boot code takes over that may not play well with the instrumentation as implemented.

This revision now requires changes to proceed.Sep 30 2021, 12:55 PM
void updated this revision to Diff 376386.Sep 30 2021, 3:31 PM
void marked 3 inline comments as done.

Fixed some spelling and variable names.

void added a comment.Sep 30 2021, 3:31 PM

Hmm...not working on Linux sounds bad. I'll investigate.

clang/include/clang/Basic/AttrDocs.td
6285

Doesn't that apply to all (or most) flags though? I don't think they mention that you can override previously set flags...

clang/test/Sema/zero_call_used_regs.c
7

Should that fail? It seems like the second instance of the attribute should override the first one.

void added inline comments.Sep 30 2021, 3:33 PM
clang/include/clang/Basic/AttrDocs.td
6285

Oh, I see what you mean now. I'll see what I can do.

RKSimon resigned from this revision.Oct 22 2021, 9:13 AM
RKSimon added a reviewer: pengfei.
void added a comment.Oct 25 2021, 3:06 PM

We'll probably need to investigate code gen a little.

A mainline linux kernel defconfig built with CONFIG_ZERO_CALL_USED_REGS=y enabled doesn't boot, for example. I consider that a blocker before landing this (much-appreciated) feature; marking it as such. (Though it's possible that there are TUs in the kernel that may need to be built with -fzero-call-used-regs=skip that aren't (yet) failing with GCC). I don't observe the kernel getting to start_kernel, which is the arch-agnostic entry point of the boot; so there's likely x86 specific C code invoked before the generic boot code takes over that may not play well with the instrumentation as implemented.

I found a likely issue. In arch/x86/kernel/e820.c, the function cpcompare returns a value, but we generate xorq %rax, %rax before returning. So not great. The issue is telling which registers are "live out" of the function (or exit block) so that we don't accidentally zero them out.

void updated this revision to Diff 382155.Oct 25 2021, 4:46 PM

Count sub/super registers as "uses" in terminating instructions.

check-llvm does not pass with this revision nor does it resolve the boot failure that Nick mentioned, just in case that was the intention of the revision :) I ran the kernel through gdb and it still does not appear to get to start_kernel.

void added a comment.Oct 26 2021, 11:31 AM

check-llvm does not pass with this revision nor does it resolve the boot failure that Nick mentioned, just in case that was the intention of the revision :) I ran the kernel through gdb and it still does not appear to get to start_kernel.

Yeah. That was an interim revision. I'm not happy with the register selection code. I still think it needs to be more generic, like in GCC. I need to do that before this feature can go in.

pengfei added inline comments.Oct 26 2021, 8:05 PM
llvm/lib/Target/X86/X86FrameLowering.cpp
549

Do we need to consider VR128XRegClass and VR256XRegClass when hasVLX()?

558

And AMX registers?
GCC also cleans mask(k0~k7) registers. But doesn't clean caller saved gpr registers: https://godbolt.org/z/6xcPh19hv
How about 32 bits? GCC doesn't clean the SSE registers on 32 bits.

609

Why just GPRs?

llvm/lib/Target/X86/X86RegisterInfo.cpp
621

Can we get this info from calling conversion?
Different calling conversions may use different argument registers.

void updated this revision to Diff 389101.Nov 23 2021, 12:11 AM

WIP: Move register selection to a platform-generic place.

Not ready for review just yet.

void updated this revision to Diff 389922.Nov 25 2021, 11:24 PM

WIP: Implement zeroing out registers.

void updated this revision to Diff 403815.Jan 27 2022, 3:36 PM

WIP: Updated to make selecing registers to zero out more generic. Combined a
few tests into one, and made the logic emitting the "xors" a bit nicer.

void updated this revision to Diff 403906.Jan 28 2022, 12:07 AM

WIP: Modify the zeroing of GPRs to be less gross.

void updated this revision to Diff 404236.Jan 28 2022, 11:17 PM

Move the zeroing of registers to before the pop instructions.

void added a comment.Jan 28 2022, 11:25 PM

Hey, y'all!

I've finished the initial part of this feature. I compiled a Linux kernel with it, and it booted via qemu.

Please give it a go and let me know what you think.

void updated this revision to Diff 404326.Jan 29 2022, 8:12 PM

Don't zero out all sub and super registers that are callee-saved.

void updated this revision to Diff 404336.Jan 29 2022, 9:28 PM

Make this only for x86.

void updated this revision to Diff 404754.Jan 31 2022, 3:26 PM

Generate the "isGeneralPurposeRegister" and "isFixedRegister" predicates from
the .td file.

craig.topper added inline comments.Jan 31 2022, 3:31 PM
llvm/lib/Target/X86/X86RegisterInfo.td
647

Should we still have this comment from the non-tablegen implementation "The following may not be "fixed" registers, but we don't want them anyway." applied to the VK*PAIRS and TILE?

craig.topper added inline comments.Jan 31 2022, 3:38 PM
llvm/lib/CodeGen/PrologEpilogInserter.cpp
1182

Should the frontend avoid putting the attribute on "main" instead of making a special case in the backend?

llvm/test/CodeGen/X86/zero-call-used-regs-fmod.ll
16

Is it ok that this popq popped garbage into the %rax after you cleared it? It's not the return value of the function. That's xmm0. This is just a popq because its a shorter encoding than sub 8, %esp

Neat use of tablegen; will take me a bit to wrap my head around it. I'll give this a shot on some kernel builds first thing tomorrow!

llvm/include/llvm/CodeGen/MachineRegisterInfo.h
232–240

are these methods on MachineRegisterInfo used anywhere? It looks like only the ones on TargetRegisterInfo are, IIUC?

Oh, are these related to the .td additions? Something seems asymmetric with these three. Like perhaps we only need isFixedRegister here?

llvm/lib/CodeGen/MachineRegisterInfo.cpp
655–668

are these methods on MachineRegisterInfo used anywhere? It looks like only the ones on TargetRegisterInfo are, IIUC?

llvm/lib/CodeGen/PrologEpilogInserter.cpp
1216–1218

while I'm definitely a fan of the functional abstractions in llvm/include/llvm/ADT/STLExtras.h, llvm::for_each is perhaps my least favorite (when compared to llvm::all_of, llvm::any_of, or llvm::none_of). Consider using just a range-for here (and below) if it would be more concise.

1257
for (const MachineOperand &MO : MI.operands()) {
1269
for (MachineBasicBlock *RestoreBlock : RestoreBlocks)
llvm/lib/Target/X86/X86RegisterInfo.cpp
621

when working on the ppc issue recently (https://reviews.llvm.org/D116424), I recall there being an existing check for whether a register was callee saved.

RegisterClassInfo::getLastCalleeSavedAlias

Can that be reused here or in any of the below?

Do we need to zero stack slots when arguments are passed on the stack (after exhausting the "argument registers?"

llvm/lib/Target/X86/X86RegisterInfo.cpp
621

oops, my last comment here was a draft that I meant to delete before submitting all my previous comments. Feel free to ignore.

void marked 2 inline comments as done.Jan 31 2022, 4:05 PM
void added inline comments.
llvm/lib/CodeGen/PrologEpilogInserter.cpp
1182

Sure.

llvm/test/CodeGen/X86/zero-call-used-regs-fmod.ll
16

It's getting the value back from the pushq %rax instruction, isn't it? We shouldn't be clearing it, though. I'll see if I can fix that.

void updated this revision to Diff 404761.Jan 31 2022, 4:06 PM

Remove "zero-call-used-regs" attribute in front-end if it's "main".

void updated this revision to Diff 404763.Jan 31 2022, 4:08 PM

Remove attribute from "main" function.

void marked 3 inline comments as done.Jan 31 2022, 4:20 PM
void added inline comments.
llvm/include/llvm/CodeGen/MachineRegisterInfo.h
232–240

isFixedRegister is there :-)

I added these to MRI for completeness. It looks like all such predicates are reflected in both places.

llvm/lib/CodeGen/PrologEpilogInserter.cpp
1269

I prefer auto unless absolutely necessary (or to avoid confusion).

craig.topper added inline comments.Jan 31 2022, 4:29 PM
llvm/test/CodeGen/X86/zero-call-used-regs-fmod.ll
16

There was a mistake in my previous comment. The popq is being used in place of add 8, %rsp

In this case it probably is the value from the pushq. But %rax isn't a function argument so it is effectively garbage data. This is X86FrameLowering::emitSPUpdate scavenging %rax to use a pushq/popq instead of subtracting or adding to %rsp.

I think this is being done to align the stack for the call to fmod, but I'm not sure.

I think we can use a pushq like this to allocate stack space that gets overwritten before the popq.

Hey! Looks like Diff 404763 for an x86 defconfig plus CONFIG_ZERO_CALL_USED_REGS=y starts booting! Looks like it panics though trying to launch init(pid 1) though.

[    0.702163] Run /bin/sh as init process
[    0.702913] Failed to execute /bin/sh (error -22)
[    0.703721] Run /sbin/init as init process
[    0.704454] Starting init: /sbin/init exists but couldn't execute it (error -22)
[    0.705702] Run /etc/init as init process
[    0.706390] Run /bin/init as init process
[    0.707037] Run /bin/sh as init process
[    0.707736] Starting init: /bin/sh exists but couldn't execute it (error -22)
[    0.708895] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    0.711246] CPU: 1 PID: 1 Comm: sh Not tainted 5.16.0-12116-g74e25defe135 #6
[    0.712426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[    0.713778] Call Trace:
[    0.714188]  <TASK>
[    0.714578]  dump_stack_lvl+0x65/0x9a
[    0.715190]  panic+0x101/0x295
[    0.715743]  ? _printk+0x54/0x80
[    0.716297]  ? rest_init+0xd0/0xd0
[    0.716882]  kernel_init+0x18b/0x190
[    0.717525]  ret_from_fork+0x22/0x30
[    0.718138]  </TASK>
[    0.721159] Kernel Offset: 0xe800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    0.722977] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---

If I disable CONFIG_ZERO_CALL_USED_REGS=y, I can launch userspace just fine. (Error -22 corresponds to EINVAL FWIW). Perhaps some kernel code related to launching init (or processes, generally) isn't happy with a zero'd register somewhere, or we have another codegen bug. Either way, that needs to be investigated+fixed before merging. Probably could sprinkle subdir-ccflags-y += -fzero_call_used_regs=skip (or whatever) in various Makefiles to pinpoint which object file is affected, then take a look at the disassembly from there. (or add function attributes to get more fine grained).

It looks like the Kconfig detection for CONFIG_CC_HAS_ZERO_CALL_USED_REGS is working correctly now. i.e.

$ ARCH=arm64 make LLVM=1 -j72 defconfig
$ grep -rn ZERO_CALL_USED_REGS .config
$ make LLVM=1 -j72 defconfig
$ grep -rn ZERO_CALL_USED_REGS .config
4230:CONFIG_CC_HAS_ZERO_CALL_USED_REGS=y
4231:# CONFIG_ZERO_CALL_USED_REGS is not set

(LGTM)

llvm/lib/CodeGen/PrologEpilogInserter.cpp
1269

What does the style guide say?
https://llvm.org/docs/CodingStandards.html#use-auto-type-deduction-to-make-code-more-readable

Don’t “almost always” use auto

At the very least you aught to use auto * here rather than auto.
https://llvm.org/docs/CodingStandards.html#beware-unnecessary-copies-with-auto

arsenm added inline comments.Feb 1 2022, 4:22 PM
llvm/lib/CodeGen/PrologEpilogInserter.cpp
1259

No reference

1269

I think auto is worse in 95% of situations

void updated this revision to Diff 405117.Feb 1 2022, 4:26 PM
  • Don't use "llvm::for_each" or "auto" as much.
  • Don't zero out registers that are restored before exit.
void updated this revision to Diff 405119.Feb 1 2022, 4:29 PM
void marked 5 inline comments as done.

Remove debugging instruction.

llvm/lib/CodeGen/PrologEpilogInserter.cpp
1269

Okay.

void added a comment.Feb 1 2022, 4:43 PM

Hey! Looks like Diff 404763 for an x86 defconfig plus CONFIG_ZERO_CALL_USED_REGS=y starts booting! Looks like it panics though trying to launch init(pid 1) though.

[    0.702163] Run /bin/sh as init process
[    0.702913] Failed to execute /bin/sh (error -22)
[    0.703721] Run /sbin/init as init process
[    0.704454] Starting init: /sbin/init exists but couldn't execute it (error -22)
[    0.705702] Run /etc/init as init process
[    0.706390] Run /bin/init as init process
[    0.707037] Run /bin/sh as init process
[    0.707736] Starting init: /bin/sh exists but couldn't execute it (error -22)
[    0.708895] Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance.
[    0.711246] CPU: 1 PID: 1 Comm: sh Not tainted 5.16.0-12116-g74e25defe135 #6
[    0.712426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[    0.713778] Call Trace:
[    0.714188]  <TASK>
[    0.714578]  dump_stack_lvl+0x65/0x9a
[    0.715190]  panic+0x101/0x295
[    0.715743]  ? _printk+0x54/0x80
[    0.716297]  ? rest_init+0xd0/0xd0
[    0.716882]  kernel_init+0x18b/0x190
[    0.717525]  ret_from_fork+0x22/0x30
[    0.718138]  </TASK>
[    0.721159] Kernel Offset: 0xe800000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    0.722977] ---[ end Kernel panic - not syncing: No working init found.  Try passing init= option to kernel. See Linux Documentation/admin-guide/init.rst for guidance. ]---

If I disable CONFIG_ZERO_CALL_USED_REGS=y, I can launch userspace just fine. (Error -22 corresponds to EINVAL FWIW). Perhaps some kernel code related to launching init (or processes, generally) isn't happy with a zero'd register somewhere, or we have another codegen bug. Either way, that needs to be investigated+fixed before merging. Probably could sprinkle subdir-ccflags-y += -fzero_call_used_regs=skip (or whatever) in various Makefiles to pinpoint which object file is affected, then take a look at the disassembly from there. (or add function attributes to get more fine grained).

Strange. It boots to a login prompt for me under QEMU....

Diff 405119 still kernel panics for me. This is building the linux-next tree.

$ qemu-system-x86_64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-1)
llvm/lib/Target/X86/X86FrameLowering.cpp
516

Dead store to MBBI

return FirstCSPop;

void updated this revision to Diff 405137.Feb 1 2022, 6:50 PM

Remove dead assign.

void added a comment.Feb 2 2022, 1:21 PM

Diff 405119 still kernel panics for me. This is building the linux-next tree.

$ qemu-system-x86_64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-1)

Okay. I can replicate this in llvm-next. Investigating.

void updated this revision to Diff 405531.Feb 2 2022, 11:30 PM

Make sure we're zeroing the registers in the return block, not just some random
block.

void added a comment.Feb 2 2022, 11:31 PM

Diff 405119 still kernel panics for me. This is building the linux-next tree.

$ qemu-system-x86_64 --version
QEMU emulator version 6.2.0 (Debian 1:6.2+dfsg-1)

Works now. PTAL.

The latest revision allows me to boot a kernel in QEMU now but that same kernel does not boot on bare metal. I'll see if I can narrow down the problematic translation unit with Nick's subdir-ccflags-y trick.

This diff allows me to boot on bare metal as of v5.17-rc2:

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 6aef9ee28a39..8ee176dac669 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -125,6 +125,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o

 obj-$(CONFIG_KVM_GUEST)                += kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)         += paravirt.o
+CFLAGS_paravirt.o += -fzero-call-used-regs=skip
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)   += pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

I have uploaded the config used, the preprocessed file, and the "good" object (with the following diff) and the "bad" object (without the above diff) here: https://github.com/nathanchance/bug-files/tree/052a31e6d94c1b349cf6f3128087944444dace24/D110869

If there is any more information I can give, please let me know!

void added a comment.Feb 3 2022, 1:54 PM

This diff allows me to boot on bare metal as of v5.17-rc2:

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 6aef9ee28a39..8ee176dac669 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -125,6 +125,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o

 obj-$(CONFIG_KVM_GUEST)                += kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)         += paravirt.o
+CFLAGS_paravirt.o += -fzero-call-used-regs=skip
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)   += pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

I have uploaded the config used, the preprocessed file, and the "good" object (with the following diff) and the "bad" object (without the above diff) here: https://github.com/nathanchance/bug-files/tree/052a31e6d94c1b349cf6f3128087944444dace24/D110869

If there is any more information I can give, please let me know!

Thanks for the information!

Could you test this patch for me? (Applied over the patch in this review.)

diff --git a/llvm/lib/CodeGen/PrologEpilogInserter.cpp b/llvm/lib/CodeGen/PrologEpilogInserter.cpp
index 968a14548813..46ae48bd6a3c 100644
--- a/llvm/lib/CodeGen/PrologEpilogInserter.cpp
+++ b/llvm/lib/CodeGen/PrologEpilogInserter.cpp
@@ -1217,7 +1217,8 @@ void PEI::insertZeroCallUsedRegs(MachineFunction &MF) {
             continue;

           MCRegister Reg = MO.getReg();
-          if ((MO.isDef() || MO.isUse()) && AllocatableSet[Reg])
+          if (AllocatableSet[Reg] && !MO.isImplicit() &&
+              (MO.isDef() || MO.isUse()))
             UsedRegs.set(Reg);
         }

This diff allows me to boot on bare metal as of v5.17-rc2:

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index 6aef9ee28a39..8ee176dac669 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -125,6 +125,7 @@ obj-$(CONFIG_DEBUG_NMI_SELFTEST) += nmi_selftest.o

 obj-$(CONFIG_KVM_GUEST)                += kvm.o kvmclock.o
 obj-$(CONFIG_PARAVIRT)         += paravirt.o
+CFLAGS_paravirt.o += -fzero-call-used-regs=skip
 obj-$(CONFIG_PARAVIRT_SPINLOCKS)+= paravirt-spinlocks.o
 obj-$(CONFIG_PARAVIRT_CLOCK)   += pvclock.o
 obj-$(CONFIG_X86_PMEM_LEGACY_DEVICE) += pmem.o

I have uploaded the config used, the preprocessed file, and the "good" object (with the following diff) and the "bad" object (without the above diff) here: https://github.com/nathanchance/bug-files/tree/052a31e6d94c1b349cf6f3128087944444dace24/D110869

If there is any more information I can give, please let me know!

Thanks for the information!

Could you test this patch for me? (Applied over the patch in this review.)

diff --git a/llvm/lib/CodeGen/PrologEpilogInserter.cpp b/llvm/lib/CodeGen/PrologEpilogInserter.cpp
index 968a14548813..46ae48bd6a3c 100644
--- a/llvm/lib/CodeGen/PrologEpilogInserter.cpp
+++ b/llvm/lib/CodeGen/PrologEpilogInserter.cpp
@@ -1217,7 +1217,8 @@ void PEI::insertZeroCallUsedRegs(MachineFunction &MF) {
             continue;

           MCRegister Reg = MO.getReg();
-          if ((MO.isDef() || MO.isUse()) && AllocatableSet[Reg])
+          if (AllocatableSet[Reg] && !MO.isImplicit() &&
+              (MO.isDef() || MO.isUse()))
             UsedRegs.set(Reg);
         }

I tested this patch but it did not change the hang.

void added a comment.Feb 3 2022, 2:34 PM

I tested this patch but it did not change the hang.

Okay. We need to determine which function is failing. Could you do something like this:

#define zcur __attribute__((zero_call_used_regs("used-gpr")))

and then add zcur to each function, while compiling the module with -fzero-call-used-regs=skip. Then remove the zcur attribute from functions until it works? I know it's a PITA, but otherwise I'm just guessing at which function's causing the failure...

It looks like _paravirt_ident_64() is the problematic function. This diff on top of v5.17-rc2 allows me to boot:

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 4420499f7bb4..c1b68504136c 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -96,7 +96,7 @@ static unsigned paravirt_patch_call(void *insn_buff, const void *target,

 #ifdef CONFIG_PARAVIRT_XXL
 /* identity function, which can be inlined */
-u64 notrace _paravirt_ident_64(u64 x)
+u64 notrace __attribute__((zero_call_used_regs("skip"))) _paravirt_ident_64(u64 x)
 {
        return x;
 }

Rather interesting function to have problems with as a result of this patch but it seems like this function is being used in a very specific way further down the file with the __PV_IS_CALLEE_SAVE macro.

void added a comment.Feb 3 2022, 5:31 PM

It looks like _paravirt_ident_64() is the problematic function. This diff on top of v5.17-rc2 allows me to boot:

diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 4420499f7bb4..c1b68504136c 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -96,7 +96,7 @@ static unsigned paravirt_patch_call(void *insn_buff, const void *target,

 #ifdef CONFIG_PARAVIRT_XXL
 /* identity function, which can be inlined */
-u64 notrace _paravirt_ident_64(u64 x)
+u64 notrace __attribute__((zero_call_used_regs("skip"))) _paravirt_ident_64(u64 x)
 {
        return x;
 }

Rather interesting function to have problems with as a result of this patch but it seems like this function is being used in a very specific way further down the file with the __PV_IS_CALLEE_SAVE macro.

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
.Lfunc_begin2:
        .loc    2 100 0 is_stmt 1               # arch/x86/kernel/paravirt.c:100:0
        .cfi_startproc
# %bb.0:                                # %entry
        #DEBUG_VALUE: _paravirt_ident_64:x <- $rdi
        movq    %rdi, %rax
.Ltmp21:
        .loc    2 101 2 prologue_end            # arch/x86/kernel/paravirt.c:101:2
        xorq    %rdi, %rdi
.Ltmp22:
        #DEBUG_VALUE: _paravirt_ident_64:x <- $rax
        retq
.Ltmp23:
.Lfunc_end2:
        .size   _paravirt_ident_64, .Lfunc_end2-_paravirt_ident_64
        .cfi_endproc

GCC:
_paravirt_ident_64:
# arch/x86/kernel/paravirt.c:100: {
        movq    %rdi, %rax      # tmp85, x
# arch/x86/kernel/paravirt.c:102: }
        xorl    %edi, %edi      #
        ret
        .size   _paravirt_ident_64, .-_paravirt_ident_64

I'm a bit confused...

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
        movq    %rdi, %rax
        xorq    %rdi, %rdi
        retq

GCC:
_paravirt_ident_64:
        movq    %rdi, %rax      # tmp85, x
        xorl    %edi, %edi      #
        ret

Does xorl not zero the upper 32b?

Definitely something magical about this function. Perhaps it should have the function attribute to disable zeroing added to the kernel sources?

tools/objtool/objtool.c mentions something about _paravirt_ident_64 and paravirt_patch.

I'm curious @nathanchance if you move the definition of _paravirt_ident_64 to an external assembler file, and see whether the mere xorl vs xorq makes a difference for some reason for boot?

void added a comment.Feb 3 2022, 11:05 PM

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
        movq    %rdi, %rax
        xorq    %rdi, %rdi
        retq

GCC:
_paravirt_ident_64:
        movq    %rdi, %rax      # tmp85, x
        xorl    %edi, %edi      #
        ret

Does xorl not zero the upper 32b?

I'm thinking no. But it's odd, because both are using %rdi but GCC is only zeroing out the bottom 32-bits. Seems a bit counter-intuitive.

Definitely something magical about this function. Perhaps it should have the function attribute to disable zeroing added to the kernel sources?

tools/objtool/objtool.c mentions something about _paravirt_ident_64 and paravirt_patch.

I'm curious @nathanchance if you move the definition of _paravirt_ident_64 to an external assembler file, and see whether the mere xorl vs xorq makes a difference for some reason for boot?

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
        movq    %rdi, %rax
        xorq    %rdi, %rdi
        retq

GCC:
_paravirt_ident_64:
        movq    %rdi, %rax      # tmp85, x
        xorl    %edi, %edi      #
        ret

Does xorl not zero the upper 32b?

I'm thinking no. But it's odd, because both are using %rdi but GCC is only zeroing out the bottom 32-bits. Seems a bit counter-intuitive.

Every write to a 32-bit register on x86-64 zeros bits 63:32 of the register. xorl %edi, %edi has the same behavior as xorq %rdi, %rdi, but is 1 byte shorter to encode.

void added a comment.Feb 4 2022, 1:01 AM

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
        movq    %rdi, %rax
        xorq    %rdi, %rdi
        retq

GCC:
_paravirt_ident_64:
        movq    %rdi, %rax      # tmp85, x
        xorl    %edi, %edi      #
        ret

Does xorl not zero the upper 32b?

I'm thinking no. But it's odd, because both are using %rdi but GCC is only zeroing out the bottom 32-bits. Seems a bit counter-intuitive.

Every write to a 32-bit register on x86-64 zeros bits 63:32 of the register. xorl %edi, %edi has the same behavior as xorq %rdi, %rdi, but is 1 byte shorter to encode.

Oh, interesting! TIL. So it's really not profitable to use xorq at all here...Though it does beg the question of why xorq exists. :-)

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
        movq    %rdi, %rax
        xorq    %rdi, %rdi
        retq

GCC:
_paravirt_ident_64:
        movq    %rdi, %rax      # tmp85, x
        xorl    %edi, %edi      #
        ret

Does xorl not zero the upper 32b?

I'm thinking no. But it's odd, because both are using %rdi but GCC is only zeroing out the bottom 32-bits. Seems a bit counter-intuitive.

Every write to a 32-bit register on x86-64 zeros bits 63:32 of the register. xorl %edi, %edi has the same behavior as xorq %rdi, %rdi, but is 1 byte shorter to encode.

Oh, interesting! TIL. So it's really not profitable to use xorq at all here...Though it does beg the question of why xorq exists. :-)

xorq is still useful when the registers are different.

void updated this revision to Diff 405897.Feb 4 2022, 2:15 AM

Only need to zero out 32-bit registers with xor.

Rather interesting function to have problems with as a result of this patch but it seems like this function is being used in a very specific way further down the file with the __PV_IS_CALLEE_SAVE macro.

AFAICT, __PV_IS_CALLEE_SAVE is just bogus. It appears to be used in order to promise that a given function preserves all GPR registers (including those registers that are normally caller-saved), as an optimized alternative to generating a wrapper for the function with PC_CALLEE_SAVE_REGS_THUNK, to save the extra registers. That would be fine if the functions it was making the promise about were written in asm, or the compiler was informed about the adjusted calling convention expectation...

However, it's being applied to functions which are implemented in C with no special calling-convention attributes on them . That's a promise that simply cannot be upheld -- I mean, I guess it happens to work currently, but that's super-fragile.

Also the whole mechanism (including the asm thunk) should be unnecessary in Clang, as it provides an attribute you can put on a function to use a calling convention that'll preserve all GPRs except r11. I imagine the call locations could easily be adjusted to avoid needing to have r11 preserved, if that's not already the case. https://clang.llvm.org/docs/AttributeReference.html#preserve-most

jyknight added inline comments.Feb 4 2022, 7:19 AM
clang/include/clang/Basic/AttrDocs.td
6278

I think we need to define "call-used" here. It's not a very common name for this concept, and it's especially confusing because of the double-use of the word "used". I was really confused by the definition of the "all" choice when first reading it -- thinking "Wait, 'all' also only clears the registers which were used? How's that different than 'used', then? Ooooohhhh, 'call-used' doesn't mean it IS USED in the call, only that it's not guaranteed to be PRESERVED."

Suggestion for text:
The term "call-used" means registers which are not guaranteed to be preserved unchanged for the caller by the current calling convention. This could also be described as "caller-saved" or "not callee-saved".

void updated this revision to Diff 406596.Feb 7 2022, 1:43 PM
void marked an inline comment as done.

Better define what we mean by "call-used".

nickdesaulniers accepted this revision.Feb 7 2022, 1:54 PM

I tested the following linux-next configs with CONFIG_ZERO_CALL_USED_REGS enabled:

  • x86_64 defconfig
  • x86_64 defconfig + thin LTO
  • x86_64 defconfig + full LTO
  • i386 defconfig

All built+boot. I know @nathanchance pointed out an issue with some already complex code, but I'm of the opinion that fn should be attributed in kernel sources.

Just some minor nits and questions. It would be good to get additional explicit acceptance from someone more familiar with x86 than me before merging.

clang/test/CodeGen/zero-call-used-regs.c
2–3

If you use --check-prefixes=<unique>,<non-unique> i.e.

--check-prefixes=CHECK-USED-GPR-ARG,CHECK

then you could combine a whole lot of the below CHECK-<unique> into un-suffixed CHECK:.

i.e.

// CHECK-SKIP:               attributes #[[ATTR_NUM_SKIP]] = {{.*}} "zero-call-used-regs"="skip"

could become

// CHECK:               attributes #[[ATTR_NUM_SKIP]] = {{.*}} "zero-call-used-regs"="skip"

for all -fzero-call-used-regs= tests.

Really, the behavior is unique only for the unattributed @no_attribute; this test is much larger than it needs to be.

llvm/include/llvm/Support/CodeGen.h
74–77

Drop the L suffix if these are unsigned (int).

81

drop L suffix

llvm/lib/CodeGen/PrologEpilogInserter.cpp
1203–1205

Do we need to check ENABLED? I'm curious if ZeroCallUsedRegsKind::All will do anything here when set? Looks like no; I think it should be doing something?

llvm/lib/Target/X86/X86FrameLowering.cpp
523

sink the declaration of ST closer to its use (L555?)

576

looks like this loop could be fused with the below loop. X86::XOR32rr is the XorOp.

582–584

technically this case is handled by the

} else {
  continue;

L614

595

Is there any way to figure the XorOp outside of this loop? Seems unfortunate to repeatedly scan almost every register class for every used register.

Like instead of querying each register set whether a given register is in it, is it possible to ask a register what register class it's in? Or can a register belong to more than one register class?

llvm/lib/Target/X86/X86RegisterInfo.cpp
637

I thought the 64b x86 cc used rdi, rsi, rdx, rcx, r8, r9 as registers for arguments?

650–651

so ANY of the X86::VR128RegClass are argument registers?

llvm/test/CodeGen/X86/zero-call-used-regs.ll
187–201

is this how we clear the x87 stack?

288

necessary?

llvm/utils/TableGen/CodeGenRegisters.h
585–586

I know the existing code uses the pre-c++11 typedef, but could we please use the more modern type alias here?
https://en.cppreference.com/w/cpp/language/type_alias

using RCatKeyMap = std::map<CodeGenRegisterClass::Key, CodeGenRegisterCategory *>;
llvm/utils/TableGen/RegisterInfoEmitter.cpp
1633–1649

replace auto w/ explicit type

This revision is now accepted and ready to land.Feb 7 2022, 1:54 PM
void updated this revision to Diff 406646.Feb 7 2022, 4:20 PM
void marked 4 inline comments as done.

General updates NFC..

void added a comment.Feb 7 2022, 4:20 PM

Weird. We generate similar code to GCC:

Clang:
_paravirt_ident_64:                     # @_paravirt_ident_64
        movq    %rdi, %rax
        xorq    %rdi, %rdi
        retq

GCC:
_paravirt_ident_64:
        movq    %rdi, %rax      # tmp85, x
        xorl    %edi, %edi      #
        ret

Does xorl not zero the upper 32b?

I'm thinking no. But it's odd, because both are using %rdi but GCC is only zeroing out the bottom 32-bits. Seems a bit counter-intuitive.

Every write to a 32-bit register on x86-64 zeros bits 63:32 of the register. xorl %edi, %edi has the same behavior as xorq %rdi, %rdi, but is 1 byte shorter to encode.

Oh, interesting! TIL. So it's really not profitable to use xorq at all here...Though it does beg the question of why xorq exists. :-)

xorq is still useful when the registers are different.

So xorl %ecx, %edx doesn't zero out all 64-bits of %rcx and %rdx? That's two 32-bit writes to two different registers, isn't it?

clang/test/CodeGen/zero-call-used-regs.c
2–3

The tests don't have exactly the same output. The attributes differ between them.

llvm/lib/CodeGen/PrologEpilogInserter.cpp
1203–1205

I modified it to set All to 0 instead of ENABLED.

llvm/lib/Target/X86/X86FrameLowering.cpp
523

I guess.

595

There's a function in TRI that you can call to grab the RegClass of a register, but it calls a "contains" on each register class to see if it belongs in it, so it would be worse than this code.

In practice, the register classes won't have many members in it. It sucks, but it's probably something like 16*16 in a worst case scenario.

(I think registers can belong to multiple register classes (e.g. sub- and super-classes), but I don't quote me on that.)

llvm/lib/Target/X86/X86RegisterInfo.cpp
637

That's captured in the if-then below this one.

650–651

[XYZ]MM0-7 are apparently. I'll make it explicit.

llvm/test/CodeGen/X86/zero-call-used-regs.ll
187–201

Yes.

288

It's main. Of course it's necessary. :-) (And I want to check main explicitly.)

nickdesaulniers added inline comments.Feb 7 2022, 4:34 PM
llvm/test/CodeGen/X86/zero-call-used-regs.ll
288

specifically can the store volatile i32 2, i32* @result, align 4 be removed?

All built+boot. I know @nathanchance pointed out an issue with some already complex code, but I'm of the opinion that fn should be attributed in kernel sources.

Aside from that one diff I posted above adding __attribute__((zero_call_used_regs("skip"))) to _paravirt_ident_64(), I have not noticed any issues with diff 406596 on my two test machines; they boot right up and I am currently running them through my usual build tests. I will comment back if anything bad turns up.

So xorl %ecx, %edx doesn't zero out all 64-bits of %rcx and %rdx? That's two 32-bit writes to two different registers, isn't it?

xorl %ecx, %edx only zero out bit 63:32 of rdx.

  1. There's only 1 write to register in the instruction, i.e. %edx;
  2. As a src, none bit of %ecx will be changed after the instruction;
  3. xorl is not a zeroing instruction, bit 31:0 of rdx happens to zero only if %ecx == %edx ;

So the values after the instuction are:

RCX = RCX_old
RDX[63:32] = 0
RDX[31:0] = RCX_old[31:0] ^ RDX_old[31:0]
kees added a comment.Feb 8 2022, 12:39 AM

I can build and boot with this. Nice! :) One issue I see is in instruction sequence ordering.

Looking at the end of __startup_64 without the feature enabled, everything looks "normal":

31 c0      xor    %eax,%eax
5b         pop    %rbx
41 5e      pop    %r14
41 5f      pop    %r15
5d         pop    %rbp
c3         ret

with -fzero-call-used-regs=used-gpr:

31 c0      xor    %eax,%eax
31 c9      xor    %ecx,%ecx
31 ff      xor    %edi,%edi
31 d2      xor    %edx,%edx
31 f6      xor    %esi,%esi
45 31 c0   xor    %r8d,%r8d
45 31 c9   xor    %r9d,%r9d
45 31 d2   xor    %r10d,%r10d
45 31 db   xor    %r11d,%r11d
5b         pop    %rbx
41 5e      pop    %r14
41 5f      pop    %r15
5d         pop    %rbp
c3         ret

The registers are being wiped, and that's what's needed for the "reducing data lifetime" part of this feature, but because of the ordering, it is not really making ROP attacks harder since all the pop instructions are available before the ret. I would have expected:

31 c0      xor    %eax,%eax
5b         pop    %rbx
41 5e      pop    %r14
41 5f      pop    %r15
5d         pop    %rbp
31 c9      xor    %ecx,%ecx
31 ff      xor    %edi,%edi
31 d2      xor    %edx,%edx
31 f6      xor    %esi,%esi
45 31 c0   xor    %r8d,%r8d
45 31 c9   xor    %r9d,%r9d
45 31 d2   xor    %r10d,%r10d
45 31 db   xor    %r11d,%r11d
c3         ret

And looking at the results with GCC, that's effectively the case. (They're still using xor+mov, but I'm expecting them to switch to all-xor.)

31 c0      xor    %eax,%eax
5b         pop    %rbx
5d         pop    %rbp
41 5c      pop    %r12
31 d2      xor    %edx,%edx
89 d1      mov    %edx,%ecx
89 d6      mov    %edx,%esi
89 d7      mov    %edx,%edi
41 89 d0   mov    %edx,%r8d
41 89 d1   mov    %edx,%r9d
41 89 d2   mov    %edx,%r10d
41 89 d3   mov    %edx,%r11d
c3         ret

With that swapped around, this will be looking great! :)

Some more details about the anti-ROP-ness are here:
https://git.kernel.org/linus/a82adfd5c7cb4b8bb37ef439aed954f9972bb618

nickdesaulniers added inline comments.Feb 8 2022, 9:09 AM
llvm/lib/Target/X86/X86FrameLowering.cpp
595

Right, I just get a sinking feeling that we're going to repeatedly scan these RegClass lists for every MachineFunction, when the answer doesn't change; we should eventually be able to map these O(1).

llvm/lib/Target/X86/X86RegisterInfo.cpp
629–633

Do we need to clear stack slots for i386?

The Linux kernel uses -mreg-param=3 to use a faster, though custom calling convention. This corresponds to the inreg parameter attribute in LLVM IR.

Otherwise, perhaps a todo and diagnose -m32 in the front end?

void added inline comments.Feb 8 2022, 12:11 PM
llvm/lib/Target/X86/X86FrameLowering.cpp
595

That could be a separate change, since it's incidental to this feature. We could create a map of this information, which should help performance a lot.

llvm/lib/Target/X86/X86RegisterInfo.cpp
629–633

Clearing stack slots for i386 isn't done by GCC (https://godbolt.org/z/af8e61d3q). I think we can omit doing that.

void updated this revision to Diff 406924.Feb 8 2022, 12:12 PM

Move register clearning to after the "pop"s.

void updated this revision to Diff 406948.Feb 8 2022, 1:19 PM

Remove obsolete change.

This revision was landed with ongoing or failed builds.Feb 8 2022, 5:43 PM
This revision was automatically updated to reflect the committed changes.
nikic added a subscriber: nikic.Feb 9 2022, 12:54 AM

It looks like this breaks expensive checks builds due to machine verifier errors: https://lab.llvm.org/buildbot/#/builders/16/builds/24103

void added a comment.Feb 9 2022, 1:04 AM

It looks like this breaks expensive checks builds due to machine verifier errors: https://lab.llvm.org/buildbot/#/builders/16/builds/24103

Sorry about that. Looking into it now.

@void This buildbot appears to be still broken due to this change: https://lab.llvm.org/buildbot/#/builders/110/builds/10271

void added a comment.Feb 10 2022, 10:26 PM

@void This buildbot appears to be still broken due to this change: https://lab.llvm.org/buildbot/#/builders/110/builds/10271

Strange...This is the first time seeing something like this. I'll fix it quickly.

void added a comment.Feb 11 2022, 10:15 AM

@void This buildbot appears to be still broken due to this change: https://lab.llvm.org/buildbot/#/builders/110/builds/10271

Strange...This is the first time seeing something like this. I'll fix it quickly.

It's fixed now. Sorry about the failure!

@void It would be cool to mention this new feature in release notes.

void added a comment.Feb 11 2022, 1:47 PM

@void It would be cool to mention this new feature in release notes.

Good idea! https://reviews.llvm.org/D119592

ychen added a subscriber: ychen.Sep 13 2022, 9:17 AM
Herald added a project: Restricted Project. · View Herald TranscriptSep 13 2022, 9:17 AM