This is an archive of the discontinued LLVM Phabricator instance.

[CodeGen] Zero out only modified registers
Abandoned · Public

Authored by void on Aug 17 2022, 2:40 PM.

Details

Summary

Zeroing out used, but not modified, registers can destroy the contents
of non-volatile registers. This can lead to programs crashing in
unpredictable ways.

Because of this, I'm wary of the -fzero-call-used-regs feature with the
"all-*" options. It would have to be used with great care, which means
it probably has an extremely limited use.

Link: https://github.com/KSPP/linux/issues/192

Diff Detail

Event Timeline

void created this revision. Aug 17 2022, 2:40 PM
Herald added a project: Restricted Project. Aug 17 2022, 2:40 PM
void requested review of this revision. Aug 17 2022, 2:40 PM
void edited the summary of this revision. (Show Details) Aug 17 2022, 2:46 PM

Zeroing out used, but not modified, registers can destroy the contents of non-volatile registers

Do you have a testcase? All the changes to the regression tests involve argument registers, which are not preserved across calls. (At least, by default; -mllvm -enable-ipra exists, but it's disabled at all optimization levels.)

void added a comment. Aug 17 2022, 3:52 PM

Zeroing out used, but not modified, registers can destroy the contents of non-volatile registers

Do you have a testcase? All the changes to the regression tests involve argument registers, which are not preserved across calls. (At least, by default; -mllvm -enable-ipra exists, but it's disabled at all optimization levels.)

I do, but it's rather large. I'll see if I can trim it down a bit.

void updated this revision to Diff 456790. Aug 30 2022, 2:04 PM

Update testcase.

void added a comment. Aug 30 2022, 2:09 PM

Zeroing out used, but not modified, registers can destroy the contents of non-volatile registers

Do you have a testcase? All the changes to the regression tests involve argument registers, which are not preserved across calls. (At least, by default; -mllvm -enable-ipra exists, but it's disabled at all optimization levels.)

I think this should be done now. PTAL.

Can you give an example of a function that is miscompiled without your patch, and is not miscompiled with your patch? I don't see anything obviously wrong with the way we're compiling the given testcases without the patch.

llvm/test/CodeGen/X86/zero-call-used-regs.ll
26–38

consider adding the nounwind fn attr to your test functions to eliminate all these obnoxious/verbose CFI directives stemming from -fasynchronous-unwind-tables. That should help with readability of the tests significantly.

void added inline comments. Aug 31 2022, 10:58 AM
llvm/test/CodeGen/X86/zero-call-used-regs.ll
26–38

It didn't work. :-/

void updated this revision to Diff 457028. Aug 31 2022, 11:06 AM

Remove uwtable

void updated this revision to Diff 457336. Sep 1 2022, 11:27 AM

Update testcase.

void added a comment. Sep 1 2022, 11:29 AM

Can you give an example of a function that is miscompiled without your patch, and is not miscompiled with your patch? I don't see anything obviously wrong with the way we're compiling the given testcases without the patch.

I believe I have it now. I had to find the issue again... PTAL

llvm/test/CodeGen/X86/zero-call-used-regs.ll
26–38

It didn't work. :-/

It was uwtable. I removed it.

llvm/test/CodeGen/X86/zero-call-used-regs.ll
7

I like that the existing test cases are concise; it makes it easier to understand the behavior for the many different values of zero-call-used-regs.

Rather than update all of the existing tests, would you mind adding your new test case as a new test fn to the file? I suspect since this is affecting the kernel, we'd only need to test used-gpr?

That should significantly reduce the diffstat to this test file to help reviewers better understand the change.

26–38

ah, I'll keep an eye out for that fn attr in the future, too. Thanks for pinpointing it.

Can you give an example of a function that is miscompiled without your patch, and is not miscompiled with your patch? I don't see anything obviously wrong with the way we're compiling the given testcases without the patch.

I believe I have it now. I had to find the issue again... PTAL

What am I looking for? I still don't see anything obviously wrong.

void added a comment. Sep 1 2022, 12:17 PM

Can you give an example of a function that is miscompiled without your patch, and is not miscompiled with your patch? I don't see anything obviously wrong with the way we're compiling the given testcases without the patch.

I believe I have it now. I had to find the issue again... PTAL

What am I looking for? I still don't see anything obviously wrong.

We shouldn't be zeroing out non-volatile registers that aren't modified by the function. In this case, we shouldn't be zeroing %rdi. Note that %rsi is marked as "modified" by the asm statements.

we shouldn't be zeroing %rdi

I don't see how it's a problem if we zero rdi. Even if we don't modify rdi, the caller doesn't have any way of knowing it wasn't modified. The calling convention says it's caller-save.

What am I looking for? I still don't see anything obviously wrong.

We shouldn't be zeroing out non-volatile registers that aren't modified by the function. In this case, we shouldn't be zeroing %rdi. Note that %rsi is marked as "modified" by the asm statements.

rsi and rdi are not in any of the clobber lists...or is that what registers are being allocated for the inline asm inputs/outputs? Perhaps a MIR test would make that more obvious? Or an explicit clobber in these test cases?

void added a comment. Sep 1 2022, 3:00 PM

we shouldn't be zeroing %rdi

I don't see how it's a problem if we zero rdi. Even if we don't modify rdi, the caller doesn't have any way of knowing it wasn't modified. The calling convention says it's caller-save.

This is just a test case. It's not meant to be a definitive reproduction of the original bug, just to show the same symptoms.

void added a comment. Sep 1 2022, 3:02 PM

What am I looking for? I still don't see anything obviously wrong.

We shouldn't be zeroing out non-volatile registers that aren't modified by the function. In this case, we shouldn't be zeroing %rdi. Note that %rsi is marked as "modified" by the asm statements.

rsi and rdi are not in any of the clobber lists...or is that what registers are being allocated for the inline asm inputs/outputs? Perhaps a MIR test would make that more obvious? Or an explicit clobber in these test cases?

Right, they aren't. Perhaps I should just write a testcase that uses inline asm and a clobber list instead of trying to scrape together something from the Linux sources.

we shouldn't be zeroing %rdi

I don't see how it's a problem if we zero rdi. Even if we don't modify rdi, the caller doesn't have any way of knowing it wasn't modified. The calling convention says it's caller-save.

This is just a test case. It's not meant to be a definitive reproduction of the original bug, just to show the same symptoms.

The thing is, I don't see the connection between this "symptom" and any possible bug. rdi is callee-clobbered, and the caller has no way of knowing whether the implementation actually writes to rdi, therefore it can't be a bug to add a write to rdi.

void added a comment (edited). Sep 1 2022, 3:35 PM

we shouldn't be zeroing %rdi

I don't see how it's a problem if we zero rdi. Even if we don't modify rdi, the caller doesn't have any way of knowing it wasn't modified. The calling convention says it's caller-save.

This is just a test case. It's not meant to be a definitive reproduction of the original bug, just to show the same symptoms.

The thing is, I don't see the connection between this "symptom" and any possible bug. rdi is callee-clobbered, and the caller has no way of knowing whether the implementation actually writes to rdi, therefore it can't be a bug to add a write to rdi.

I understand. Here's an example of what's going on, at least with the Linux kernel. A function is defined like this:

u64 __attribute__((no_instrument_function)) _paravirt_ident_64(u64 x)
{
        return x;
}

/* ... */

struct paravirt_patch_template pv_ops = {
  /* ... */
  .mmu.pmd_val = ((struct paravirt_callee_save) { _paravirt_ident_64 }),
  /* ... */
};

The generated code is like this:

        .text
        .globl  _paravirt_ident_64              # -- Begin function _paravirt_ident_64
        .p2align        4, 0x90
        .type   _paravirt_ident_64,@function
_paravirt_ident_64:                     # @_paravirt_ident_64
.Lfunc_begin2:
        .loc    0 85 0                          # arch/x86/kernel/paravirt.c:85:0
        .cfi_startproc
# %bb.0:
        #DEBUG_VALUE: _paravirt_ident_64:x <- $rdi
        movq    %rdi, %rax
.Ltmp22:
        .loc    0 86 2 prologue_end             # arch/x86/kernel/paravirt.c:86:2
        xorl    %edi, %edi
.Ltmp23:
        #DEBUG_VALUE: _paravirt_ident_64:x <- $rax
        jmp     __x86_return_thunk              # TAILCALL
.Ltmp24:
.Lfunc_end2:
        .size   _paravirt_ident_64, .Lfunc_end2-_paravirt_ident_64
        .cfi_endproc

As you mentioned, this *should* be okay, as %rdi is caller-saved. However, when it's used in a call, like this:

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) pmdval_t
pmd_val(pmd_t pmd)
{
        return ({
                unsigned long __edi = __edi, __esi = __esi, __edx = __edx,
                              __ecx = __ecx, __eax = __eax;
                ;
                ((void)pv_ops.mmu.pmd_val.func);
                asm volatile("# ALT: oldnstr\n"
                             "661:\n\t"
                             "771:\n\t"
                             "999:\n\t"
                             ".pushsection .discard.retpoline_safe\n\t"
                              ".quad  999b\n\t"
                             ".popsection\n\t"
                             "call *%[paravirt_opptr];\n772:\n"
                             ".pushsection .parainstructions,\"a\"\n"
                              ".balign 8 \n"
                              ".quad  771b\n"
                             "  .byte "
                             "%c[paravirt_typenum]\n"
                             "  .byte 772b-771b\n"
                             "  .short %c[paravirt_clobber]\n"
                             ".popsection\n"
                             "\n662:\n"
                             "# ALT: padding\n"
                             ".skip -(((6651f-6641f)-(662b-661b)) > 0) * ((6651f-6641f)-(662b-661b)),0x90\n"
                             "663:\n"
                             ".pushsection .altinstructions,\"a\"\n"
                             " .long 661b - .\n"
                             " .long 6641f - .\n"
                             " .word ((( 8*32+16)) | (1 << 15))\n"
                             " .byte 663b-661b\n"
                             " .byte 6651f-6641f\n"
                             ".popsection\n"
                             ".pushsection .altinstr_replacement, \"ax\"\n"
                             "# ALT: replacement 1\n"
                             "6641:\n\t"
                             "mov %%rdi, %%rax\n"
                             "6651:\n"
                             ".popsection\n"
                             : "=a"(__eax), "+r"(current_stack_pointer)
                             : [paravirt_typenum] "i"(
                                       (__builtin_offsetof(
                                                struct paravirt_patch_template,
                                                mmu.pmd_val.func) /
                                        sizeof(void *))),
                               [paravirt_opptr] "m"(pv_ops.mmu.pmd_val.func),
                               [paravirt_clobber] "i"(((1 << 0))),
                               "D"((unsigned long)(pmd.pmd))
                             : "memory", "cc");
                ({
                        unsigned long __mask = ~0UL;
                        do {
                                __attribute__((__noreturn__)) extern void
                                __compiletime_assert_29(void) __attribute__((__error__(
                                        "BUILD_BUG_ON failed: "
                                        "sizeof(pmdval_t) > sizeof(unsigned long)")));
                                if (!(!(sizeof(pmdval_t) >
                                        sizeof(unsigned long))))
                                        __compiletime_assert_29();
                        } while (0);
                        switch (sizeof(pmdval_t)) {
                        case 1:
                                __mask = 0xffUL;
                                break;
                        case 2:
                                __mask = 0xffffUL;
                                break;
                        case 4:
                                __mask = 0xffffffffUL;
                                break;
                        default:
                                break;
                        }
                        __mask &__eax;
                });
        });
}

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) pmdval_t
native_pmd_val(pmd_t pmd)
{
        return pmd.pmd;
}       

static inline __attribute__((__gnu_inline__)) __attribute__((__unused__))
__attribute__((no_instrument_function)) int
pmd_none(pmd_t pmd)
{
        unsigned long val = native_pmd_val(pmd);
        return (val & ~((((pteval_t)(1)) << 6) | (((pteval_t)(1)) << 5))) == 0;
}


int pmd_huge(pmd_t pmd)
{
        return !pmd_none(pmd) && (pmd_val(pmd) & ((((pteval_t)(1)) << 0) |
                                                  (((pteval_t)(1)) << 7))) !=
                                         (((pteval_t)(1)) << 0);
}

The code generated is something like this:

        .globl  pmd_huge                        # -- Begin function pmd_huge
        .p2align        4, 0x90
        .type   pmd_huge,@function
pmd_huge:                               # @pmd_huge
.Lfunc_begin0:
        .loc    0 28 0                          # arch/x86/mm/hugetlbpage.c:28:0
        .cfi_sections .debug_frame
        .cfi_startproc
# %bb.0:
        callq   __fentry__
.Ltmp1:
        #DEBUG_VALUE: pmd_huge:pmd <- $rdi
        #DEBUG_VALUE: pmd_none:pmd <- $rdi
        #DEBUG_VALUE: pmd_none:val <- $rdi
        .file   161 "./arch/x86/include/asm" "pgtable.h" md5 0x203fee1b33aab4131e5eb70cbeeb9a3d
        .loc    161 793 41 prologue_end         # ./arch/x86/include/asm/pgtable.h:793:41
        testq   $-97, %rdi
.Ltmp2:
        .loc    0 29 24                         # arch/x86/mm/hugetlbpage.c:29:24
        je      .LBB0_1
.Ltmp3:
# %bb.2:
        #DEBUG_VALUE: pmd_huge:pmd <- $rdi
        #DEBUG_VALUE: pmd_val:pmd <- $rdi
        #DEBUG_VALUE: __edi <- undef
        #DEBUG_VALUE: __esi <- undef
        #DEBUG_VALUE: __edx <- undef
        #DEBUG_VALUE: __ecx <- undef
        #DEBUG_VALUE: __eax <- undef
        .file   162 "./arch/x86/include/asm" "paravirt.h" md5 0x1f771689a20be93009d7ef8574461232
        .loc    162 458 9                       # ./arch/x86/include/asm/paravirt.h:458:9
        #APP
        # ALT: oldnstr
.Ltmp4:
.Ltmp5:
.Ltmp6:
        .section        .discard.retpoline_safe,"",@progbits
        .quad   .Ltmp6
        .text

        callq   *pv_ops+536(%rip)

.Ltmp7:
        .section        .parainstructions,"a",@progbits
        .p2align        3, 0x0
        .quad   .Ltmp5
        .byte   67
        .byte   .Ltmp7-.Ltmp5
        .short  1
        .text


.Ltmp8:
        # ALT: padding
        .zero   (-(((.Ltmp9-.Ltmp10)-(.Ltmp8-.Ltmp4))>0))*((.Ltmp9-.Ltmp10)-(.Ltmp8-.Ltmp4)),144
.Ltmp11:
        .section        .altinstructions,"a",@progbits
.Ltmp12:
        .long   .Ltmp4-.Ltmp12
.Ltmp13:
        .long   .Ltmp10-.Ltmp13
        .short  33040
        .byte   .Ltmp11-.Ltmp4
        .byte   .Ltmp9-.Ltmp10
        .text

        .section        .altinstr_replacement,"ax",@progbits
        # ALT: replacement 1
.Ltmp10:
        movq    %rdi, %rax
.Ltmp9:
        .text


        #NO_APP
        movq    %rax, %rcx
.Ltmp14:
        #DEBUG_VALUE: __eax <- $rax
        .loc    0 30 17                         # arch/x86/mm/hugetlbpage.c:30:17
        andl    $129, %ecx
        .loc    0 30 46 is_stmt 0               # arch/x86/mm/hugetlbpage.c:30:46
        xorl    %eax, %eax
.Ltmp15:
        cmpl    $1, %ecx
        setne   %al
        jmp     .LBB0_3
.Ltmp16:
.LBB0_1:
        #DEBUG_VALUE: pmd_huge:pmd <- $rdi
        #DEBUG_VALUE: pmd_none:pmd <- $rdi
        #DEBUG_VALUE: pmd_none:val <- $rdi
        .loc    0 0 46                          # arch/x86/mm/hugetlbpage.c:0:46
        xorl    %eax, %eax
.Ltmp17:
.LBB0_3:
        #DEBUG_VALUE: pmd_huge:pmd <- $rdi
        .loc    0 29 2 is_stmt 1                # arch/x86/mm/hugetlbpage.c:29:2
        xorl    %ecx, %ecx
        xorl    %edi, %edi
.Ltmp18:
        #DEBUG_VALUE: pmd_huge:pmd <- [DW_OP_LLVM_entry_value 1] $rdi
        jmp     __x86_return_thunk              # TAILCALL
.Ltmp19:
.Lfunc_end0:
        .size   pmd_huge, .Lfunc_end0-pmd_huge
        .cfi_endproc
                                        # -- End function

The important bit is the movq %rdi, %rax right after the callq *pv_ops+536(%rip) instruction. The %rdi register isn't preserved across the call; it's zeroed out, so the return value, which pmd_val had already placed in %rax, is overwritten with zero. My thinking was that we shouldn't zero out registers that aren't modified in the called function, because the caller doesn't expect them to be stomped on. Perhaps there's something else going on?

The important bit is the movq %rdi, %rax right after the callq *pv_ops+536(%rip) instruction.

I read that clump of assembly as "if paravirtualization is on, call *pv_ops+536(%rip); if it's off, replace the call with movq %rdi, %rax". There isn't any expectation that RDI is preserved.

At least, that's my understanding of the intent. It looks like the asm doesn't mark up its outputs/clobbers correctly, though; it needs to mark all the registers the calling convention says are caller-saved (rcx, rdx, rdi, rsi, r8, r9, r10, r11) as outputs or clobbers. Not sure if that ends up mattering here.

void added a comment. Sep 1 2022, 4:24 PM

The important bit is the movq %rdi, %rax right after the callq *pv_ops+536(%rip) instruction.

I read that clump of assembly as "if paravirtualization is on, call *pv_ops+536(%rip); if it's off, replace the call with movq %rdi, %rax". There isn't any expectation that RDI is preserved.

At least, that's my understanding of the intent. It looks like the asm doesn't mark up its outputs/clobbers correctly, though; it needs to mark all the registers the calling convention says are caller-saved (rcx, rdx, rdi, rsi, r8, r9, r10, r11) as outputs or clobbers. Not sure if that ends up mattering here.

However, if I comment out xorl %edi, %edi in paravirt_ident_64 the kernel boots up. So *something* is expecting rdi's value to be retained. (This isn't the only calling place, of course.)

Looking at the original source file, is pmd_huge getting inlined? If it is, the missing output/clobber operands probably matter.

void added a comment. Sep 1 2022, 4:44 PM

Looking at the original source file, is pmd_huge getting inlined? If it is, the missing output/clobber operands probably matter.

pmd_huge isn't inlined, but other callers are. Let me see if adding proper output/clobbers helps.

void added a comment. Sep 1 2022, 5:20 PM

Looking at the original source file, is pmd_huge getting inlined? If it is, the missing output/clobber operands probably matter.

When I add a "=D" constraint the kernel boots up...*sigh* That may have been it...

void abandoned this revision. Sep 2 2022, 1:03 PM

Looking at the original source file, is pmd_huge getting inlined? If it is, the missing output/clobber operands probably matter.

When I add a "=D" constraint the kernel boots up...*sigh* That may have been it...

I'm going to abandon this change. It looks like the error is in the Linux kernel itself. Thanks for all of your help!