Tail duplication of a block with an INLINEASM_BR may result in a PHI
node on the indirect branch. This is okay, but it also introduces a copy
for that PHI node *after* the INLINEASM_BR, which is not okay.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Unit Tests
Event Timeline
Hm, it doesn't seem like it should be a problem to duplicate an inlineasm_br instruction. I'm interested to see what's going wrong here.
After sleeping on it, I think it's when a PHI node's created in the indirect destination. We can't handle that, because we assume that any variable used in an indirect path dominates all blocks coming into it. I'll update this bug to reflect that.
I confirmed that this patch does fix the kernel panic observed in https://github.com/ClangBuiltLinux/linux/issues/1125#issuecomment-703890483.
I don't understand the bug though. Looking at the output of:
llc -mtriple=i686-- -stop-after=early-tailduplication -print-after-all llvm/test/CodeGen/X86/tail-dup-asm-goto.ll -o - 2>&1 | less
Before/After IR Dump After Early Tail Duplication before rebuilding clang with this patch applied (so still in a bad state), but nothing looks wrong to me. So what precisely is the bug?
Before early tail dup:
bb.0.bb: successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%) %2:gr32 = IMPLICIT_DEF %0:gr32 = MOV32rm killed %2:gr32, 1, $noreg, 0, $noreg :: (load 4 from `i8** undef`, align 8) %3:gr32_abcd = MOV32r0 implicit-def dead $eflags %4:gr8 = COPY %3.sub_8bit:gr32_abcd TEST8rr %4:gr8, %4:gr8, implicit-def $eflags JCC_1 %bb.2, 5, implicit $eflags JMP_1 %bb.1 bb.1.bb100: ; predecessors: %bb.0 successors: %bb.3(0x80000000); %bb.3(100.00%) %6:gr32 = IMPLICIT_DEF MOV32mi killed %6:gr32, 1, $noreg, 0, $noreg, 0 :: (store 4 into `i8** undef`, align 8) JMP_1 %bb.3 bb.2.bb106: ; predecessors: %bb.0 successors: %bb.3(0x80000000); %bb.3(100.00%) ADJCALLSTACKDOWN32 0, 0, 0, implicit-def dead $esp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $esp, implicit $ssp CALLpcrel32 @foo, <regmask $bh $bl $bp $bph $bpl $bx $di $dih $dil $ebp $ebx $edi $esi $hbp $hbx $hdi $hsi $si $sih $sil>, implicit $esp, implicit $ssp, implicit-def $esp, implicit-def $ssp ADJCALLSTACKUP32 0, 0, implicit-def dead $esp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $esp, implicit $ssp %5:gr32 = MOV32r0 implicit-def dead $eflags bb.3.bb110: ; predecessors: %bb.2, %bb.1 successors: %bb.5(0x80000000), %bb.4(0x00000000); %bb.5(100.00%), %bb.4(0.00%) %1:gr32 = PHI %5:gr32, %bb.2, %0:gr32, %bb.1 INLINEASM_BR &"" [sideeffect] [mayload] [attdialect], $0:[imm], 42, $1:[imm], 0, $2:[imm], blockaddress(@test1, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[ clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags JMP_1 %bb.5 bb.4.bb17.i.i.i (address-taken): ; predecessors: %bb.3 successors: %bb.5(0x80000000); %bb.5(100.00%) bb.5.kmem_cache_has_cpu_partial.exit: ; predecessors: %bb.3, %bb.4 $eax = COPY %1:gr32 RET 0, $eax
After early tail dup:
bb.0.bb: successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%) %2:gr32 = IMPLICIT_DEF %0:gr32 = MOV32rm killed %2:gr32, 1, $noreg, 0, $noreg :: (load 4 from `i8** undef`, align 8) %3:gr32_abcd = MOV32r0 implicit-def dead $eflags %4:gr8 = COPY %3.sub_8bit:gr32_abcd TEST8rr %4:gr8, %4:gr8, implicit-def $eflags JCC_1 %bb.2, 5, implicit $eflags JMP_1 %bb.1 bb.1.bb100: ; predecessors: %bb.0 successors: %bb.5(0x80000000), %bb.4(0x00000000); %bb.5(100.00%), %bb.4(0.00%) %6:gr32 = IMPLICIT_DEF MOV32mi killed %6:gr32, 1, $noreg, 0, $noreg, 0 :: (store 4 into `i8** undef`, align 8) INLINEASM_BR &"" [sideeffect] [mayload] [attdialect], $0:[imm], 42, $1:[imm], 0, $2:[imm], blockaddress(@test1, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags JMP_1 %bb.5 bb.2.bb106: ; predecessors: %bb.0 successors: %bb.5(0x80000000), %bb.4(0x00000000); %bb.5(100.00%), %bb.4(0.00%) ADJCALLSTACKDOWN32 0, 0, 0, implicit-def dead $esp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $esp, implicit $ssp CALLpcrel32 @foo, <regmask $bh $bl $bp $bph $bpl $bx $di $dih $dil $ebp $ebx $edi $esi $hbp $hbx $hdi $hsi $si $sih $sil>, implicit $esp, implicit $ssp, implicit-def $esp, implicit-def $ssp ADJCALLSTACKUP32 0, 0, implicit-def dead $esp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $esp, implicit $ssp %5:gr32 = MOV32r0 implicit-def dead $eflags INLINEASM_BR &"" [sideeffect] [mayload] [attdialect], $0:[imm], 42, $1:[imm], 0, $2:[imm], blockaddress(@test1, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags JMP_1 %bb.5 bb.4.bb17.i.i.i (address-taken): ; predecessors: %bb.1, %bb.2 successors: %bb.5(0x80000000); %bb.5(100.00%) %9:gr32 = PHI %0:gr32, %bb.1, %5:gr32, %bb.2 bb.5.kmem_cache_has_cpu_partial.exit: ; predecessors: %bb.4, %bb.1, %bb.2 %10:gr32 = PHI %9:gr32, %bb.4, %0:gr32, %bb.1, %5:gr32, %bb.2 $eax = COPY %10:gr32 RET 0, $eax
The comment
We assume that all values used on the indirect path
dominates all paths into the indirect block, i.e. don't have PHI nodes.
Doesn't make sense to me. Which PHI node in the above post-early-tail-dup is problematic?
I wonder if there's perhaps a later pass that's messing up, and this change is simply hiding that? Or if duplicating the INLINEASM_BR results in multiple side effects? (Even if the code is never run, the kernel uses .pushsection to squirrel away data in these, so tail duplication is duplicating the data. Is that a problem? Not sure yet.)
Note that we have a PHI node here that passes through the indirect branch (starting at bb.4.bb17.i.i.i) and used in bb.5.kmem_cache_has_cpu_partial.exit.
INLINEASM_BR &"" [sideeffect] [mayload] [attdialect], $0:[imm], 42, $1:[imm], 0, $2:[imm], blockaddress(@test1, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags
JMP_1 %bb.5
The duplication of bb.3.bb110 has the effect that the PHI is now pushed down into the indirect destination block. See below.
bb.4.bb17.i.i.i (address-taken):
; predecessors: %bb.3successors: %bb.5(0x80000000); %bb.5(100.00%)bb.5.kmem_cache_has_cpu_partial.exit:
; predecessors: %bb.3, %bb.4$eax = COPY %1:gr32 RET 0, $eaxAfter early tail dup:bb.0.bb:
successors: %bb.1(0x40000000), %bb.2(0x40000000); %bb.1(50.00%), %bb.2(50.00%) %2:gr32 = IMPLICIT_DEF %0:gr32 = MOV32rm killed %2:gr32, 1, $noreg, 0, $noreg :: (load 4 from `i8** undef`, align 8) %3:gr32_abcd = MOV32r0 implicit-def dead $eflags %4:gr8 = COPY %3.sub_8bit:gr32_abcd TEST8rr %4:gr8, %4:gr8, implicit-def $eflags JCC_1 %bb.2, 5, implicit $eflags JMP_1 %bb.1bb.1.bb100:
; predecessors: %bb.0successors: %bb.5(0x80000000), %bb.4(0x00000000); %bb.5(100.00%), %bb.4(0.00%) %6:gr32 = IMPLICIT_DEF MOV32mi killed %6:gr32, 1, $noreg, 0, $noreg, 0 :: (store 4 into `i8** undef`, align 8) INLINEASM_BR &"" [sideeffect] [mayload] [attdialect], $0:[imm], 42, $1:[imm], 0, $2:[imm], blockaddress(@test1, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags JMP_1 %bb.5bb.2.bb106:
; predecessors: %bb.0successors: %bb.5(0x80000000), %bb.4(0x00000000); %bb.5(100.00%), %bb.4(0.00%) ADJCALLSTACKDOWN32 0, 0, 0, implicit-def dead $esp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $esp, implicit $ssp CALLpcrel32 @foo, <regmask $bh $bl $bp $bph $bpl $bx $di $dih $dil $ebp $ebx $edi $esi $hbp $hbx $hdi $hsi $si $sih $sil>, implicit $esp, implicit $ssp, implicit-def $esp, implicit-def $ssp ADJCALLSTACKUP32 0, 0, implicit-def dead $esp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $esp, implicit $ssp %5:gr32 = MOV32r0 implicit-def dead $eflags INLINEASM_BR &"" [sideeffect] [mayload] [attdialect], $0:[imm], 42, $1:[imm], 0, $2:[imm], blockaddress(@test1, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags JMP_1 %bb.5
The PHI node that was in bb.3.bb110 was moved back into the predecessors: %6:gr32 = IMPLICIT_DEF and %5:gr32 = MOV32r0 implicit-def dead $eflags. (Note that %5.gr32 ... is now with the INLINEASM_BR due to the predecessor and bb.3.bb110 merging.) This doesn't resolve the PHI node though, so it ends up in the indirect destination block (bb.4.bb17.i.i.i). Those values also end up in the PHI node in the default destination.
bb.4.bb17.i.i.i (address-taken):
; predecessors: %bb.1, %bb.2successors: %bb.5(0x80000000); %bb.5(100.00%) %9:gr32 = PHI %0:gr32, %bb.1, %5:gr32, %bb.2
The problem is this PHI here. We don't have a way to resolve PHI nodes in the entry block to an indirect destination. (This is the reason we don't allow "asm goto" outputs on the indirect paths.) When we go to resolve this PHI node, there's no place to put the put the values that works in all situations; we can't split the critical edges and it's not valid to place the resolved instructions at the ends of the predecessor blocks.
bb.5.kmem_cache_has_cpu_partial.exit:
; predecessors: %bb.4, %bb.1, %bb.2%10:gr32 = PHI %9:gr32, %bb.4, %0:gr32, %bb.1, %5:gr32, %bb.2 $eax = COPY %10:gr32 RET 0, $eaxThe comment > We assume that all values used on the indirect path > dominates all paths into the indirect block, i.e. don't have PHI nodes. Doesn't make sense to me. Which PHI node in the above post-early-tail-dup is problematic? I wonder if there's perhaps a later pass that's messing up, and this change is simply hiding that? Or if duplicating the `INLINEASM_BR` results in multiple side effects? (Even if the code is never run, the kernel uses `.pushsection` to squirrel away data in these, so tail duplication is duplicating the data. Is that a problem? Not sure yet.)
Later passes are messing up because tail duplication created a situation they're not able to handle.
%9:gr32 = PHI %0:gr32, %bb.1, %5:gr32, %bb.2
The problem is this PHI here. We don't have a way to resolve PHI nodes in the entry block to an indirect destination. (This is the reason we don't allow "asm goto" outputs on the indirect paths.) When we go to resolve this PHI node, there's no place to put the put the values that works in all situations; we can't split the critical edges and it's not valid to place the resolved instructions at the ends of the predecessor blocks.
We're also not using "asm goto" outputs in this test case, so I still don't understand why this PHI is problematic.
We don't have a way to resolve PHI nodes in the entry block to an indirect destination.
We should be able to catch cases where we have a MachineBasicBlock whose address is taken, yet starts with a phi, right? Then an assert or -verify-machineinstrs check should help avoid producing such cases.
diff --git a/llvm/lib/CodeGen/MachineVerifier.cpp b/llvm/lib/CodeGen/MachineVerifier.cpp index 870b4606e51d..dbd1e71c8619 100644 --- a/llvm/lib/CodeGen/MachineVerifier.cpp +++ b/llvm/lib/CodeGen/MachineVerifier.cpp @@ -2408,6 +2408,11 @@ void MachineVerifier::checkPHIOps(const MachineBasicBlock &MBB) { !PrInfo.isLiveOut(MO0.getReg())) report("PHI operand is not live-out from predecessor", &MO0, I); } + + if (MBB.isInlineAsmBrIndirectTarget()) + for (const MachineInstr &MI: Pre) + if (MI.getOpcode() == TargetOpcode::INLINEASM_BR) + report("PHI in successor of INLINEASM_BR predecassor", &MO1, I); } // Did we see all predecessors?
fails in the existing test cases:
Failed Tests (2): LLVM :: CodeGen/X86/callbr-asm-outputs.ll LLVM :: CodeGen/X86/callbr-asm-phi-placement.ll
Maybe there's a more specific case that's problematic?
I feel like we SHOULD be able to allow tail dup in this basic case:
declare void @foo() declare void @bar() define i32 @baz(i1 %arg_0) { br i1 %arg_0, label %bb1, label %bb2 bb1: call void @foo() br label %bb3 bb2: call void @bar() br label %bb3 bb3: %x = phi i32 [0, %bb1], [1, %bb2] callbr void asm sideeffect "jmp ${2:l}", "i,i,X,~{dirflag},~{fpsr},~{flags}"(i32 42, i1 false, i8* blockaddress(@baz, %bb4)) to label %bb5 [label %bb4] bb4: br label %bb5 bb5: ret i32 %x }
Prior to this patch and James' patch, we would tail duplicate that. With this change, we wont. (It looks to me like we get the codegen of that case correct).
When we go to resolve this PHI node, there's no place to put the put the values that works in all situations; we can't split the critical edges and it's not valid to place the resolved instructions at the ends of the predecessor blocks.
We can (at least: should be able to) handle a PHI node in the indirect targets of an INLINEASM_BR, for any value which is not the output of the INLINEASM_BR itself. We place the required register copies prior to the INLINEASM_BR, rather than after it. This is handled by findPHICopyInsertPoint -- where we did have a bug already, fixed in f7a53d82c0902147909f28a9295a9d00b4b27d38.
I also don't see anything wrong in this test-case. AFAICT, the phi looks to be handled correctly here. Perhaps it's been minimized too far, and is no longer showing the actual problem?
That doesn't matter. It's problematic because of the PHI in the block with the INLINEASM_BR. As I mentioned, the result of the tail duplication is that that PHI is more-or-less moved *down* into the indirect and default destinations. Look at those PHIs and you'll see that they have the same values coming into them.
We don't have a way to resolve PHI nodes in the entry block to an indirect destination.
We should be able to catch cases where we have a MachineBasicBlock whose address is taken, yet starts with a phi, right? Then an assert or -verify-machineinstrs check should help avoid producing such cases.
That's not what's wrong. A PHI is fine in the default branch. It shouldn't exist in the indirect branch, because *the back end can't currently handle it*. And even if it could, the semantics of the asm goto just don't allow for a PHI to exist in the indirect branch.
I disagree. You'll be inserting an instruction that won't be executed on the default path before the asm goto block. While this may be okay in a majority of cases, I don't suspect that it'll be good to rely on it being always okay.
I also don't see anything wrong in this test-case. AFAICT, the phi looks to be handled correctly here. Perhaps it's been minimized too far, and is no longer showing the actual problem?
The testcase is very simplified from the original. The testcase replicates the problematic tail duplication, but not the issue that caused the memory corruption. You can see the .ll files in the bug that's associated with it.
I've explained way too many times what this patch is doing. You and Nick seem to think that this patch is bad for some reason or that it's confusing. If so, then please present one of your own.
Here are the relevant parts of the original issue in get_partial_node. Blocks bb.16.bb100 and bb.17.bb106 are the sources of the values in the PHI node in bb.18.bb110. The bb.19.bb17.i.i.i block is the indirect destination and bb.20.bb113 the default destination.
Before Tail Duplication:
bb.16.bb100: ; predecessors: %bb.15 successors: %bb.18(0x80000000); %bb.18(100.00%) MOV64mr %21:gr64, 1, $noreg, 16, $noreg, %7:gr64 :: (store 8 into %ir.17) JMP_1 %bb.18 bb.17.bb106: ; predecessors: %bb.15 successors: %bb.18(0x80000000); %bb.18(100.00%) ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp %56:gr32 = MOV32r0 implicit-def dead $eflags $rdi = COPY %19:gr64 $rsi = COPY %7:gr64 $edx = COPY %56:gr32 CALL64pcrel32 @put_cpu_partial, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi, implicit $rsi, implicit $edx, implicit-def $rsp, implicit-def $ssp ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp bb.18.bb110: ; predecessors: %bb.17, %bb.16 successors: %bb.20(0x80000000), %bb.19(0x00000000); %bb.20(100.00%), %bb.19(0.00%) %16:gr64 = PHI %4:gr64, %bb.17, %13:gr64, %bb.16 INLINEASM_BR &"1:.byte 0x0f,0x1f,0x44,0x00,0\0A\09.pushsection __jump_table, \22aw\22 \0A\09 .balign 8 \0A\09.long 1b - ., ${2:l} - . \0A\09 .quad ${0:c} + ${1:c} - .\0A\09.popsection \0A\09" [sideeffect] [mayload] [attdialect], $0:[imm], @slub_debug_enabled, $1:[imm], 0, $2:[imm], blockaddress(@get_partial_node, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags, !11 JMP_1 %bb.20 bb.19.bb17.i.i.i (address-taken): ; predecessors: %bb.18 successors: %bb.20(0x7c000000), %bb.21(0x04000000); %bb.20(96.88%), %bb.21(3.12%) TEST32mi %19:gr64, 1, $noreg, 8, $noreg, 2166016, implicit-def $eflags :: (load 4 from %ir.20, align 8) JCC_1 %bb.21, 5, implicit $eflags JMP_1 %bb.20 bb.20.bb113: ; predecessors: %bb.18, %bb.19 successors: %bb.21(0x04000000), %bb.3(0x7c000000); %bb.21(3.12%), %bb.3(96.88%) %57:gr32 = MOV32rm %19:gr64, 1, $noreg, 44, $noreg :: (load 4 from %ir.22) %58:gr32 = SHR32r1 %57:gr32(tied-def 0), implicit-def dead $eflags %59:gr32 = SUB32rr %15:gr32(tied-def 0), killed %58:gr32, implicit-def $eflags JCC_1 %bb.3, 6, implicit $eflags JMP_1 %bb.21
After Tail Duplication:
bb.16.bb100: ; predecessors: %bb.15 successors: %bb.20(0x80000000), %bb.19(0x00000000); %bb.20(100.00%), %bb.19(0.00%) MOV64mr %21:gr64, 1, $noreg, 16, $noreg, %7:gr64 :: (store 8 into %ir.17) INLINEASM_BR &"1:.byte 0x0f,0x1f,0x44,0x00,0\0A\09.pushsection __jump_table, \22aw\22 \0A\09 .balign 8 \0A\09.long 1b - ., ${2:l} - . \0A\09 .quad ${0:c} + ${1:c} - .\0A\09.popsection \0A\09" [sideeffect] [mayload] [attdialect], $0:[imm], @slub_debug_enabled, $1:[imm], 0, $2:[imm], blockaddress(@get_partial_node, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags, !11 %60:gr64 = COPY %13:gr64 JMP_1 %bb.20 bb.17.bb106: ; predecessors: %bb.15 successors: %bb.20(0x80000000), %bb.19(0x00000000); %bb.20(100.00%), %bb.19(0.00%) ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp %56:gr32 = MOV32r0 implicit-def dead $eflags $rdi = COPY %19:gr64 $rsi = COPY %7:gr64 $edx = COPY %56:gr32 CALL64pcrel32 @put_cpu_partial, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit $rdi, implicit $rsi, implicit $edx, implicit-def $rsp, implicit-def $ssp ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp INLINEASM_BR &"1:.byte 0x0f,0x1f,0x44,0x00,0\0A\09.pushsection __jump_table, \22aw\22 \0A\09 .balign 8 \0A\09.long 1b - ., ${2:l} - . \0A\09 .quad ${0:c} + ${1:c} - .\0A\09.popsection \0A\09" [sideeffect] [mayload] [attdialect], $0:[imm], @slub_debug_enabled, $1:[imm], 0, $2:[imm], blockaddress(@get_partial_node, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags, !11 %61:gr64 = COPY %4:gr64 JMP_1 %bb.20 bb.19.bb17.i.i.i (address-taken): ; predecessors: %bb.16, %bb.17 successors: %bb.20(0x7c000000), %bb.21(0x04000000); %bb.20(96.88%), %bb.21(3.12%) %62:gr64 = PHI %60:gr64, %bb.16, %61:gr64, %bb.17 TEST32mi %19:gr64, 1, $noreg, 8, $noreg, 2166016, implicit-def $eflags :: (load 4 from %ir.20, align 8) JCC_1 %bb.21, 5, implicit $eflags JMP_1 %bb.20 bb.20.bb113: ; predecessors: %bb.19, %bb.16, %bb.17 successors: %bb.21(0x04000000), %bb.3(0x7c000000); %bb.21(3.12%), %bb.3(96.88%) %63:gr64 = PHI %62:gr64, %bb.19, %60:gr64, %bb.16, %61:gr64, %bb.17 %57:gr32 = MOV32rm %19:gr64, 1, $noreg, 44, $noreg :: (load 4 from %ir.22) %58:gr32 = SHR32r1 %57:gr32(tied-def 0), implicit-def dead $eflags %59:gr32 = SUB32rr %15:gr32(tied-def 0), killed %58:gr32, implicit-def $eflags JCC_1 %bb.3, 6, implicit $eflags JMP_1 %bb.21
You'll notice that the INLINEASM_BR was duplicated into its predecessors, and bb.17.bb106 was merged with bb.18.bb110 (the block containing the original INLINEASM_BR instruction).
Here's the code after PHI elimination:
bb.16.bb94: ; predecessors: %bb.14 successors: %bb.17(0x30000000), %bb.18(0x50000000); %bb.17(37.50%), %bb.18(62.50%) TEST64rr %4:gr64, %4:gr64, implicit-def $eflags JCC_1 %bb.18, 5, implicit killed $eflags JMP_1 %bb.17 bb.17.bb100: ; predecessors: %bb.16 successors: %bb.20(0x80000000), %bb.19(0x00000000); %bb.20(100.00%), %bb.19(0.00%) MOV64mr %21:gr64, 1, $noreg, 16, $noreg, killed %7:gr64 :: (store 8 into %ir.17) INLINEASM_BR &"1:.byte 0x0f,0x1f,0x44,0x00,0\0A\09.pushsection __jump_table, \22aw\22 \0A\09 .balign 8 \0A\09.long 1b - ., ${2:l} - . \0A\09 .quad ${0:c} + ${1:c} - .\0A\09.popsection \0A\09" [sideeffect] [mayload] [attdialect], $0:[imm], @slub_debug_enabled, $1:[imm], 0, $2:[imm], blockaddress(@get_partial_node, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def dead early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def dead early-clobber $eflags, !11 %60:gr64 = COPY killed %13:gr64 %69:gr64 = COPY %60:gr64 %70:gr64 = COPY killed %60:gr64 JMP_1 %bb.20 bb.18.bb106: ; predecessors: %bb.16 successors: %bb.20(0x80000000), %bb.19(0x00000000); %bb.20(100.00%), %bb.19(0.00%) ADJCALLSTACKDOWN64 0, 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp $rdi = COPY %19:gr64 $rsi = COPY killed %7:gr64 $edx = COPY %26:gr32 CALL64pcrel32 @put_cpu_partial, <regmask $bh $bl $bp $bph $bpl $bx $ebp $ebx $hbp $hbx $rbp $rbx $r12 $r13 $r14 $r15 $r12b $r13b $r14b $r15b $r12bh $r13bh $r14bh $r15bh $r12d $r13d $r14d $r15d $r12w $r13w $r14w $r15w $r12wh and 3 more...>, implicit $rsp, implicit $ssp, implicit killed $rdi, implicit killed $rsi, implicit killed $edx, implicit-def $rsp, implicit-def $ssp ADJCALLSTACKUP64 0, 0, implicit-def dead $rsp, implicit-def dead $eflags, implicit-def dead $ssp, implicit $rsp, implicit $ssp INLINEASM_BR &"1:.byte 0x0f,0x1f,0x44,0x00,0\0A\09.pushsection __jump_table, \22aw\22 \0A\09 .balign 8 \0A\09.long 1b - ., ${2:l} - . \0A\09 .quad ${0:c} + ${1:c} - .\0A\09.popsection \0A\09" [sideeffect] [mayload] [attdialect], $0:[imm], @slub_debug_enabled, $1:[imm], 0, $2:[imm], blockaddress(@get_partial_node, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def dead early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def dead early-clobber $eflags, !11 %61:gr64 = COPY killed %4:gr64 %69:gr64 = COPY %61:gr64 %70:gr64 = COPY killed %61:gr64 JMP_1 %bb.20 bb.19.bb17.i.i.i (address-taken): ; predecessors: %bb.17, %bb.18 successors: %bb.20(0x7c000000), %bb.21(0x04000000); %bb.20(96.88%), %bb.21(3.12%) %62:gr64 = COPY killed %69:gr64 TEST32mi %19:gr64, 1, $noreg, 8, $noreg, 2166016, implicit-def $eflags :: (load 4 from %ir.20, align 8) %70:gr64 = COPY %62:gr64 %71:gr64 = COPY killed %62:gr64 JCC_1 %bb.21, 5, implicit killed $eflags JMP_1 %bb.20 bb.20.bb113: ; predecessors: %bb.19, %bb.17, %bb.18 successors: %bb.21(0x04000000), %bb.5(0x7c000000); %bb.21(3.12%), %bb.5(96.88%) %63:gr64 = COPY killed %70:gr64 %42:gr16 = COPY killed %37.sub_16bit:gr32 %43:gr32 = MOVZX32rr16 killed %42:gr16 %14:gr32 = nsw SUB32rr killed %41:gr32(tied-def 0), killed %43:gr32, implicit-def dead $eflags %15:gr32 = ADD32rr killed %3:gr32(tied-def 0), killed %14:gr32, implicit-def dead $eflags %57:gr32 = MOV32rm %19:gr64, 1, $noreg, 44, $noreg :: (load 4 from %ir.22) %58:gr32 = SHR32r1 killed %57:gr32(tied-def 0), implicit-def dead $eflags CMP32rr %15:gr32, killed %58:gr32, implicit-def $eflags %64:gr32 = COPY killed %15:gr32 %65:gr64 = COPY %63:gr64 %66:gr64 = COPY killed %8:gr64 %71:gr64 = COPY killed %63:gr64 JCC_1 %bb.5, 6, implicit killed $eflags JMP_1 %bb.21
As you can see, the PHI elimination pushed the copy instructions into the wrong places. As I wrote to James, we shouldn't be trying to shove those copy instructions before the INLINEASM_BR, because that's contrary to how we normally resolve PHI nodes. I.e., the instruction is executed *only* if that branch is taken. The back-end relies upon this. So if we allow for PHI node resolution from the indirect branch, and the subsequent instructions are placed before the asm goto block, then it's violating that assumption. Not only is the code slower, but it may result in incorrect execution.
I disagree. You'll be inserting an instruction that won't be executed on the default path before the asm goto block. While this may be okay in a majority of cases, I don't suspect that it'll be good to rely on it being always okay.
Being able to handle a PHI in the indirect target block is required. We can, and _must_ be able to do so. E.g., consider two asm goto statements both having the same block as their indirect destination.
I've explained way too many times what this patch is doing. You and Nick seem to think that this patch is bad for some reason or that it's confusing. If so, then please present one of your own.
The problem is that what the patch is doing is not consistent with the description of the problem. If "PHIs on indirect targets are bad", this patch cannot be the right fix, because in that case, we'd need to actually prevent PHIs in indirect targets, which this doesn't do. It avoids only one way to cause them to be created.
However, looking at your last example, I do believe it's the _description_ which is wrong. That is -- the bug is actually in tail duplication, not with PHIs in indirect targets in general. (phew!) In the "After Tail Duplication" dump, you can see tail duplication has introduced a COPY after the INLINEASM_BR, which gets used in the indirect block. This is obviously not OK -- the COPY should need to be before the INLINEASM_BR in order to be used in IR block bb17.i.i.i.
INLINEASM_BR &"1:.byte 0x0f,0x1f,0x44,0x00,0\0A\09.pushsection __jump_table, \22aw\22 \0A\09 .balign 8 \0A\09.long 1b - ., ${2:l} - . \0A\09 .quad ${0:c} + ${1:c} - .\0A\09.popsection \0A\09" [sideeffect] [mayload] [attdialect], $0:[imm], @slub_debug_enabled, $1:[imm], 0, $2:[imm], blockaddress(@get_partial_node, %ir-block.bb17.i.i.i), $3:[clobber], implicit-def early-clobber $df, $4:[clobber], implicit-def early-clobber $fpsw, $5:[clobber], implicit-def early-clobber $eflags, !11
^^ This instruction can jump to bb.19.bb17.i.i.i
%61:gr64 = COPY %4:gr64
Therefore, this copy ^^ cannot be placed here validly.
JMP_1 %bb.20 bb.19.bb17.i.i.i (address-taken): ; predecessors: %bb.16, %bb.17 successors: %bb.20(0x7c000000), %bb.21(0x04000000); %bb.20(96.88%), %bb.21(3.12%) %62:gr64 = PHI %60:gr64, %bb.16, %61:gr64, %bb.17
^^^ Here's the use.
Here's a slightly modified version of your minimized test-case which exhibits these same invalid copy insertions. (The difference is using the phi variable somewhere else besides just that one phi.) I also got rid of the undefs, just because they seem icky --
declare void @foo() define i8* @test1(i8** %arg1, i8* %arg2) { bb: %i28.i = load i8*, i8** %arg1, align 8 %if = icmp ne i8* %i28.i, %arg2 br i1 %if, label %bb100, label %bb106 bb100: ; preds = %bb store i8* null, i8** %arg1, align 8 br label %bb110 bb106: ; preds = %bb call void @foo() br label %bb110 bb110: ; preds = %bb106, %bb100 %i10.1 = phi i8* [ %arg2, %bb106 ], [ %i28.i, %bb100 ] callbr void asm sideeffect "#$0 $1 $2", "i,i,X,~{dirflag},~{fpsr},~{flags}"(i32 42, i1 false, i8* blockaddress(@test1, %bb17.i.i.i)) to label %kmem_cache_has_cpu_partial.exit [label %bb17.i.i.i] bb17.i.i.i: ; preds = %bb110 br label %kmem_cache_has_cpu_partial.exit kmem_cache_has_cpu_partial.exit: ; preds = %bb110 ret i8* %i10.1 }
In all these sorts of bugs, it's useful to ask why it's not broken with invoke (in case it is!). In this case, it looks like what prevents this bug from occurring for invoke is that EH_LABEL is marked isNotDuplicable = 1, checked a few lines up from your change. So, good.
I think the fix you have here is OK for now, if you fix the test and commentary.
As a follow-on, we could re-allow tail duplication, but actually place the copies in the correct location, by calling llvm::findPHICopyInsertPoint within TailDuplicator::appendCopies.
One thing, I believe we're trying to turn "asm goto" into something it was never meant to be. If you look at gcc's output for this code:
int foo(int a, int b) { int x; if (a < 42) { x = 0; asm goto("# 1\n\t.data %l1" : : "r"(b) : : indirect); } if (b > 42) { x = 37; asm goto("# 2\n\t.data %l1" : : "r"(a) : : indirect); } indirect: x = a + b - x; ret: return x; }
GCC doesn't combine the indirect destination:
.file "asm-goto-phi.c" .text .p2align 4 .globl foo .type foo, @function foo: .LFB0: .cfi_startproc movl %esi, %eax cmpl $41, %edi jg .L2 #APP # 6 "asm-goto-phi.c" 1 # 1 .data .L5 # 0 "" 2 #NO_APP .L2: xorl %edx, %edx cmpl $42, %eax jle .L3 #APP # 10 "asm-goto-phi.c" 1 # 2 .data .L4 # 0 "" 2 #NO_APP .L4: movl $37, %edx .L3: addl %edi, %eax subl %edx, %eax ret .p2align 4,,10 .p2align 3 .L5: xorl %edx, %edx addl %edi, %eax subl %edx, %eax ret .cfi_endproc .LFE0: .size foo, .-foo .ident "GCC: (Debian 9.3.0-14) 9.3.0" .section .note.GNU-stack,"",@progbits
This side steps the issue of resolving PHI nodes.
But whatever. You guys come up with something.
- EDIT ** : Bah, had typed all this out before seeing the past 3 posts...
Thanks for the additional context. I assert that that fuller demonstration does indeed have a bug in tail duplication, one that is not exposed by the current test case, so @jyknight may be correct that the current test case was over reduced. From the After tail duplication linked above:
Take a look at the PHI in bb.19.bb17.i.i.i. The values it gets (via the indirect branch of the INLINEASM_BR) are garbage. Pared down to highlight what I'm referring to:
bb.16: ... INLINEASM_BR ; conditional jump to %bb.19 %60:gr64 = COPY %13:gr64 JMP_1 %bb.20 bb.17: ... INLINEASM_BR ; conditional jump to %bb.19 %61:gr64 = COPY %4:gr64 JMP_1 %bb.20 bb.19: %62:gr64 = PHI %60:gr64, %bb.16, %61:gr64, %bb.17 ...
Either way, if we take the indirect branch (via the assembly within the INLINEASM_BR), whatever the results of the PHI of %62 resolve to are going to be hella wrong, since the COPYs occurred after the INLINEASM_BRs. Instead, the COPYs should have occurred before the INLINEASM_BR.
My dump of the current test case post tail duplication above contains no such COPY related bug. I think if the test case was updated to demonstrate that, then we'd have a starting point to discuss whether we can properly resolve PHIs or not.
That we don't have a -verify-machineinstrs pass that detects that case is also 💩 💩 💩 . We should be able to walk backwards up the use-def chain of any PHI and find a COPY or def, I suppose.
we shouldn't be trying to shove those copy instructions before the INLINEASM_BR, ...
Disagree. Without doing so we've generated garbage.
... because that's contrary to how we normally resolve PHI nodes. The back-end relies upon this.
I'll take your word on that. TBH, I don't understand why the COPY's are necessary...why don't we just transform %62 to %62 = PHI %13, %bb.16, %4, %bb.17? Trying to minimize live ranges, perhaps?
So if we allow for PHI node resolution from the indirect branch, and the subsequent instructions are placed before the asm goto block, then it's violating that assumption.
Not "[all of] the subsequent instructions," can we just move the newly created COPY to be before the INLINEASM_BR?
Not only is the code slower, but it may result in incorrect execution.
How can hoisting COPYs to be above the INLINEASM_BR that might branch to a block whose PHI will use the result lead to incorrect execution? Slower, maybe, but then I think my question about %62 would help there.
Wait, @void, was your above MIR dumped with f7a53d82c0902147909f28a9295a9d00b4b27d38 not applied, perhaps? Say if you had git checkout 4b0aa5724fea for instance to test the results of the initial "INLINEASM_BR is no longer a terminator" change? Oh, TailDuplicator does not use findPHICopyInsertPoint, so it may have exactly the same bug!
Yeah! I was looking at TailDuplicator::appendCopies before thinking "I bet MachineBasicBlock::iterator Loc = MBB->getFirstTerminator(); behaves differently after @jyknight 's 4b0aa5724fea." What if that was updated with sauce from findPHICopyInsertPoint? (It would need to know the successor BB as it's only given the predecessor BB currently). I bet that's it. It would be nice to reuse findPHICopyInsertPoint if possible (we may not want to hoist all COPYs in CopyInfos, but I'm not sure).
In all these sorts of bugs, it's useful to ask why it's not broken with invoke (in case it is!). In this case, it looks like what prevents this bug from occurring for invoke is that EH_LABEL is marked isNotDuplicable = 1, checked a few lines up from your change. So, good.
Ah, that's another thing I was looking at when I said
Or if duplicating the INLINEASM_BR results in multiple side effects? (Even if the code is never run, the kernel uses .pushsection to squirrel away data in these, so tail duplication is duplicating the data. Is that a problem? Not sure yet.)
... That is -- the bug is actually in tail duplication ...
I think the fix you have here is OK for now, if you fix the test and commentary.
As a follow-on, we could re-allow tail duplication, but actually place the copies in the correct location, by calling llvm::findPHICopyInsertPoint within TailDuplicator::appendCopies.
I agree.
llvm/lib/CodeGen/TailDuplicator.cpp | ||
---|---|---|
630–634 | I'd update this to say something to the effect of: TailDuplicator::appendCopies() will erroneously place COPYs post INLINEASM_BR instructions after 4b0aa5724fea, demonstrating the same bug that was fixed in findPHICopyInsertPoint() in f7a53d82c090. FIXME: use findPHICopyInsertPoint() in TailDuplicator::appendCopies() rather than MachineBasicBlock::getFirstTerminator() to find the correct insertion point for the COPY when replacing PHIs. With that comment block updated, the commit message updated, and the altered test case from @jyknight , we'd be all set to land this and close out the clang-11 release. Then we have plenty of time to follow up on the FIXME for clang-12. WDYT? |
Okay. Abandoning this change.
Sorry, maybe I buried the lede too far into the message? "I think the fix you have here is OK" was at the end. :)
One thing, I believe we're trying to turn "asm goto" into something it was never meant to be. If you look at gcc's output for this code.
Certainly, as discussed on the email thread with Segher, Clang/LLVM has interpreted a restriction on the semantics for asm-goto which GCC does _not_ have. LLVM treats asm-goto similar to the block-label and indirect goto features -- and thus requires that a critical edge to the indirect target cannot be split by making a new target block. GCC does not have such a restriction. We can discuss whether we should modify the semantics of asm-goto in Clang, but not part of this bugfix.
This side steps the issue of resolving PHI nodes.
Sure. But, in LLVM, PHIs also appear in a landingpad, which has the exact same behavior. "findPHICopyInsertPoint" being able to place a COPY earlier in a block is not something brand-new for asm-goto. The PHI code is intended to be able to handle this placement, and does so correctly, in this example at least.
LGTM; I did quick ninja check-all, and build+boot tests of the problematic config reported in https://github.com/ClangBuiltLinux/linux/issues/1125, as well as x86_64 and aarch64 build+boot tests of linux-next. Everything LGTM. I'll continue to run more tests overnight just in case, but this is a fairly unobtrusive patch IMO. Thanks for narrowing down on the problematic object file, function, and pass so quickly!
I'd update this to say something to the effect of:
With that comment block updated, the commit message updated, and the altered test case from @jyknight , we'd be all set to land this and close out the clang-11 release. Then we have plenty of time to follow up on the FIXME for clang-12. WDYT?