This is an archive of the discontinued LLVM Phabricator instance.

Change the INLINEASM_BR MachineInstr to be a non-terminating instruction.
ClosedPublic

Authored by jyknight on May 12 2020, 9:58 AM.

Details

Summary

Before this instruction supported output values, it fit fairly
naturally as a terminator. However, being a terminator while also
supporting outputs causes some trouble, as the physreg->vreg COPY
operations cannot be in the same block.

Modeling it as a non-terminator allows it to be handled the same way
as invoke is handled already.
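For illustration, a hedged sketch of the IR shape involved (the function, labels, and constraint string here are invented, mirroring the style of the test case later in this thread): a callbr with an output returns its result in a physical register, so ISel must insert a physreg-to-vreg COPY immediately after the INLINEASM_BR, and a terminator cannot be followed by ordinary instructions in the same block.

```llvm
define i32 @sketch(i32 %x) {
entry:
  ; The "=r" output arrives in a physical register; ISel must emit a
  ; physreg->vreg COPY after INLINEASM_BR, in this same MachineBasicBlock.
  %res = callbr i32 asm "# payload elided", "=r,r,X"(i32 %x,
             i8* blockaddress(@sketch, %indirect))
         to label %fallthrough [label %indirect]

fallthrough:                      ; default destination, like invoke's normal dest
  ret i32 %res

indirect:                         ; like invoke's unwind dest; %res is not valid here
  ret i32 -1
}
```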

Most of the changes here were created by auditing all the existing
users of MachineBasicBlock::isEHPad() and
MachineBasicBlock::hasEHPadSuccessor(), and adding calls to
isInlineAsmBrIndirectTarget or hasInlineAsmBr, as appropriate.

Diff Detail

Event Timeline

jyknight created this revision. May 12 2020, 9:58 AM
Herald added a project: Restricted Project. · View Herald Transcript · May 12 2020, 9:58 AM
nickdesaulniers added a comment (edited). May 12 2020, 11:51 AM

An interesting approach. It certainly seems to simplify some of the special handling. I'm happy that it seems to delete much of the existing code. Thank you for taking the time to write up this patch.

I don't fully understand what "shrink wrapping" is, or the changes to BranchFolding, but the rest of the patch looks pretty good to me. This obviously has implications for asm goto without outputs, so I'd like to run this through a couple kernel builds to ensure we haven't regressed anything. From there, I can test kernel builds that use outputs.

Three other things to check:

  1. check we don't spill post terminators (https://reviews.llvm.org/D78166). Probably no issue there.
  2. check that BranchFolder::RemoveDeadBlock isn't removing any MachineBasicBlock that have their address taken. (https://reviews.llvm.org/D78234) tries to do this, but we've seen cases where asm goto w/ outputs results in the indirect successor being removed by BranchFolder. We have a case from the kernel that triggered the above two patches, and is still a problem for https://reviews.llvm.org/D75098 that I can send you.
  3. That live ins match live outs (https://reviews.llvm.org/D78586). Probably no issue there.
llvm/include/llvm/CodeGen/MachineBasicBlock.h
486

I think @efriedma has been noting that the API at the MIR level doesn't feel symmetric with the LLVM IR level.

https://reviews.llvm.org/D78234#1987870 and
https://reviews.llvm.org/D78234#1989382

In LLVM IR, you have BasicBlock::HasAddressTaken, but at the MIR level the operands are still BlockAddress (which reference a Function and BasicBlock, two LLVM IR level concepts). It's too bad we don't lower these to just MachineBasicBlocks (or a new MachineBlockAddress) as operands, and have equivalent machinery for detecting whether a MachineBasicBlock has its address taken.

llvm/include/llvm/Target/Target.td
1023

Delete

llvm/lib/CodeGen/BranchFolding.cpp
1705

That's neat, I didn't know you could use initializer lists as ranges for range-based for loops.
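For readers unfamiliar with the idiom, a standalone sketch (names are illustrative): a braced initializer list can serve directly as the range expression of a range-based for loop, which is handy for visiting a small fixed set of objects such as the two successor blocks in BranchFolding.

```cpp
#include <vector>

// The braced list deduces to std::initializer_list<int>, which provides
// begin()/end(), so the loop can iterate over it directly.
std::vector<int> gatherViaInitList(int a, int b, int c) {
  std::vector<int> seen;
  for (int v : {a, b, c})
    seen.push_back(v);
  return seen;
}
```

gatherViaInitList(1, 2, 3) yields {1, 2, 3}, visiting the values in the order written.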

llvm/lib/CodeGen/MachineBasicBlock.cpp
281–284
return any_of(successors(), [](const MachineBasicBlock *Succ) {
  return Succ->isInlineAsmBrIndirectTarget();
});

or better yet, why not be precise and iterate terminators() checking the getOpcode() == TargetOpcode::INLINEASM_BR?

llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.h
410

Unused?

llvm/lib/CodeGen/SplitKit.cpp
100

Is this assignment ok to remove?

llvm/lib/Target/Hexagon/BitTracker.cpp
959–960
DefaultToAll |= B.hasInlineAsmBr();
llvm/test/CodeGen/ARM/ifcvt-diamond-unanalyzable-common.mir
66

INLINEASM_BR just disappears from this test?

llvm/test/CodeGen/X86/callbr-asm-outputs.ll
60

does this -1 get overwritten immediately on the next instruction?

An interesting approach. It certainly seems to simplify some of the special handling. I'm happy that it seems to delete much of the existing code. Thank you for taking the time to write up this patch.

This is funny. The original asm goto patches started off with just using INLINEASM and trying to follow exception handling. We switched to INLINEASM_BR as a terminator to "simplify" things.

kparzysz added inline comments. May 12 2020, 12:45 PM
llvm/lib/Target/Hexagon/BitTracker.cpp
958

It would defeat the purpose of this function. It calculates the set of possible targets for each branch, given the updated register states (i.e. branch conditions), which can be a proper subset of the set of targets listed in the branch.

959–960

Ok.

llvm/lib/Target/Hexagon/HexagonConstPropagation.cpp
758

Same reason as in BitTracker.

760

Ok.

817

...

820

Ok.

825

Ok.

void added a comment. May 12 2020, 6:59 PM

I think we should rename INLINEASM_BR to something that doesn't involve the _BR bit, since it's no longer a branch.

In general, I'm okay with this approach. (I thought we shouldn't make it a terminator in the first place.)

llvm/lib/CodeGen/MachineBasicBlock.cpp
281–284

It's no longer a terminator, but could be written:

return any_of(*this, [](const MachineInstr &MI) {
  return MI.getOpcode() == TargetOpcode::INLINEASM_BR;
});
288

Is this correct? We should be able to hoist into a block with INLINEASM_BR if it's coming from the default target.

llvm/lib/CodeGen/SelectionDAG/ScheduleDAGSDNodes.cpp
1031

Yay!

With this series applied (all 3), I can build and boot an x86 defconfig Linux kernel that makes use of asm goto with outputs.

without outputs, just vanilla asm goto:

x86 allyesconfig builds (we don't try to boot allyesconfigs).

aarch32 defconfig kernels build and boot.

aarch64 defconfig kernels build, but they've regressed booting. No panic, just a dead machine. That will take a while to debug.

void added a comment. May 21 2020, 1:04 PM

Quick ping to ask about the status of this change. :-)

jyknight marked an inline comment as done. May 21 2020, 3:05 PM

In the review on the prerequisite patch, it turned out there's another thing that needs to be fixed before this one -- disambiguating whether the end of a block falls through or is actually unreachable.

I'm working on creating a patch to do that, after which I'll get back to this patch to address the comments here.

jyknight updated this revision to Diff 270500. Jun 12 2020, 1:08 PM

Just rebase. (Doesn't address review comments)

void added inline comments. Jun 12 2020, 11:18 PM
llvm/test/CodeGen/X86/callbr-asm-outputs.ll
60

It's the same on line 52. Is it a problem with the PHI elimination pass?

void added a comment. Jun 14 2020, 9:32 PM

I think you need the following patch. We shouldn't allow instructions to be rescheduled after an INLINEASM_BR, because all values whose definitions dominate the INLINEASM_BR are assumed to be valid on the indirect branch.

diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0f21c97a30f..8cbbd79b661 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -443,7 +443,8 @@ static bool isSchedBoundary(MachineBasicBlock::iterator MI,
                             MachineBasicBlock *MBB,
                             MachineFunction *MF,
                             const TargetInstrInfo *TII) {
-  return MI->isCall() || TII->isSchedulingBoundary(*MI, MBB, *MF);
+  return MI->isCall() || TII->isSchedulingBoundary(*MI, MBB, *MF) ||
+      (MI->isInlineAsm() && MI->getOpcode() == TargetOpcode::INLINEASM_BR);
 }
 
 /// A region of an MBB for scheduling.
diff --git a/llvm/test/CodeGen/X86/callbr-asm-outputs.ll b/llvm/test/CodeGen/X86/callbr-asm-outputs.ll
index 61baa31074e..a4447bc15f1 100644
--- a/llvm/test/CodeGen/X86/callbr-asm-outputs.ll
+++ b/llvm/test/CodeGen/X86/callbr-asm-outputs.ll
@@ -41,6 +41,7 @@ define i32 @test2(i32 %out1, i32 %out2) {
 ; CHECK-NEXT:    .cfi_offset %edi, -8
 ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %edi
 ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; CHECK-NEXT:    movl $-1, %eax
 ; CHECK-NEXT:    cmpl %edi, %esi
 ; CHECK-NEXT:    jge .LBB1_2
 ; CHECK-NEXT:  # %bb.1: # %if.then
@@ -49,7 +50,6 @@ define i32 @test2(i32 %out1, i32 %out2) {
 ; CHECK-NEXT:    testl %edi, %esi
 ; CHECK-NEXT:    jne .Ltmp1
 ; CHECK-NEXT:    #NO_APP
-; CHECK-NEXT:    movl $-1, %eax
 ; CHECK-NEXT:    jmp .LBB1_3
 ; CHECK-NEXT:  .LBB1_2: # %if.else
 ; CHECK-NEXT:    #APP
@@ -57,7 +57,6 @@ define i32 @test2(i32 %out1, i32 %out2) {
 ; CHECK-NEXT:    testl %esi, %edi
 ; CHECK-NEXT:    jne .Ltmp2
 ; CHECK-NEXT:    #NO_APP
-; CHECK-NEXT:    movl $-1, %eax
 ; CHECK-NEXT:  .LBB1_3:
 ; CHECK-NEXT:    movl %esi, %eax
 ; CHECK-NEXT:    addl %edi, %eax

@nathanchance tested this (version Diff 270500) and also observed boot failures for mainline Linux (v5.8-rc1) on aarch64, as well as for the aarch64 5.4 LTS stable kernel and the WSL2 kernel for x86_64. I have only briefly started debugging; there were some unrelated issues with RELR relocations and PAC/BTI aarch64 ISA extensions.

@void 's additional diff fixes the arm64 mainline boot issue I was observing. Maybe @nathanchance can help verify that, with that additional diff applied atop this patch, we're in good shape?

@void 's additional diff fixes the arm64 mainline boot issue I was observing. Maybe @nathanchance can help verify that, with that additional diff applied atop this patch, we're in good shape?

Unfortunately, my test case is still broken. With https://gist.github.com/nathanchance/f69a0281d63d6e72a1449d2f5a98636b on top of 1cfdda57fa63dd6d770ecb4411bd4d2b59e78544, this kernel boots (v5.7):

$ make -skj"$(nproc)" LLVM=1 O=out/x86_64 distclean defconfig bzImage
...

$ ../../cbl/github/boot-utils/boot-qemu.sh -a x86_64 -k out/x86_64 -t 30s |& grep -A1 qemu-system-x86_64
+ timeout --foreground 30s unbuffer qemu-system-x86_64 -cpu host -d unimp,guest_errors -enable-kvm -smp 64 -append 'console=ttyS0 ' -display none -initrd /home/nathan/cbl/github/boot-utils/images/x86_64/rootfs.cpio -kernel /home/nathan/src/linux/out/x86_64/arch/x86_64/boot/bzImage -m 512m -nodefaults -serial mon:stdio
[    0.000000] Linux version 5.7.0 (nathan@ubuntu-n2-xlarge-x86) (ClangBuiltLinux clang version 11.0.0 (https://github.com/llvm/llvm-project 07bde6d2c4646c888b1011aa079bbaaa250f79b8), LLD 11.0.0 (https://github.com/llvm/llvm-project 07bde6d2c4646c888b1011aa079bbaaa250f79b8)) #1 SMP Fri Jun 19 22:07:10 MST 2020

but this one does not:

$ make -skj"$(nproc)" KCFLAGS=-march=znver2 LLVM=1 LOCALVERSION=-znver2 O=out/x86_64 distclean defconfig bzImage
...

$ ../../cbl/github/boot-utils/boot-qemu.sh -a x86_64 -k out/x86_64 -t 30s |& grep -A1 qemu-system-x86_64
+ timeout --foreground 30s unbuffer qemu-system-x86_64 -cpu host -d unimp,guest_errors -enable-kvm -smp 64 -append 'console=ttyS0 ' -display none -initrd /home/nathan/cbl/github/boot-utils/images/x86_64/rootfs.cpio -kernel /home/nathan/src/linux/out/x86_64/arch/x86_64/boot/bzImage -m 512m -nodefaults -serial mon:stdio
+ RET=124

At 1cfdda57fa63dd6d770ecb4411bd4d2b59e78544, both kernels boot without any issue.

$ make -skj"$(nproc)" LLVM=1 O=out/x86_64 distclean defconfig bzImage
...

$ ../../cbl/github/boot-utils/boot-qemu.sh -a x86_64 -k out/x86_64 -t 30s |& grep -A1 qemu-system-x86_64
+ timeout --foreground 30s unbuffer qemu-system-x86_64 -cpu host -d unimp,guest_errors -enable-kvm -smp 64 -append 'console=ttyS0 ' -display none -initrd /home/nathan/cbl/github/boot-utils/images/x86_64/rootfs.cpio -kernel /home/nathan/src/linux/out/x86_64/arch/x86_64/boot/bzImage -m 512m -nodefaults -serial mon:stdio
[    0.000000] Linux version 5.7.0 (nathan@ubuntu-n2-xlarge-x86) (ClangBuiltLinux clang version 11.0.0 (https://github.com/llvm/llvm-project 1cfdda57fa63dd6d770ecb4411bd4d2b59e78544), LLD 11.0.0 (https://github.com/llvm/llvm-project 1cfdda57fa63dd6d770ecb4411bd4d2b59e78544)) #1 SMP Fri Jun 19 22:02:11 MST 2020
$ make -skj"$(nproc)" KCFLAGS=-march=znver2 LLVM=1 LOCALVERSION=-znver2 O=out/x86_64 distclean defconfig bzImage
...

$ ../../cbl/github/boot-utils/boot-qemu.sh -a x86_64 -k out/x86_64 -t 30s |& grep -A1 qemu-system-x86_64
+ timeout --foreground 30s unbuffer qemu-system-x86_64 -cpu host -d unimp,guest_errors -enable-kvm -smp 64 -append 'console=ttyS0 ' -display none -initrd /home/nathan/cbl/github/boot-utils/images/x86_64/rootfs.cpio -kernel /home/nathan/src/linux/out/x86_64/arch/x86_64/boot/bzImage -m 512m -nodefaults -serial mon:stdio
[    0.000000] Linux version 5.7.0-znver2 (nathan@ubuntu-n2-xlarge-x86) (ClangBuiltLinux clang version 11.0.0 (https://github.com/llvm/llvm-project 1cfdda57fa63dd6d770ecb4411bd4d2b59e78544), LLD 11.0.0 (https://github.com/llvm/llvm-project 1cfdda57fa63dd6d770ecb4411bd4d2b59e78544)) #1 SMP Fri Jun 19 22:00:23 MST 2020
void added a comment. Jun 21 2020, 4:27 AM

@nathanchance & @nickdesaulniers: I forgot the post scheduler. Try the patch below.

diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 0f21c97a30f..e28d25bbfae 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -443,7 +443,8 @@ static bool isSchedBoundary(MachineBasicBlock::iterator MI,
                             MachineBasicBlock *MBB,
                             MachineFunction *MF,
                             const TargetInstrInfo *TII) {
-  return MI->isCall() || TII->isSchedulingBoundary(*MI, MBB, *MF);
+  return MI->isCall() || MI->getOpcode() == TargetOpcode::INLINEASM_BR ||
+      TII->isSchedulingBoundary(*MI, MBB, *MF);
 }
 
 /// A region of an MBB for scheduling.
diff --git a/llvm/lib/CodeGen/PostRASchedulerList.cpp b/llvm/lib/CodeGen/PostRASchedulerList.cpp
index b85f00a61ea..1a2a1d753cd 100644
--- a/llvm/lib/CodeGen/PostRASchedulerList.cpp
+++ b/llvm/lib/CodeGen/PostRASchedulerList.cpp
@@ -338,7 +338,8 @@ bool PostRAScheduler::runOnMachineFunction(MachineFunction &Fn) {
       // Calls are not scheduling boundaries before register allocation, but
       // post-ra we don't gain anything by scheduling across calls since we
       // don't need to worry about register pressure.
-      if (MI.isCall() || TII->isSchedulingBoundary(MI, &MBB, Fn)) {
+      if (MI.isCall() || MI.getOpcode() == TargetOpcode::INLINEASM_BR ||
+          TII->isSchedulingBoundary(MI, &MBB, Fn)) {
         Scheduler.enterRegion(&MBB, I, Current, CurrentCount - Count);
         Scheduler.setEndIndex(CurrentCount);
         Scheduler.schedule();
diff --git a/llvm/test/CodeGen/X86/callbr-asm-outputs.ll b/llvm/test/CodeGen/X86/callbr-asm-outputs.ll
index 61baa31074e..a4447bc15f1 100644
--- a/llvm/test/CodeGen/X86/callbr-asm-outputs.ll
+++ b/llvm/test/CodeGen/X86/callbr-asm-outputs.ll
@@ -41,6 +41,7 @@ define i32 @test2(i32 %out1, i32 %out2) {
 ; CHECK-NEXT:    .cfi_offset %edi, -8
 ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %edi
 ; CHECK-NEXT:    movl {{[0-9]+}}(%esp), %esi
+; CHECK-NEXT:    movl $-1, %eax
 ; CHECK-NEXT:    cmpl %edi, %esi
 ; CHECK-NEXT:    jge .LBB1_2
 ; CHECK-NEXT:  # %bb.1: # %if.then
@@ -49,7 +50,6 @@ define i32 @test2(i32 %out1, i32 %out2) {
 ; CHECK-NEXT:    testl %edi, %esi
 ; CHECK-NEXT:    jne .Ltmp1
 ; CHECK-NEXT:    #NO_APP
-; CHECK-NEXT:    movl $-1, %eax
 ; CHECK-NEXT:    jmp .LBB1_3
 ; CHECK-NEXT:  .LBB1_2: # %if.else
 ; CHECK-NEXT:    #APP
@@ -57,7 +57,6 @@ define i32 @test2(i32 %out1, i32 %out2) {
 ; CHECK-NEXT:    testl %esi, %edi
 ; CHECK-NEXT:    jne .Ltmp2
 ; CHECK-NEXT:    #NO_APP
-; CHECK-NEXT:    movl $-1, %eax
 ; CHECK-NEXT:  .LBB1_3:
 ; CHECK-NEXT:    movl %esi, %eax
 ; CHECK-NEXT:    addl %edi, %eax

@void that diff on top of this revision resolves the issue I was seeing, thanks!

+ timeout --foreground 30s unbuffer qemu-system-x86_64 -cpu host -d unimp,guest_errors -enable-kvm -smp 64 -append 'console=ttyS0 ' -display none -initrd /home/nathan/cbl/github/boot-utils/images/x86_64/rootfs.cpio -kernel /home/nathan/src/linux/out/x86_64/arch/x86_64/boot/bzImage -m 512m -nodefaults -serial mon:stdio
[    0.000000] Linux version 5.7.0-znver2 (nathan@ubuntu-n2-xlarge-x86) (ClangBuiltLinux clang version 11.0.0 (https://github.com/llvm/llvm-project a3c12c2214b98696d2f465d9413d91dfcd02160c), LLD 11.0.0 (https://github.com/llvm/llvm-project a3c12c2214b98696d2f465d9413d91dfcd02160c)) #1 SMP Sun Jun 21 12:00:34 MST 2020

@void that diff on top of this revision resolves the issue I was seeing, thanks!

That's great! @void is there a test case we can add for that post RA schedule, so that we don't regress that? @jyknight would you mind rolling @void 's changes up into this and rebasing, please? I'm curious if there's anything we can do to combine isSchedBoundary and isSchedulingBoundary?

void added a comment. Jun 22 2020, 3:10 PM

@void that diff on top of this revision resolves the issue I was seeing, thanks!

That's great! @void is there a test case we can add for that post RA schedule, so that we don't regress that? @jyknight would you mind rolling @void 's changes up into this and rebasing, please? I'm curious if there's anything we can do to combine isSchedBoundary and isSchedulingBoundary?

I'm working on getting the test case ready.

@void that diff on top of this revision resolves the issue I was seeing, thanks!

That's great! @void is there a test case we can add for that post RA schedule, so that we don't regress that? @jyknight would you mind rolling @void 's changes up into this and rebasing, please? I'm curious if there's anything we can do to combine isSchedBoundary and isSchedulingBoundary?

Here's the testcase:

$ more llvm/test/CodeGen/X86/callbr-asm-instr-scheduling.ll
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py
; RUN: llc -mtriple=x86_64-unknown-linux-gnu -verify-machineinstrs -mcpu=znver2 -O2 -frame-pointer=none < %s | FileCheck %s

; Make sure that instructions aren't scheduled after the "callbr". In the
; example below, we don't want the "shrxq" through "leaq" instructions to be
; moved after the "callbr".

%struct.cpuinfo_x86 = type { i8, i8, i8, i8, i32, [3 x i32], i8, i8, i8, i8, i32, i32, %union.anon.83, [16 x i8], [64 x i8], i32, i32, i32, i32, i32, i32, i64, i16, i16, i16, i16, i16, i16, i16, i16, i16, i16, i16, i32, i8, i8 }
%union.anon.83 = type { i64, [72 x i8] }
%struct.pgd_t = type { i64 }
%struct.p4d_t = type { i64 }
%struct.pud_t = type { i64 }

@boot_cpu_data = external dso_local global %struct.cpuinfo_x86, align 8
@page_offset_base = external dso_local local_unnamed_addr global i64, align 8
@pgdir_shift = external dso_local local_unnamed_addr global i32, align 4
@__force_order = external dso_local global i64, align 8
@ptrs_per_p4d = external dso_local local_unnamed_addr global i32, align 4

define i64 @early_ioremap_pmd(i64 %addr) {
; CHECK-LABEL: early_ioremap_pmd:
; CHECK:       # %bb.0: # %entry
; CHECK-NEXT:    #APP
; CHECK-NEXT:    movq %cr3, %rax
; CHECK-EMPTY:
; CHECK-NEXT:    #NO_APP
; CHECK-NEXT:    movabsq $9223372036854771712, %rdx # imm = 0x7FFFFFFFFFFFF000
; CHECK-NEXT:    andq %rax, %rdx
; CHECK-NEXT:    movb {{.*}}(%rip), %al
; CHECK-NEXT:    movq {{.*}}(%rip), %rcx
; CHECK-NEXT:    shrxq %rax, %rdi, %rax
; CHECK-NEXT:    addq %rcx, %rdx
; CHECK-NEXT:    andl $511, %eax # imm = 0x1FF
; CHECK-NEXT:    leaq (%rdx,%rax,8), %rax
; CHECK-NEXT:    #APP
; CHECK-NEXT:  .Ltmp2:
; CHECK-NEXT:    jmp .Ltmp3
; CHECK-NEXT:  .Ltmp4:
; CHECK-NEXT:    .zero (-(((.Ltmp5-.Ltmp6)-(.Ltmp4-.Ltmp2))>0))*((.Ltmp5-.Ltmp6)-(.Ltmp4-.Ltmp2)),144
; CHECK-NEXT:  .Ltmp7:
entry:
  %0 = tail call i64 asm sideeffect "mov %cr3,$0\0A\09", "=r,=*m,~{dirflag},~{fpsr},~{flags}"(i64* nonnull @__force_order)
  %and.i = and i64 %0, 9223372036854771712
  %1 = load i64, i64* @page_offset_base, align 8
  %add = add i64 %and.i, %1
  %2 = inttoptr i64 %add to %struct.pgd_t*
  %3 = load i32, i32* @pgdir_shift, align 4
  %sh_prom = zext i32 %3 to i64
  %shr = lshr i64 %addr, %sh_prom
  %and = and i64 %shr, 511
  %arrayidx = getelementptr %struct.pgd_t, %struct.pgd_t* %2, i64 %and
  callbr void asm sideeffect "1: jmp 6f\0A2:\0A.skip -(((5f-4f) - (2b-1b)) > 0) * ((5f-4f) - (2b-1b)),0x90\0A3:\0A.section .altinstructions,\22a\22\0A .long 1b - .\0A .long 4f - .\0A .word ${1:P}\0A .byte 3b - 1b\0A .byte 5f - 4f\0A .byte 3b - 2b\0A.previous\0A.section .altinstr_replacement,\22ax\22\0A4: jmp ${5:l}\0A5:\0A.previous\0A.section .altinstructions,\22a\22\0A .long 1b - .\0A .long 0\0A .word ${0:P}\0A .byte 3b - 1b\0A .byte 0\0A .byte 0\0A.previous\0A.section .altinstr_aux,\22ax\22\0A6:\0A testb $2,$3\0A jnz ${4:l}\0A jmp ${5:l}\0A.previous\0A", "i,i,i,*m,X,X,~{dirflag},~{fpsr},~{flags}"(i16 528, i32 117, i32 1, i8* getelementptr inbounds (%struct.cpuinfo_x86, %struct.cpuinfo_x86* @boot_cpu_data, i64 0, i32 12, i32 1, i64 58), i8* blockaddress(@early_ioremap_pmd, %if.end.i), i8* blockaddress(@early_ioremap_pmd, %if.then.i))
          to label %_static_cpu_has.exit.thread.i [label %if.end.i, label %if.then.i]

_static_cpu_has.exit.thread.i:                    ; preds = %entry
  br label %if.end.i

if.then.i:                                        ; preds = %entry
  %4 = bitcast %struct.pgd_t* %arrayidx to %struct.p4d_t*
  br label %p4d_offset.exit

if.end.i:                                         ; preds = %_static_cpu_has.exit.thread.i, %entry
  %coerce.dive.i = getelementptr inbounds %struct.pgd_t, %struct.pgd_t* %arrayidx, i64 0, i32 0
  %5 = load i64, i64* %coerce.dive.i, align 8
  %6 = inttoptr i64 %5 to %struct.p4d_t*
  %7 = load i32, i32* @ptrs_per_p4d, align 4
  %sub.i.i = add i32 %7, 33554431
  %8 = and i32 %sub.i.i, 33554431
  %and.i1.i = zext i32 %8 to i64
  %add.ptr.i = getelementptr %struct.p4d_t, %struct.p4d_t* %6, i64 %and.i1.i
  br label %p4d_offset.exit

p4d_offset.exit:                                  ; preds = %if.end.i, %if.then.i
  %retval.0.i = phi %struct.p4d_t* [ %add.ptr.i, %if.end.i ], [ %4, %if.then.i ]
  %coerce.dive.i12 = getelementptr inbounds %struct.p4d_t, %struct.p4d_t* %retval.0.i, i64 0, i32 0
  %9 = load i64, i64* %coerce.dive.i12, align 8
  %and.i.i13 = and i64 %9, 4503599627366400
  %add.i.i14 = add i64 %and.i.i13, %1
  %10 = inttoptr i64 %add.i.i14 to %struct.pud_t*
  %coerce.dive.i16 = getelementptr %struct.pud_t, %struct.pud_t* %10, i64 511, i32 0
  %11 = load i64, i64* %coerce.dive.i16, align 8
  %tobool.i.i.i = icmp slt i64 %11, 0
  %..i.i.i = select i1 %tobool.i.i.i, i64 4503598553628672, i64 4503599627366400
  ret i64 %..i.i.i
}
jyknight updated this revision to Diff 273860. Jun 26 2020, 4:34 PM
jyknight marked 23 inline comments as done.

Differences in new patch, since previous one:

  • Folded in void's change (but I made it in the isSchedulingBoundary() functions, rather than the two callers).
  • Marked the INLINEASM_BR instruction as always having unmodeled side-effects, rather than only when explicitly marked so. (Not to fix any known miscompilation -- clang always marked it as such anyhow.)
  • Removed FIXME comments, per review comments.

After working on the previously mentioned unreachable-block-end cleanup for a bit (not done yet), and thinking about this patch more, I actually think it's okay to go ahead with this change now. The potential problem I was worried about: if you have an INLINEASM_BR in a basic block whose end is unreachable, and that block is followed by a basic block which is one of the indirect targets of the INLINEASM_BR, the code will think the first block falls through, when it does not.

I'm no longer concerned about this now, for two reasons:

  1. I currently _suspect_ it's not actually possible for this to happen, because of the way we restrict handling/merging blocks. It will never be created that way (because callbr is a terminator in IR), and I currently believe we'll never merge the blocks to create that situation, because the inlineasm_br block has multiple successors. I am not certain it can never happen, but at least I've not been able to induce it to occur. (This relates to another worry I had previously -- that we might end up with a block that has both a Call instruction that could throw, and an inlineasm_br. Or, for that matter, two inlineasm_br instructions in the same MachineBasicBlock. But, even though it's not a terminator in MI, I believe this will not occur.)
  2. Even if there _is_ some way that might happen, the badness that might occur is limited. Assuming there's a possible fallthrough, when it is in fact impossible, and the block is only an exceptional successor, seems unlikely to be a problem. We may insert an extraneous jump upon block rearrangement, for example, but that would not be truly problematic.

So, I think this is ready for another round of review and then to submit.

Thanks!

llvm/include/llvm/CodeGen/MachineBasicBlock.h
486

(we do actually have MachineBasicBlock::hasAddressTaken().)

I believe the change I have here resolves efriedma's concerns you've linked, by removing the mapping of source-block to target-block. It now overestimates, but should be conservatively correct, even in the face of instructions being moved/split/etc, because the _target_ cannot be so transformed.

llvm/lib/CodeGen/MachineBasicBlock.cpp
281–284

I'm worried that iterating over all the instructions in the block to retrieve this information would be bad for performance.

I don't think the overestimating here is likely to be a major optimization issue, but if it turns out to be, I'd like to revisit in a future patch.

288

I think it's correct -- in that it returns false strictly more often than it needs to. Just the same as we could sometimes hoist into a block which has an invoke in it, we could sometimes do so for an inlineasm_br.

llvm/lib/CodeGen/SplitKit.cpp
100

The code below, which is invoked when LIP.second is set, simply decides which of LIP.first or LIP.second to return. Letting it take the !LIP.second codepath has the same effect, with marginally less work.

llvm/lib/Target/Hexagon/BitTracker.cpp
958

Thanks for the explanation.

llvm/test/CodeGen/ARM/ifcvt-diamond-unanalyzable-common.mir
66

I changed the test to a different mechanism of testing unanalyzable branch sequences, since INLINEASM_BR doesn't trigger the problem anymore.

void accepted this revision. Jun 26 2020, 6:06 PM

Thanks! :-)

This revision is now accepted and ready to land. Jun 26 2020, 6:06 PM
nickdesaulniers marked an inline comment as done. Jun 29 2020, 2:18 PM

The only other concern I have is whether we have enough test coverage of the scheduling boundary changes. (Does removing the added checks there cause any existing or new test from the change to fail? If not, that seems like a lack of test coverage.) In the case of PostRASchedulerList, that implies a MIR test (writing MIR tests isn't something I'd wish on my worst enemy though).

llvm/include/llvm/CodeGen/MachineBasicBlock.h
477

Sorry to bikeshed the name of this method, but I find it currently doesn't match the implementation well.

I'd take hasInlineAsmBr to mean "this MachineBasicBlock has an INLINEASM_BR MCInst." Instead, the implementation is checking whether any of the successors of this MachineBasicBlock is the indirect target of an INLINEASM_BR MCInst. Those two aren't the same thing. In fact, you could have a MachineBasicBlock where the current implementation of hasInlineAsmBr would return true, and yet the MachineBasicBlock does not actually contain an INLINEASM_BR MCInst.

I know that's what you're alluding to in the comment, but I think the method should be named hasInlineAsmBrSuccessors or something of the sort. Maybe MaybeHasInlineAsmBr since it sounds like you'd rather it not be precise by scanning all instructions for the relevant opcode? (Couldn't ISEL mark these when creating the MBB? I guess having a bool member is tricky, since transforms would have to update that correctly. In this case, rematerializing the value by rescanning is simpler, and harder to get wrong.)

I guess the name sounds precise, though the comment and implementation denote that it's not, and that's what I'm most hung up on.

llvm/lib/CodeGen/BranchFolding.cpp
1710

was removing the !isEHPad check intentional?

llvm/lib/CodeGen/MachineBasicBlock.cpp
281–284

This could at least use a range-for:

for (const MachineBasicBlock *Succ : successors())
  if (Succ->isInlineAsmBrIndirectTarget())
    return true;
return false;
llvm/lib/CodeGen/MachineVerifier.cpp
584

should the report string be updated, too, to mention INLINEASM_BR fallthrough/default target?

llvm/lib/CodeGen/SplitKit.cpp
108

I wonder why we don't use a reverse const iterator here?

llvm/lib/Target/ARM/ARMBaseInstrInfo.cpp
2020 ↗(On Diff #273860)

I think there's one more override of this method that was missed: AArch64InstrInfo::isSchedulingBoundary in llvm/lib/Target/AArch64/AArch64InstrInfo.cpp? Ah, never mind, it defers to TargetInstrInfo::isSchedulingBoundary first.

llvm/test/CodeGen/X86/callbr-asm-outputs.ll
33

@void 's earlier comments mentioned modifications to this test case's movl $-1, %eax. Are those still needed?

void added a comment. Jun 29 2020, 2:32 PM

The only other concern I have is whether we have enough test coverage of the scheduling boundary changes. (Does removing the added checks there cause any existing or new test from the change to fail? If not, that seems like a lack of test coverage.) In the case of PostRASchedulerList, that implies a MIR test (writing MIR tests isn't something I'd wish on my worst enemy though).

The reason it wasn't failing before is that INLINEASM_BR was a terminator, and the boundary check automatically caught it. The existing test, llvm/test/CodeGen/X86/callbr-asm-outputs.ll, did fail though. So at least we're covered there.

Don't get me wrong, I'm always up for adding more tests, but I think this change is okay in that regard. :-)

void added inline comments. Jun 29 2020, 2:34 PM
llvm/test/CodeGen/X86/callbr-asm-outputs.ll
33

No, those are taken care of by the scheduling boundary check.

jyknight marked 14 inline comments as done. Jun 29 2020, 4:12 PM
jyknight added inline comments.
llvm/include/llvm/CodeGen/MachineBasicBlock.h
477

Renamed to mayHaveInlineAsmBr. I don't like "hasInlineAsmBrSuccessors", because the important property is whether there is such an instruction in the block. That the current implementation tests this by looking at the successors is just an implementation detail.

It would be nice for it to be precise, but as I mentioned in the other thread, it's probably not really worth the expense.
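The shape of the final query can be sketched with a toy model (this `Block` struct is a stand-in for illustration, not LLVM's MachineBasicBlock API): the result over-approximates "contains an INLINEASM_BR" but, assuming indirect targets are registered correctly, never misses a block that does contain one.

```cpp
#include <algorithm>
#include <vector>

struct Block {
  bool isIndirectTarget = false;    // set when some INLINEASM_BR targets us
  std::vector<const Block *> succs; // successor blocks

  // Conservative query: true if any successor is an indirect target, so it
  // can return true for a block holding no INLINEASM_BR at all, but it is
  // never false for a block whose INLINEASM_BR targets are registered.
  bool mayHaveInlineAsmBr() const {
    return std::any_of(succs.begin(), succs.end(), [](const Block *s) {
      return s->isIndirectTarget;
    });
  }
};
```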

llvm/lib/CodeGen/BranchFolding.cpp
1710

Yes, we no longer need to filter out EH pads, because it's now only looking at the analyzeBranch results CurFBB/CurTBB, not all successors.

llvm/lib/CodeGen/MachineVerifier.cpp
584

Maybe it should've before, but this change _removed_ the allowance for that.

llvm/lib/CodeGen/SplitKit.cpp
108

Who knows. =) Changed.

jyknight updated this revision to Diff 274281. Jun 29 2020, 4:12 PM
jyknight marked 4 inline comments as done.

Minor fixups.

nickdesaulniers accepted this revision. Jun 29 2020, 5:16 PM
This revision was automatically updated to reflect the committed changes.
MatzeB added a comment (edited). Oct 4 2021, 11:15 AM

So effectively we have an extended basic-block model in the LLVM backend now (blocks have 1 entry, but multiple exits that can be in the middle of the block). I know this was already the case for exception control-flow, but this here cemented it further. With that not being documented anywhere, and being different from LLVM IR/middleend, we are probably asking for trouble...

See also: https://bugs.llvm.org/show_bug.cgi?id=27659

We used to consider the invoke mechanism more of a bug/misdesign. But I guess without anyone cleaning this up it was just a matter of time for other uses like this to do the same :-(

void added a comment (edited). Oct 4 2021, 12:26 PM

So effectively we have an extended basic-block model in the LLVM backend now (blocks have 1 entry, but multiple exits that can be in the middle of the block). I know this was already the case for exception control-flow, but this here cemented it further. With that not being documented anywhere, and being different from LLVM IR/middleend, we are probably asking for trouble...

We took great pains to ensure that nothing could be moved past it, so that the semantics won't change and we can reload registers, etc.. Not being able to 100% model the control flow of an assembly block already existed with current assembly blocks. There's no guarantee that a given assembly block won't have branches, or loops, or calls, or exits. None of which is modeled in MIR.

We took great pains to ensure that nothing could be moved past it, so that the semantics won't change and we can reload registers, etc.

Yes, I see that this is going through great pain to ensure things still work. But this all feels to me like you are making a form of EBBs work here. There is a way to jump out of the basic block without having all instructions in it executed. There must be some cases where COPY or SPILL instructions end up after the branch now, which brings us into this trouble.

  • I realize that this was painful to model since LLVM was never prepared to have terminator instructions produce values. I guess we lack the time+expertise to fix this after the fact now :-(
  • The design was already messed up before INLINEASM_BR for exception handling.

So I don't know what to do here except rant and predict random bugs popping up over the years because intuition about how basic blocks work is broken.

Not being able to 100% model the control flow of an assembly block already existed with current assembly blocks. There's no guarantee that a given assembly block won't have branches, or loops, or calls, or exits. None of which is modeled in MIR.

Control flow within an assembly instruction is not a problem; similar to how a CALL instruction can also trigger arbitrary control flow, we can ignore that in the modeling since we have a guarantee that control eventually continues after the instruction.

Any thoughts on how to fix this without regressing support for asm goto w/ outputs?

void added a comment.Oct 4 2021, 1:16 PM

We took great pains to ensure that nothing could be moved past it, so that the semantics won't change and we can reload registers, etc.

Yes, I see that this is going through great pain to ensure things still work. But this all feels to me like you are making a form of EBBs work here. There is a way to jump out of the basic block without having all instructions in it executed. There must be some cases where COPY or SPILL instructions end up after the branch now, which brings us into this trouble.

This is a general issue with "asm goto", not just with "asm goto with outputs". Someone could stomp on registers, memory, and the like and leap to BFE without the necessary COPY / SPILL instructions to clean things up. We skirt around some of these issues by saying that outputs aren't valid on the indirect branch.

There's a reason why "asm goto" wasn't implemented in LLVM until we were forced to do it...

  • I realize that this was painful to model since LLVM was never prepared to have terminator instructions produce values. I guess we lack the time+expertise to fix this after the fact now :-(
  • The design was already messed up before INLINEASM_BR for exception handling.

So I don't know what to do here except rant and predict random bugs popping up over the years because intuition about how basic blocks work is broken.

One method to perhaps address your concerns would be to create a "terminating" COPY instruction, which would allow reloading registers after the ASM block. There were some sticky issues with it, and going this route was much easier. (Yes, I know that's not the best reason to do something, but...) However, the only issue a hypothetical terminating COPY would solve is instructions invalidly moving past the INLINEASM_BR.

I would *love* to fix this. Chris claims that MLIR would solve all of this. (nudge...nudge) :-)

MatzeB added a comment.Oct 4 2021, 1:40 PM

This is a general issue with "asm goto", not just with "asm goto with outputs". Someone could stomp on registers, memory, and the like and leap to BFE without the necessary COPY / SPILL instructions to clean things up. We skirt around some of these issues by saying that outputs aren't valid on the indirect branch.

Not sure I follow here... if someone stomps on registers in inline-asm (without declaring that in the inputs/outputs) then it's obviously a bug in the input and not something the compiler can fix.

Any thoughts on how to fix this without regressing support for asm goto w/ outputs?

For the record: I am not advocating reverting any patches at this point; I am obviously a year too late for the review, and, asm-goto aside, the same problem exists for exception control flow too. I just feel that we should be aware of what we are doing and call things by their name :)

So unless I am missing something here the whole problem is about terminator instructions producing values. Correct me if I am missing some other aspect of the discussion.

So about supporting outputs in terminator instructions: the tricky part is that regalloc typically expects to be able to place things after the definition of a value (typically spills or COPYs). The reason this is tricky is that for a terminator you are forced to place those things into the successor block(s) instead. Placing the operations into the successor blocks can in turn be tricky if the successors have multiple predecessors ("critical edges"), because then we can end up wrongfully executing the instructions when coming from the other predecessors. So that would require either being in a position to split the critical edges as needed, or having the register allocator be robust enough to deal with spills/COPYs being executed in more situations than desired in the case of critical edges. If I understand the asm-goto construct correctly, the asm labels have enough abstraction that breaking critical edges is always possible (you are forced to list all possible targets in the ASM instruction, the compiler can replace one block with another, and you cannot jump to arbitrary computed destinations, right?).

MatzeB added a comment.Oct 4 2021, 1:44 PM

One method to perhaps address your concerns would be to create a "terminating" COPY instruction, which would allow reloading registers after the ASM block. There were some sticky issues with it, and going this route was much easier. (Yes, I know that's not the best reason to do something, but...) However, the only issue a hypothetical terminating COPY would solve is instructions invalidly moving past the INLINEASM_BR.

But that still would mean we end up with terminator instructions that assign registers, so we're still in this regalloc grey-area because that is technically not supported.

void added a comment.Oct 4 2021, 1:56 PM

One method to perhaps address your concerns would be to create a "terminating" COPY instruction, which would allow reloading registers after the ASM block. There were some sticky issues with it, and going this route was much easier. (Yes, I know that's not the best reason to do something, but...) However, the only issue a hypothetical terminating COPY would solve is instructions invalidly moving past the INLINEASM_BR.

But that still would mean we end up with terminator instructions that assign registers, so we're still in this regalloc grey-area because that is technically not supported.

Yup! You identified the issue we faced. :-)

The design was already messed up before INLINEASM_BR for exception handling.

To me, this is the main point.

Given the existing handling for invoke, I still believe that using the same underlying mechanisms for inlineasm_br was the best implementation choice, because inlineasm_br behaves almost exactly like an invoke with a user-defined custom calling convention.

If you can come up with a proposal to handle invoke differently, then I expect inlineasm_br should fit easily into that.