Jump tables for most targets cannot handle out of range indices by
themselves, so LLVM emits range checks to guard the jump
tables. WebAssembly, on the other hand, implements jump tables using
the br_table instruction, which takes a default branch target as an
operand, making the range checks redundant. This patch introduces a
new MachineFunction pass in the WebAssembly backend to find and
eliminate the redundant range checks.
Details
Diff Detail
- Repository
- rG LLVM Github Monorepo
Event Timeline
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp | ||
---|---|---|
120 | Lies! | |
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll | ||
652–654 | Can this be CHECK-NEXT too now? So we know the br_if is not there? | |
llvm/test/CodeGen/WebAssembly/switch-unreachable-default.ll | ||
8 | So if the default label contains unreachable then we end up in some special case where that target doesn't actually exist in the final output, and we instead just arbitrarily jump to any other the labels? Is this standard llvm behaviour? Is is that default target in this case eliminated by some earlier pass? |
I didn't know about this optimization opportunity! Please make sure to run emscripten tests before landing because it contains some new assumptions.
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp | ||
---|---|---|
46 | Nit: It might help readers if there's a comment that prior to this function, br_table is missing the last (= default) operand, which will be added here | |
48 | Nit: This might be more intuitive: assert(MBB->pred_size() == 1); By the way, is it guaranteed that we always have a single predecessor that is a guard BB, which ends with a conditional branch when the default BB is reachable and no branch when it is unreachable? Are other patterns possible? | |
60 | Is this guaranteed that br_table BB is always a fallthrough, so the operand 0 is the default target? Can they be swapped? | |
65 | Would there be performance implications based on what we choose arbitrarily for the default target, in case there's no guard hints? (I don't know and I'm just asking) If we are unsure, can it be safer to choose the first one, as we are doing now? | |
76 | Why is this process (checking if Succ is a successor of HeaderMBB and deleting it from the successors) necessary? | |
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll | ||
652–654 | Or we can explicitly check if it's not there, like CHECK-NOT: br_if ... |
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp | ||
---|---|---|
60 | If it is somehow guaranteed, it is fine, but if we are unsure, we can compare br_if's argument to MBB to figure it out. |
llvm/lib/Target/WebAssembly/WebAssemblyFixBrTableDefaults.cpp | ||
---|---|---|
46 | Done. | |
48 | Done. There is always a single guard BB that contains a conditional branch to the default target if the default target exists. Then the guard BB either falls through or has an unconditional jump to the jump table BB. | |
60 | They would only be swapped if something in between SelectionDAG building and this pass swapped them, but I'll add an assert in case some later change swaps the order. | |
65 | I suppose it might affect register allocation or something, but I'm not sure. I'll switch to using the first target. | |
76 | We need to remove the shared successors first because otherwise transferSuccessorsAndUpdatePHIs could introduce duplicate successors, which is invalid. I'll add a comment to that effect. | |
120 | Oops! | |
llvm/test/CodeGen/WebAssembly/cfg-stackify.ll | ||
652–654 | Used the CHECK-NOT because IIRC there's a bunch of extra junk in there preventing us from using CHECK-NEXT. | |
llvm/test/CodeGen/WebAssembly/switch-unreachable-default.ll | ||
8 | Correct. The target-independent code doesn't emit any jump to a default target to begin with when the default is unreachable, so on other targets if the default did turn out to be reachable at runtime, then there would be an out-of-bounds access to the jump table. The relevant code that creates the jump to the default block is here: https://github.com/llvm/llvm-project/blob/f027cfa37e6757bb2d17ac3bea944df4e06bcff4/llvm/lib/CodeGen/SelectionDAG/SelectionDAGBuilder.cpp#L2475-L2482. |
Hi,
This made our ubsan bots red. Please fix or revert ASAP
http://lab.llvm.org:8011/builders/sanitizer-x86_64-linux-fast/builds/42256
Same for our downstream bots. When we build llc with gcc 7.4 we get crashes when running the following tests:
08:01:10 Failing Tests (5):
08:01:10 LLVM :: CodeGen/WebAssembly/cfg-stackify.ll
08:01:10 LLVM :: CodeGen/WebAssembly/indirectbr.ll
08:01:10 LLVM :: CodeGen/WebAssembly/stack-insts.ll
08:01:10 LLVM :: CodeGen/WebAssembly/switch-unreachable-default.ll
08:01:10 LLVM :: CodeGen/WebAssembly/switch.ll
For some reason I have not been able to reproduce these failure locally. I am reverting the change pending further investigation. @kcc what compiler version is your bot using?
I was testing the performance of a program with a big switch statement in a loop, a very common pattern in C/C++ and I came across the problem with needless range checks with br_table that this patch is fixing (Sweet! :-))
However, testing the latest version of Emscripten, with this fix, I am finding that Clang now emits "worse" bytecode for my small test program (attached{F12106893}, compiled with emcc -O3 micro-interp.c -o micro-interp.js).
The code is like:
enum Op { A = 0, B, C, D, E, F, G, H, I, J, K, L }; int f(Op* ops, int len) { int result = 0; for (int i = 0; i < len; i++) { Op op = ops[i]; switch (op) { case A: { ... break; } case B: { ... break; } case C: { ... break; } ... default: { ... break; } } } return result; }
Before this change, the function was compiled into this Wasm code, with a single br_table that was not using its default label (see WAT file attached:
):(func (;11;) (type 2) (param i32) (local i32 i32 i32 i32 i32 i64) ... block ;; label = @1 block ;; label = @2 loop ;; label = @3 local.get 4 i32.load16_u local.tee 3 i32.const 11 i32.gt_u br_if 1 (;@2;) block ;; label = @4 block ;; label = @5 block ;; label = @6 block ;; label = @7 block ;; label = @8 block ;; label = @9 block ;; label = @10 block ;; label = @11 block ;; label = @12 block ;; label = @13 block ;; label = @14 local.get 3 br_table 10 (;@4;) 0 (;@14;) 1 (;@13;) 2 (;@12;) 3 (;@11;) 4 (;@10;) 5 (;@9;) 6 (;@8;) 7 (;@7;) 8 (;@6;) 13 (;@1;) 9 (;@5;) 10 (;@4;) end // case 1 end // case 2 end ...
The x64 code jitted by V8 for the switch was reasonably compact, though not-optimal:
00000000BD10FA43 83 460fb73423 movzxwl r14,[rbx+r12*1] 00000000BD10FA48 88 4183fe0b cmpl r14,0xB 00000000BD10FA4C 8c 0f8741020000 ja 00000000BD10FC93 // jmp to default case 00000000BD10FA52 92 4183ee0 subl r14,0x1 00000000BD10FA56 96 458bf6 movl r14,r14 00000000BD10FA59 99 4183fe0b cmpl r14,0xB 00000000BD10FA5D 9d 0f830d000000 jnc 00000000BD10FA70 // jmp to br_table default label 00000000BD10FA63 a3 4c8d1556030000 leaq r10,[rip+0x356] 00000000BD10FA6A aa 43ff24f2 jmp [r10+r14*8] // br_table jump
There were two checks (cmp/jmp), the first for the switch default case, the second for the implementation of br_table.
But now with this change I see this code being generated (see WAT file attached:
)(func (;6;) (type 6) (param i32) (local i32 i32 i32 i32 i64 i32) ... block ;; label = @1 block ;; label = @2 block ;; label = @3 block ;; label = @4 block ;; label = @5 block ;; label = @6 block ;; label = @7 block ;; label = @8 block ;; label = @9 block ;; label = @10 block ;; label = @11 block ;; label = @12 block ;; label = @13 block ;; label = @14 local.get 4 i32.load16_u br_table 1 (;@13;) 0 (;@14;) 2 (;@12;) 3 (;@11;) 4 (;@10;) 5 (;@9;) 6 (;@8;) 7 (;@7;) 8 (;@6;) 9 (;@5;) 13 (;@1;) 10 (;@4;) 12 (;@2;) end i32.const 0 local.set 3 br 10 (;@3;) end i32.const 10 local.set 3 br 9 (;@3;) end i32.const 1 local.set 3 br 8 (;@3;) end i32.const 2 local.set 3 br 7 (;@3;) end i32.const 3 local.set 3 br 6 (;@3;) end i32.const 4 local.set 3 br 5 (;@3;) end i32.const 5 local.set 3 br 4 (;@3;) end i32.const 6 local.set 3 br 3 (;@3;) end i32.const 7 local.set 3 br 2 (;@3;) end i32.const 8 local.set 3 br 1 (;@3;) end i32.const 9 local.set 3 end loop ;; label = @3 block ;; label = @4 block ;; label = @5 block ;; label = @6 block ;; label = @7 block ;; label = @8 block ;; label = @9 block ;; label = @10 block ;; label = @11 block ;; label = @12 block ;; label = @13 block ;; label = @14 block ;; label = @15 block ;; label = @16 block ;; label = @17 block ;; label = @18 block ;; label = @19 block ;; label = @20 block ;; label = @21 block ;; label = @22 block ;; label = @23 block ;; label = @24 block ;; label = @25 local.get 3 br_table 0 (;@25;) 1 (;@24;) 2 (;@23;) 3 (;@22;) 4 (;@21;) 5 (;@20;) 6 (;@19;) 7 (;@18;) 8 (;@17;) 9 (;@16;) 10 (;@15;) 10 (;@15;) end // ... case 0 ... br_table 19 (;@5;) 20 (;@4;) 10 (;@14;) 11 (;@13;) 12 (;@12;) 13 (;@11;) 14 (;@10;) 15 (;@9;) 16 (;@8;) 17 (;@7;) 23 (;@1;) 18 (;@6;) 22 (;@2;) end // ... case 1 br_table 18 (;@5;) 19 (;@4;) 9 (;@14;) 10 (;@13;) 11 (;@12;) 12 (;@11;) 13 (;@10;) 14 (;@9;) 15 (;@8;) 16 (;@7;) 22 (;@1;) 17 (;@6;) 21 (;@2;) end ...
There is a br_table that causes an indirect jump to a stub like i32.const X local.set 3 br Y that jumps to the actual code for the case branches, which all end with other strange br_tables.
Obviously also the native code jitted is much more convoluted. The result is that my small benchmark is 34% slower (33.9 sec vs 25.2).
I see that this issue disappears if I comment away addPass(createWebAssemblyFixBrTableDefaults()); in WebAssemblyPassConfig::addInstSelector() and reintroduce Ops.push_back(DAG.getBasicBlock(MBBs[0])) in WebAssemblyTargetLowering::LowerBR_JT().
Maybe there is something wrong with my configuration, in my machine, but I double-tested this reinstalling the latest Emscripten, and can reliably reproduce the problem.
Could you take a look?
@paolosev Hmm, I suspect that there is a bad interaction between the code with the range check removed and the algorithm that lays out and structures the basic blocks. I will take a closer look!
Also notice, if it can help, that this problem happens when compiling with -O3 but not with -Os.
Nit: It might help readers if there's a comment that prior to this function, br_table is missing the last (= default) operand, which will be added here